One-Sentence Image Matting! DiffSynth Open Sources Text-Guided Image Layer Separation Model
We have trained and open-sourced a "text-guided image layer separation model." Given an input image, you can describe in text the layers you wish to extract, and the model will automatically separate the corresponding layers while inpainting and reconstructing the occluded regions.
Model link: https://huggingface.co/DiffSynth-Studio/Qwen-Image-Layered-Control
Technical Approach
Base Model
Recently, the Qwen-Image team open-sourced the model Qwen-Image-Layered (https://modelscope.cn/models/Qwen/Qwen-Image-Layered), which is capable of decomposing an image into multiple layers.
This is a highly innovative model. However, its output is not controllable: we cannot specify what each individual layer should contain. We therefore decided to build a new controllable generation model on this foundation, one that separates specific layers of image content according to a textual description provided as input.
Dataset
The open-source community has already accumulated rich datasets with layered image annotations. We adopted the dataset artplus/PrismLayersPro (https://modelscope.cn/datasets/artplus/PrismLayersPro).
This dataset contains approximately twenty thousand images along with their respective layers and associated textual descriptions—sufficient data volume for training a controllable generation model.
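For a quick look at the data, the snippet below is a minimal sketch of iterating such samples with the ModelScope SDK. The record field names used here ("image", "layer", "caption") and the split name are assumptions for illustration only; please check the dataset card for the actual schema:

# A minimal sketch of browsing PrismLayersPro with the ModelScope SDK.
# NOTE: the record field names and the split name below are hypothetical;
# consult the dataset card for the real schema before training.
from modelscope.msdatasets import MsDataset

ds = MsDataset.load("artplus/PrismLayersPro", split="train")
for i, sample in enumerate(ds):
    if i >= 3:  # peek at a few records only
        break
    # Each sample is expected to pair a composite image with one of its
    # layers and a text description of that layer's content.
    print(sample.get("caption"), sample.get("image"), sample.get("layer"))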
Input and Output Format
The base model Qwen-Image-Layered takes one image and one text prompt as input and outputs multiple layers.
Regarding the input: in the original model, the text prompt describes the entire image. It is only consumed by downstream editing models and is effectively redundant in the currently released version, so we repurposed this text field to describe the content of the layer to be separated.
Regarding the output: the original model generates all layers simultaneously, which significantly increases inference cost. To improve efficiency, we changed the output to a single layer per call, namely the layer that matches the given text description.
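Put differently, a single call now maps one (input image, layer description) pair to exactly one output layer; decomposing an image into several layers simply means issuing several calls with different descriptions. Below is a minimal sketch of this contract, written against the inference API shown later in this post (the wrapper separate_layer is ours, not a DiffSynth-Studio function):

# Illustrative wrapper around the pipeline call shown in the inference
# example below; `separate_layer` is not part of DiffSynth-Studio.
def separate_layer(pipe, image, layer_description, seed=0):
    images = pipe(
        layer_description,        # the text describes the layer to extract
        seed=seed,
        num_inference_steps=30, cfg_scale=4,
        height=1024, width=1024,
        layer_input_image=image,  # the composite image to decompose
        layer_num=0,
    )
    return images[0]              # one layer is produced per call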
Model Performance
We trained the model for three days on 8*A100 GPUs until convergence. Below are some inference results.
Example One
This example comes from the training dataset, where the original data splits the image into just four layers: text, skeleton, clouds, and background. Our model enables finer-grained decomposition and even allows control over which parts of the text to extract.
Input image:
Example Two
This example features an adorable anime-style girl. We tested the model's performance in anime scenarios to assist ACG artists in creative workflows.
Input image:
| Prompt | Output Image |
|---|---|
| Blue sky, white clouds, a garden filled with colorful flowers | (output image) |
| Girl, flower wreath, kitten | (output image) |
| Girl, kitten | (output image) |
| Colorful, intricate flower wreath | (output image) |
Example Three
This example is a Chinese-language poster. Although our training data does not include Chinese text, the trained model still inherits Chinese comprehension capabilities from the base model.
Input image:
Model Inference and Training
The model is trained based on DiffSynth-Studio (https://github.com/modelscope/DiffSynth-Studio). Please install DiffSynth-Studio first:
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
Model inference:
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch, requests

# Load the layer-separation transformer together with the Qwen-Image text
# encoder, the layered VAE, and the Qwen-Image-Edit processor.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)

# The prompt describes the layer to be separated from the input image.
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

# Generate the single layer that matches the prompt.
images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("image.png")
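Because each call returns a single layer, extracting several layers from the same image is just a matter of repeating the call with different descriptions. The loop below sketches that pattern, reusing pipe and input_image from the example above; the prompts are illustrative and not taken from any official script:

# Sketch: extract several layers from the same input by varying the prompt.
# Reuses `pipe` and `input_image` from the inference example above; the
# prompts are illustrative and can be replaced freely.
layer_prompts = [
    "A cartoon skeleton character wearing a purple hat and holding a gift box",
    "White clouds",
    "Background",
]
for i, layer_prompt in enumerate(layer_prompts):
    layer = pipe(
        layer_prompt,
        seed=0,
        num_inference_steps=30, cfg_scale=4,
        height=1024, width=1024,
        layer_input_image=input_image,
        layer_num=0,
    )[0]
    layer.save(f"layer_{i}.png")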
Model training:
- Full training script: https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/qwen_image/model_training/full/Qwen-Image-Layered-Control.sh
- LoRA training script: https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/qwen_image/model_training/lora/Qwen-Image-Layered-Control.sh
- Sample dataset: https://modelscope.cn/datasets/DiffSynth-Studio/example_image_dataset/tree/master/layer
Future Work
We are currently training another mysterious controllable generation model—stay tuned!