One-Sentence Image Matting! DiffSynth Open Sources Text-Guided Image Layer Separation Model
We have trained and open-sourced a "text-guided image layer separation model." Given an input image, you can describe in text the layers you wish to extract, and the model will automatically separate the corresponding layers while inpainting and reconstructing the occluded regions.
Model link: https://huggingface.co/DiffSynth-Studio/Qwen-Image-Layered-Control
Technical Approach
Base Model
Recently, the Qwen-Image team open-sourced the model Qwen-Image-Layered (https://modelscope.cn/models/Qwen/Qwen-Image-Layered), which is capable of decomposing an image into multiple layers.
This is a highly innovative model. However, its output is not controllable: we cannot specify what each individual layer should contain. We therefore decided to build a new controllable generation model on this foundation, one that separates specific layers of image content according to a textual description provided as input.
Dataset
The open-source community has already accumulated rich datasets with layered image annotations. We adopted the dataset artplus/PrismLayersPro (https://modelscope.cn/datasets/artplus/PrismLayersPro).
This dataset contains approximately twenty thousand images along with their respective layers and associated textual descriptions—sufficient data volume for training a controllable generation model.
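For a quick look at the data, the snippet below is a minimal sketch of iterating such samples with the ModelScope SDK. The record field names used here ("image", "layer", "caption") and the split name are assumptions for illustration only; please check the dataset card for the actual schema:

# A minimal sketch of browsing PrismLayersPro with the ModelScope SDK.
# NOTE: the record field names and the split name below are hypothetical;
# consult the dataset card for the real schema before training.
from modelscope.msdatasets import MsDataset

ds = MsDataset.load("artplus/PrismLayersPro", split="train")
for i, sample in enumerate(ds):
    if i >= 3:  # peek at a few records only
        break
    # Each sample is expected to pair a composite image with one of its
    # layers and a text description of that layer's content.
    print(sample.get("caption"), sample.get("image"), sample.get("layer"))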
Input and Output Format
The base model Qwen-Image-Layered takes one image and one text prompt as input and outputs multiple layers.
Regarding the input: in the original model, the text prompt describes the entire image. It is only consumed by downstream editing models and is effectively redundant in the currently released version, so we repurposed this text field to describe the content of the layer to be separated.
Regarding the output: the original model generates all layers simultaneously, which significantly increases inference cost. To improve efficiency, we changed the output to a single layer per call, namely the layer that matches the given text description.
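Put differently, a single call now maps one (input image, layer description) pair to exactly one output layer; decomposing an image into several layers simply means issuing several calls with different descriptions. Below is a minimal sketch of this contract, written against the inference API shown later in this post (the wrapper separate_layer is ours, not a DiffSynth-Studio function):

# Illustrative wrapper around the pipeline call shown in the inference
# example below; `separate_layer` is not part of DiffSynth-Studio.
def separate_layer(pipe, image, layer_description, seed=0):
    images = pipe(
        layer_description,        # the text describes the layer to extract
        seed=seed,
        num_inference_steps=30, cfg_scale=4,
        height=1024, width=1024,
        layer_input_image=image,  # the composite image to decompose
        layer_num=0,
    )
    return images[0]              # one layer is produced per call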
Model Performance
We trained the model for three days on 8*A100 GPUs until convergence. Below are some inference results.
Example One
This example comes from the training dataset, where the original data splits the image into just four layers: text, skeleton, clouds, and background. Our model enables finer-grained decomposition and even allows control over which parts of the text to extract.
Input image:
Example Two
This example features an adorable anime-style girl. We tested the model's performance in anime scenarios to assist ACG artists in creative workflows.
Input image:
| Prompt | Output Image |
|---|---|
| Blue sky, white clouds, a garden filled with colorful flowers | (output image) |
| Girl, flower wreath, kitten | (output image) |
| Girl, kitten | (output image) |
| Colorful, intricate flower wreath | (output image) |
Example Three
This example is a Chinese-language poster. Although our training data does not include Chinese text, the trained model still inherits Chinese comprehension capabilities from the base model.
Input image:
Model Inference and Training
The model is trained based on DiffSynth-Studio (https://github.com/modelscope/DiffSynth-Studio). Please install DiffSynth-Studio first:
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
Model inference:
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch, requests

# Load the layer-separation transformer together with the Qwen-Image text
# encoder, the layered VAE, and the Qwen-Image-Edit processor.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)

# The prompt describes the layer to be separated from the input image.
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

# Generate the single layer that matches the prompt.
images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("image.png")
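Because each call returns a single layer, extracting several layers from the same image is just a matter of repeating the call with different descriptions. The loop below sketches that pattern, reusing pipe and input_image from the example above; the prompts are illustrative and not taken from any official script:

# Sketch: extract several layers from the same input by varying the prompt.
# Reuses `pipe` and `input_image` from the inference example above; the
# prompts are illustrative and can be replaced freely.
layer_prompts = [
    "A cartoon skeleton character wearing a purple hat and holding a gift box",
    "White clouds",
    "Background",
]
for i, layer_prompt in enumerate(layer_prompts):
    layer = pipe(
        layer_prompt,
        seed=0,
        num_inference_steps=30, cfg_scale=4,
        height=1024, width=1024,
        layer_input_image=input_image,
        layer_num=0,
    )[0]
    layer.save(f"layer_{i}.png")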
Model training:
- Full training script: https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/qwen_image/model_training/full/Qwen-Image-Layered-Control.sh
- LoRA training script: https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/qwen_image/model_training/lora/Qwen-Image-Layered-Control.sh
- Sample dataset: https://modelscope.cn/datasets/DiffSynth-Studio/example_image_dataset/tree/master/layer
Future Work
We are currently training another mysterious controllable generation model—stay tuned!