Qwen-Image-Layered is a model developed by Alibaba’s Qwen team that decomposes an image into multiple RGBA layers. This layered representation unlocks inherent editability: each layer can be independently manipulated without affecting other content.

Key Features:
  • Inherent Editability: Each layer can be independently manipulated without affecting other content
  • High-Fidelity Elementary Operations: Supports resizing, repositioning, and recoloring with physical isolation of semantic components
  • Variable-Layer Decomposition: Not limited to a fixed number of layers - decompose into 3, 4, 8, or more layers as needed
  • Recursive Decomposition: Any layer can be further decomposed, enabling infinite decomposition depth
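Because the decomposition yields a stack of RGBA layers, the final image is simply the alpha-composite of that stack. A minimal Pillow sketch of recombining layers after editing one of them (the `flatten_layers` helper and its inputs are illustrative, not part of the model's API):

```python
from PIL import Image

def flatten_layers(layers):
    """Alpha-composite a stack of RGBA layers (bottom layer first)
    into a single flattened image."""
    canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
    for layer in layers:
        canvas = Image.alpha_composite(canvas, layer)
    return canvas
```

Since each layer is a separate RGBA image, recoloring or repositioning one layer and re-compositing leaves all other content untouched, which is what makes the elementary operations above lossless.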
Qwen-Image-Layered workflow

Download JSON Workflow File

Make sure your ComfyUI is updated. The workflows in this guide can be found in the Workflow Templates; if you can’t find them there, your ComfyUI may be outdated (the Desktop version’s updates may lag behind). If nodes are missing when loading a workflow, possible reasons:
  1. You are not using the latest (nightly) ComfyUI version
  2. Some nodes failed to import at startup
Model Storage Location

Download the model files (text_encoders, diffusion_models, vae) and place them as shown below:
📂 ComfyUI/
├── 📂 models/
│   ├── 📂 text_encoders/
│   │      └── qwen_2.5_vl_7b_fp8_scaled.safetensors
│   ├── 📂 diffusion_models/
│   │      └── qwen_image_layered_bf16.safetensors
│   └── 📂 vae/
│          └── qwen_image_layered_vae.safetensors
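Assuming a standard ComfyUI install, the expected subdirectories can be created from the folder that contains `ComfyUI/` before dropping the files in (a sketch; adjust the paths if you map models elsewhere, e.g. via `extra_model_paths.yaml`):

```shell
# Create the model subdirectories the workflow expects
# (run from the directory containing your ComfyUI folder).
mkdir -p ComfyUI/models/text_encoders
mkdir -p ComfyUI/models/diffusion_models
mkdir -p ComfyUI/models/vae
```

`mkdir -p` is safe to re-run: it creates any missing parent directories and does nothing if they already exist.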

FP8 version

By default the workflow uses the bf16 weights, which require high VRAM. For lower VRAM usage, download the fp8 version of the diffusion model, then update the Load Diffusion Model node inside the subgraph to use it.

Workflow settings

Sampler settings

This model is slow. The original sampling settings are steps: 50 and CFG: 4.0, which will at least double the generation time compared to workflows with lower step counts.

Input size

For input size, 640px is recommended. Use 1024px for high-resolution output.
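To match the recommended input size, you can scale the source image so its longest edge hits the target before loading it into the workflow. A minimal Pillow sketch (the helper name is ours; the 640px default comes from the recommendation above):

```python
from PIL import Image

def resize_longest_edge(img, target=640):
    """Scale an image so its longest edge equals `target` pixels,
    preserving the aspect ratio."""
    w, h = img.size
    scale = target / max(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
```

Pass `target=1024` instead when you want the high-resolution variant.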

Prompt (optional)

The text prompt is intended to describe the overall content of the input image—including elements that may be partially occluded (e.g., you may specify the text hidden behind a foreground object). It is not designed to control the semantic content of individual layers explicitly.