Z-Image (造相) is a powerful and highly efficient image generation model with 6B parameters, developed by Alibaba’s Tongyi Lab. It uses a Scalable Single-Stream DiT (S3-DiT) architecture where text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream, maximizing parameter efficiency. Z-Image (Base) is the non-distilled foundation model designed for community-driven fine-tuning and custom development.
Model Highlights:
- Photorealistic Quality: Delivers strong photorealistic image generation while maintaining excellent aesthetic quality
- Accurate Bilingual Text Rendering: Excels at accurately rendering complex Chinese and English text
- Prompt Enhancing & Reasoning: Prompt Enhancer empowers the model with reasoning capabilities
- Fine-tuning Ready: Ideal base model for custom training and adaptation
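To make the single-stream idea from the introduction concrete, here is a minimal toy sketch of sequence-level concatenation. It is not the model's actual implementation: the function name `build_single_stream`, the token counts, and the embedding width of 4 are all hypothetical, and plain Python lists stand in for tensors.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def build_single_stream(
    text_tokens: List[Vector],
    semantic_tokens: List[Vector],
    vae_tokens: List[Vector],
) -> Tuple[List[Vector], Dict[str, Tuple[int, int]]]:
    """Join three token groups along the sequence axis into one stream.

    Returns the combined sequence and the [start, end) span of each group,
    so the image tokens can be located again after processing.
    """
    groups = {"text": text_tokens, "semantic": semantic_tokens, "vae": vae_tokens}
    stream: List[Vector] = []
    spans: Dict[str, Tuple[int, int]] = {}
    for name, tokens in groups.items():
        start = len(stream)
        stream.extend(tokens)
        spans[name] = (start, len(stream))
    # One shared stream only works if every token has the same width.
    if len({len(vector) for vector in stream}) > 1:
        raise ValueError("all tokens must share one embedding width")
    return stream, spans


# Toy inputs (hypothetical sizes): 3 text, 2 semantic, and 5 VAE tokens.
text = [[0.0] * 4 for _ in range(3)]
semantic = [[1.0] * 4 for _ in range(2)]
vae = [[2.0] * 4 for _ in range(5)]

stream, spans = build_single_stream(text, semantic, vae)
print(len(stream), spans)
```

Because every modality lives in the same sequence, a single set of transformer blocks can process all of them, which is the parameter-efficiency argument made above.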
Z-Image text-to-image workflow
Download Workflow
Download the Z-Image text-to-image workflow JSON file.
Run on ComfyUI Cloud
Run this workflow directly on ComfyUI Cloud.
Z-Image model downloads
- qwen_3_4b.safetensors: Text encoder for Z-Image.
- z_image_bf16.safetensors: Diffusion model for Z-Image.
- ae.safetensors: VAE for Z-Image.
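After downloading, it can help to confirm that each file is where the workflow will look for it. The sketch below assumes ComfyUI's usual `models/` subfolders (`text_encoders`, `diffusion_models`, `vae`); this page does not state the target folders, so treat the mapping and the helper name `missing_model_files` as assumptions and adjust them to your installation.

```python
from pathlib import Path
from typing import Dict, List

# Assumption: subfolder names follow ComfyUI's common models/ layout.
EXPECTED_FILES: Dict[str, str] = {
    "qwen_3_4b.safetensors": "text_encoders",
    "z_image_bf16.safetensors": "diffusion_models",
    "ae.safetensors": "vae",
}


def missing_model_files(comfyui_root: Path) -> List[Path]:
    """Return the expected model paths that do not exist under comfyui_root."""
    models_dir = comfyui_root / "models"
    missing: List[Path] = []
    for filename, subfolder in EXPECTED_FILES.items():
        path = models_dir / subfolder / filename
        if not path.is_file():
            missing.append(path)
    return missing


if __name__ == "__main__":
    # "ComfyUI" is a placeholder for your own installation directory.
    for path in missing_model_files(Path("ComfyUI")):
        print(f"missing: {path}")
```

If the script prints nothing, all three files were found in the assumed locations.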