PixelDiT is NVIDIA’s pixel-space diffusion transformer for 1024px text-to-image generation. Unlike diffusion models that operate in latent space, PixelDiT generates images directly in pixel space using a dual-level DiT architecture (a patch-level DiT combined with a pixel-level DiT), with MM-DiT fusion for joint attention between text and image tokens.

Model Highlights:
  • VAE-free — generates directly in pixel space; no traditional VAE encode/decode
  • Dual-level DiT — patch-level DiT + pixel-level DiT for high-quality generation
  • Multi aspect ratio — 1024px base resolution with support for several aspect ratios
  • ~1.3B parameters — efficient enough for consumer GPUs
  • License: NSCLv1 (non-commercial research/evaluation only)

PixelDiT text-to-image workflow

Download Workflow

Download JSON or search “PixelDiT” in Template Library
Make sure your ComfyUI is up to date. The workflows in this guide are available in the Workflow Templates. If you can’t find them there, your ComfyUI may be outdated (Desktop releases can lag behind the latest version).

If nodes are missing when loading a workflow, possible reasons:
  1. You are not using the latest ComfyUI version (Nightly version)
  2. Some nodes failed to import at startup
The workflow consists of three main nodes:
  1. ResolutionSelector — choose your desired output resolution
  2. Text to Image (PixelDiT) subgraph — the core generation node with exposed controls for prompt, seed, model selection and resolution
  3. SaveImage — saves the generated image
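
To check which node types a downloaded workflow expects (useful when nodes are reported missing), the JSON file can be inspected directly. The following is a minimal sketch, assuming the exported workflow stores its nodes in a top-level `nodes` list where each entry has a `type` field; nodes nested inside a subgraph may be stored elsewhere and are not inspected here. The helper names and the example data are hypothetical.

```python
import json
from collections import Counter


def load_workflow(path):
    """Read a workflow JSON file exported from ComfyUI into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_node_types(workflow):
    """Count the `type` of each top-level node (assumed layout, see lead-in)."""
    nodes = workflow.get("nodes", [])
    return Counter(node.get("type", "<unknown>") for node in nodes)


# Tiny stand-in for a real workflow file; the type strings are illustrative.
example = {
    "nodes": [
        {"id": 1, "type": "ResolutionSelector"},
        {"id": 2, "type": "SaveImage"},
    ]
}

for node_type, count in sorted(count_node_types(example).items()):
    print(f"{node_type}: {count}")
```

For a real file, pass the result of `load_workflow("your_workflow.json")` to `count_node_types` and compare the listed types with the nodes your ComfyUI install provides.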

Learn about Subgraph

This workflow uses Subgraph nodes for modular processing. Check out the Subgraph documentation to learn how to customize and extend the workflow.

Workflow controls

The exposed controls on the Text to Image (PixelDiT) subgraph node include:
| Control | Description |
| --- | --- |
| Positive Prompt | The text prompt describing the image you want to generate |
| Negative Prompt | Text describing what to avoid in the generated image |
| Seed | Random seed for reproducibility |
| UNet Model | PixelDiT model checkpoint selection |
| CLIP Model | Text encoder model selection |

Model downloads

PixelDiT uses two model files: a text encoder and the diffusion model.

Text Encoder

gemma_2_2b_it_elm_bf16.safetensors — Gemma-2-2B-IT text encoder

Diffusion Model

pixeldit_1300m_1024px_bf16.safetensors — PixelDiT 1300M 1024px diffusion model
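
As a rough sanity check on the download size, the weight footprint can be estimated from the parameter count and the bf16 storage format (2 bytes per parameter). This is a back-of-the-envelope sketch, not a measured figure: it assumes exactly 1.3 billion parameters, ignores file metadata, and does not include the text encoder or the activation memory needed at inference time.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def approx_weight_gib(num_params, bytes_per_param=BYTES_PER_BF16):
    """Approximate size of the raw weights in GiB (1 GiB = 1024**3 bytes)."""
    return num_params * bytes_per_param / 1024**3


# Assumption: "1300m" in the file name means about 1.3e9 parameters.
pixeldit_gib = approx_weight_gib(1.3e9)
print(f"PixelDiT weights: ~{pixeldit_gib:.1f} GiB")  # prints ~2.4 GiB
```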

Model storage location

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 text_encoders/
│   │      └── gemma_2_2b_it_elm_bf16.safetensors
│   └── 📂 diffusion_models/
│          └── pixeldit_1300m_1024px_bf16.safetensors
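
To confirm both files ended up in the folders shown above, a small script can check the expected paths. This is a minimal sketch: the `find_missing_models` helper is hypothetical, and the `"ComfyUI"` root passed at the bottom is an assumption about where your install lives, so adjust it to your own path.

```python
from pathlib import Path

# Subfolder under ComfyUI/models/ mapped to the file expected there.
EXPECTED_FILES = {
    "text_encoders": "gemma_2_2b_it_elm_bf16.safetensors",
    "diffusion_models": "pixeldit_1300m_1024px_bf16.safetensors",
}


def find_missing_models(comfyui_root):
    """Return the expected model paths that do not exist as files."""
    models_dir = Path(comfyui_root) / "models"
    missing = []
    for subfolder, filename in EXPECTED_FILES.items():
        path = models_dir / subfolder / filename
        if not path.is_file():
            missing.append(path)
    return missing


# Assumed install location; replace with the path to your ComfyUI folder.
missing = find_missing_models("ComfyUI")
if missing:
    for path in missing:
        print(f"Missing: {path}")
else:
    print("All PixelDiT model files found.")
```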