PixelDiT ComfyUI Workflow Example

PixelDiT is NVIDIA’s pixel-space diffusion transformer for 1024px text-to-image generation. Unlike traditional diffusion models that operate in latent space, PixelDiT generates images directly in pixel space using a dual-level DiT architecture (a patch-level DiT combined with a pixel-level DiT), with MM-DiT fusion for joint attention between text and image tokens.

Model Highlights:

VAE-free — generates directly in pixel space; no traditional VAE encode/decode
Dual-level DiT — patch-level DiT + pixel-level DiT for high-quality generation
Multi aspect ratio — 1024px base resolution with support for several aspect ratios
~1.3B parameters — efficient enough for consumer GPUs
License: NSCLv1 (non-commercial research/evaluation only)

PixelDiT text-to-image workflow

Download Workflow

Download the workflow JSON, or search for “PixelDiT” in the Template Library.

Make sure your ComfyUI is updated.

The workflows in this guide are available in the Workflow Templates. If you can’t find them there, your ComfyUI may be outdated (updates to the Desktop version lag behind somewhat).

If nodes are missing when loading a workflow, possible reasons are:

You are not using the latest ComfyUI version (Nightly version)
Some nodes failed to import at startup

The workflow consists of three main nodes:

ResolutionSelector — choose your desired output resolution
Text to Image (PixelDiT) subgraph — the core generation node with exposed controls for prompt, seed, model selection and resolution
SaveImage — saves the generated image
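This guide does not list the presets that ResolutionSelector offers. As a rough illustration of what "several aspect ratios at a 1024px base" can mean, the sketch below keeps the pixel count close to 1024 × 1024 and snaps each side to a multiple of 16. The function name `dims_for_aspect`, the constant-area rule, and the multiple of 16 are all assumptions made for illustration; they are not taken from the node's implementation.

```python
import math


def dims_for_aspect(ratio_w, ratio_h, base=1024, multiple=16):
    """Return (width, height) with roughly base*base pixels at the given ratio.

    Hypothetical helper: the constant-area rule and the `multiple` snapping
    are assumptions, not the behavior of the ResolutionSelector node.
    """
    area = base * base
    width = math.sqrt(area * ratio_w / ratio_h)
    height = math.sqrt(area * ratio_h / ratio_w)

    def snap(value):
        # Round to the nearest multiple, never going below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    for rw, rh in [(1, 1), (4, 3), (16, 9), (9, 16)]:
        print(f"{rw}:{rh} ->", dims_for_aspect(rw, rh))
```

If the dimensions this produces differ from what the node offers, trust the node: its presets are the ones the workflow is set up to use.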

Learn about Subgraph

This workflow uses Subgraph nodes for modular processing. Check out the Subgraph documentation to learn how to customize and extend the workflow.

Workflow controls

The exposed controls on the Text to Image (PixelDiT) subgraph node include:

Control	Description
Positive Prompt	The text prompt describing the image you want to generate
Negative Prompt	Text describing what to avoid in the generated image
Seed	Random seed for reproducibility
UNet Model	PixelDiT model checkpoint selection
CLIP Model	Text encoder model selection

Model downloads

PixelDiT uses two model files: a text encoder and the diffusion model.

Text Encoder

gemma_2_2b_it_elm_bf16.safetensors — Gemma-2-2B-IT text encoder

Diffusion Model

pixeldit_1300m_1024px_bf16.safetensors — PixelDiT 1300M 1024px diffusion model
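The file name indicates 1300M parameters stored in bf16, which gives a quick back-of-the-envelope estimate of the weight size. The sketch below assumes 2 bytes per bf16 parameter and counts only the diffusion model weights; the text encoder, activations, and any runtime overhead are not included, so actual memory use will be higher.

```python
def weight_size_gb(num_params, bytes_per_param=2):
    """Approximate size of the raw weights in gigabytes (10**9 bytes).

    bytes_per_param=2 corresponds to bf16 storage.
    """
    return num_params * bytes_per_param / 1e9


# "1300m" in the file name, taken here as 1.3 billion parameters.
PIXELDIT_PARAMS = 1_300_000_000

if __name__ == "__main__":
    print(f"PixelDiT weights: about {weight_size_gb(PIXELDIT_PARAMS):.1f} GB in bf16")
```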

Model storage location

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 text_encoders/
│   │      └── gemma_2_2b_it_elm_bf16.safetensors
│   └── 📂 diffusion_models/
│          └── pixeldit_1300m_1024px_bf16.safetensors
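To confirm that both files ended up in the folders shown above, a small check like the following may help. It is a minimal sketch that assumes a standard ComfyUI folder layout; the helper name `find_missing_models` is made up for this example, and custom model paths (for example via an extra model paths config) are not considered.

```python
import sys
from pathlib import Path

# Files this guide expects, relative to the ComfyUI root folder.
EXPECTED_FILES = [
    "models/text_encoders/gemma_2_2b_it_elm_bf16.safetensors",
    "models/diffusion_models/pixeldit_1300m_1024px_bf16.safetensors",
]


def find_missing_models(comfy_root):
    """Return the expected model paths that are not present under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # The ComfyUI root can be passed as the first argument; defaults to ./ComfyUI.
    root = sys.argv[1] if len(sys.argv) > 1 else "ComfyUI"
    missing = find_missing_models(root)
    if missing:
        print("Missing model files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All PixelDiT model files found.")
```

Run it from the directory that contains your ComfyUI folder, or pass the ComfyUI root path as the first argument.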
