This guide will help you understand the concept of text-to-image in AI art generation and complete a text-to-image workflow in ComfyUI.
This guide aims to introduce you to ComfyUI’s text-to-image workflow and help you understand the functionality and usage of various ComfyUI nodes.
In this document, we’ll start by running a text-to-image workflow, followed by explanations of related concepts. Please choose the relevant sections based on your needs.
Text to Image is a fundamental process in AI art generation that creates images from text descriptions, with diffusion models at its core.
The text-to-image process requires the following elements: an image model, plus positive and negative prompts.
This text-to-image generation process can be simply understood as telling your requirements (positive and negative prompts) to an artist (the image model), who then creates what you want based on these requirements.
Ensure you have at least one SD1.5 model file in your ComfyUI/models/checkpoints folder, such as v1-5-pruned-emaonly-fp16.safetensors. If you haven’t installed it yet, please refer to the model installation section in Getting Started with ComfyUI AI Art Generation.
Download the image below and drag it into ComfyUI to load the workflow. Images containing workflow JSON in their metadata can be directly dragged into ComfyUI or loaded using the menu Workflows -> Open (Ctrl+O).
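If you are curious how this works under the hood, here is a small sketch that reads the embedded workflow JSON from such an image’s PNG metadata. It assumes ComfyUI’s default behavior of writing the workflow into a "workflow" text chunk, and the filename is just a placeholder.

```python
# Minimal sketch: read the workflow JSON embedded in a ComfyUI-generated PNG.
# Assumes the image was saved by ComfyUI, which stores the workflow under a
# "workflow" PNG text chunk; "text_to_image.png" is a placeholder filename.
import json
from PIL import Image

def read_embedded_workflow(path: str) -> dict:
    with Image.open(path) as img:
        workflow_text = img.info.get("workflow")  # PNG text chunk written by ComfyUI
        if workflow_text is None:
            raise ValueError(f"No workflow metadata found in {path}")
        return json.loads(workflow_text)

workflow = read_embedded_workflow("text_to_image.png")
print(f"Loaded a workflow with {len(workflow.get('nodes', []))} nodes")
```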
After installing the image model, follow the steps in the image below to load the model and generate your first image.
Follow these steps according to the image numbers:
1. In the Load Checkpoint node, make sure v1-5-pruned-emaonly-fp16.safetensors (or another SD1.5 model) is selected
2. Click the Queue button or use the shortcut Ctrl + Enter to execute image generation
3. After the process completes, you should see the resulting image in the Save Image node interface, which you can right-click to save locally
After your first successful generation, you can experiment further:
- By default, the seed parameter is randomized after each run, so each generation will produce different results
- Try modifying the text in the CLIP Text Encoder nodes: the Positive connection to the KSampler node represents positive prompts, while the Negative connection represents negative prompts
Here are some basic prompting principles for the SD1.5 model:
- Separate keywords with commas (,)
- Use the (golden hour:1.2) syntax to increase the weight of specific keywords, making them more likely to appear in the image; 1.2 is the weight and golden hour is the keyword
- Add quality keywords such as masterpiece, best quality, 4k to improve generation quality

Here are several prompt examples you can try, or use your own prompts for generation:
1. Anime Style
Positive prompts:
Negative prompts:
2. Realistic Style
Positive prompts:
Negative prompts:
3. Specific Artist Style
Positive prompts:
Negative prompts:
The entire text-to-image process can be understood as a reverse diffusion process. The v1-5-pruned-emaonly-fp16.safetensors model we downloaded is pre-trained to generate target images from pure Gaussian noise: we only need to input our prompts, and the model produces the target image by progressively denoising random noise.
We need to understand two concepts here: latent space, the compressed representation in which the diffusion model works, and pixel space, the image representation we actually see.
If you want to learn more about diffusion models, you can read these papers:
The Load Checkpoint node is typically used to load the image generation model. A checkpoint usually contains three components: MODEL (UNet), CLIP, and VAE.

- MODEL (UNet): The UNet model responsible for noise prediction and image generation during the diffusion process
- CLIP: The text encoder that converts our text prompts into vectors the model can understand, since the model cannot directly understand text prompts
- VAE: The Variational AutoEncoder that converts images between pixel space and latent space, since diffusion models work in latent space while our images are in pixel space

The Empty Latent Image node defines a latent space that is output to the KSampler node; it constructs a pure noise latent space.
You can think of its function as defining the canvas size, which determines the dimensions of our final generated image.
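For a sense of scale: SD1.5’s VAE compresses every 8×8 block of pixels into a single latent position with 4 channels, so the canvas size you choose corresponds to a much smaller latent tensor. A small illustrative sketch (not ComfyUI code):

```python
# Illustrative only: the shape of the latent tensor behind a 512x512 canvas.
# SD1.5's VAE downscales width and height by a factor of 8 and uses 4 channels.
import torch

batch_size, width, height = 1, 512, 512
empty_latent = torch.zeros(batch_size, 4, height // 8, width // 8)
print(empty_latent.shape)  # torch.Size([1, 4, 64, 64])
```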
The CLIP Text Encoder nodes are used to encode prompts, which are your requirements for the image.

- The node whose output connects to the Positive condition input of the KSampler represents positive prompts (elements you want in the image)
- The node whose output connects to the Negative condition input represents negative prompts (elements you don’t want in the image)

The prompts are encoded into semantic vectors by the CLIP component from the Load Checkpoint node and output as conditions to the KSampler node.
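If you want a feel for what “encoded into semantic vectors” means, the sketch below uses the Hugging Face transformers library to run a prompt through the CLIP text encoder that SD1.5 is based on. This is not ComfyUI’s code, and ComfyUI’s conditioning output wraps these embeddings with additional metadata; it is only meant to show the shape of the result.

```python
# Rough sketch of CLIP text encoding (not ComfyUI's actual implementation).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a cat sitting in a garden, masterpiece, best quality"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    conditioning = text_encoder(**tokens).last_hidden_state

print(conditioning.shape)  # torch.Size([1, 77, 768]): 77 tokens, 768-dim vectors
```

Both the positive and the negative prompt pass through this same encoder; the KSampler then uses the two resulting conditions together with the cfg scale described below.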
The KSampler is the core of the entire workflow; this is where the noise denoising process takes place, ultimately outputting a latent space image.
Here’s an explanation of the KSampler node parameters:
| Parameter Name | Description | Function |
| --- | --- | --- |
| model | Diffusion model used for denoising | Determines the style and quality of generated images |
| positive | Positive prompt condition encoding | Guides generation to include specified elements |
| negative | Negative prompt condition encoding | Suppresses unwanted content |
| latent_image | Latent space image to be denoised | Serves as the input carrier for noise initialization |
| seed | Random seed for noise generation | Controls generation randomness |
| control_after_generate | Seed control mode after generation | Determines how the seed changes after each generation (fixed, increment, decrement, or randomize) |
| steps | Number of denoising iterations | More steps yield finer detail but take longer to process |
| cfg | Classifier-free guidance scale | Controls how strongly the prompts constrain generation (values that are too high produce oversaturated, distorted results) |
| sampler_name | Sampling algorithm name | Determines the mathematical method used for the denoising path |
| scheduler | Scheduler type | Controls the noise decay rate and step-size allocation |
| denoise | Denoising strength coefficient | Controls how much noise is applied to the input latent; 0.0 preserves the original input, 1.0 starts from pure noise |
In the KSampler node, the latent space uses seed as an initialization parameter to construct random noise, and the semantic vectors Positive and Negative are passed as conditions to the diffusion model.

Then, based on the number of denoising steps specified by the steps parameter, denoising is performed. Each denoising step uses the denoising strength coefficient specified by the denoise parameter to denoise the latent space and generate a new latent space image.
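To make this loop concrete, here is a heavily simplified sketch of a classifier-free-guidance denoising loop. It is not ComfyUI’s sampler implementation: model stands in for any callable that predicts noise for a latent at a given noise level, and the linear sigma schedule is a toy replacement for the real scheduler.

```python
# Conceptual sketch of a KSampler-style denoising loop (not ComfyUI's code).
import torch

def ksampler_sketch(model, positive, negative, latent_shape,
                    seed, steps, cfg, denoise=1.0):
    generator = torch.Generator().manual_seed(seed)      # seed -> reproducible starting noise
    x = torch.randn(latent_shape, generator=generator)   # start from pure Gaussian noise
    sigmas = torch.linspace(denoise, 0.0, steps + 1)     # toy noise schedule, high -> low

    for i in range(steps):                               # one iteration per denoising step
        sigma = sigmas[i]
        noise_pos = model(x, sigma, positive)            # prediction conditioned on positive prompt
        noise_neg = model(x, sigma, negative)            # prediction conditioned on negative prompt
        # Classifier-free guidance: push the prediction toward the positive
        # prompt and away from the negative one, scaled by cfg
        noise_pred = noise_neg + cfg * (noise_pos - noise_neg)
        x = x - (sigma - sigmas[i + 1]) * noise_pred     # remove part of the predicted noise

    return x  # denoised latent image, ready for the VAE Decode node
```

In the real node, the sampler_name and scheduler parameters control how the noise levels and the update rule are chosen, which is why different samplers can produce noticeably different images from the same seed.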
The VAE Decode node converts the latent space image output by the KSampler into a pixel space image.
The Save Image node previews the decoded image and saves it to the local ComfyUI/output folder.
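As a rough illustration of these last two steps, the sketch below decodes a latent with the diffusers library’s AutoencoderKL and saves the result as a PNG. The Hugging Face repository id is an assumption and may differ from the weights you use; in the workflow, the latent would come from the KSampler rather than torch.randn.

```python
# Illustrative sketch of VAE decoding and saving (not ComfyUI's internals).
# The repo id below is an assumption; substitute whichever SD1.5 weights you have.
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5",
                                    subfolder="vae")
latent = torch.randn(1, 4, 64, 64)                 # random here; the KSampler would supply this
with torch.no_grad():
    decoded = vae.decode(latent / 0.18215).sample  # 0.18215 is SD1.5's latent scaling factor

print(decoded.shape)  # torch.Size([1, 3, 512, 512]): a 64x64 latent becomes a 512x512 image

# Convert the [-1, 1] tensor to an 8-bit RGB image and save it, which is
# roughly what VAE Decode followed by Save Image does.
pixels = ((decoded[0] / 2 + 0.5).clamp(0, 1) * 255).permute(1, 2, 0).byte().numpy()
Image.fromarray(pixels).save("decoded.png")
```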
SD1.5 (Stable Diffusion 1.5) is an AI image generation model developed by Stability AI. It’s the foundational version of the Stable Diffusion series, trained on 512×512 resolution images, making it particularly good at generating images at this resolution. With a size of about 4GB, it runs smoothly on consumer-grade GPUs (e.g., 6GB VRAM). Currently, SD1.5 has a rich ecosystem, supporting various plugins (like ControlNet, LoRA) and optimization tools. As a milestone model in AI art generation, SD1.5 remains the best entry-level choice thanks to its open-source nature, lightweight architecture, and rich ecosystem. Although newer versions like SDXL/SD3 have been released, its value for consumer-grade hardware remains unmatched.
Model Advantages:
Model Limitations: