WanImageToVideo - ComfyUI Built-in Node Documentation

The WanImageToVideo node prepares the conditioning and latent inputs for Wan image-to-video generation. It creates an empty (zero-filled) latent tensor sized for the requested video. It can also take an optional start image and an optional CLIP vision output to guide generation. When either optional input is provided, the node attaches the corresponding data to both the positive and negative conditioning. Otherwise, the conditioning passes through unchanged.

Inputs

Parameter	Description	Data Type	Required	Range
`positive`	Positive conditioning input for guiding the generation	CONDITIONING	Yes	-
`negative`	Negative conditioning input for guiding the generation	CONDITIONING	Yes	-
`vae`	VAE model for encoding images to latent space	VAE	Yes	-
`width`	Width of the output video (default: 832, step: 16)	INT	Yes	16 to MAX_RESOLUTION
`height`	Height of the output video (default: 480, step: 16)	INT	Yes	16 to MAX_RESOLUTION
`length`	Number of frames in the video (default: 81, step: 4)	INT	Yes	1 to MAX_RESOLUTION
`batch_size`	Number of videos to generate in a batch (default: 1)	INT	Yes	1 to 4096
`clip_vision_output`	Optional CLIP vision output (for example, from a CLIP Vision Encode node), attached to both the positive and negative conditioning as additional guidance	CLIP_VISION_OUTPUT	No	-
`start_image`	Optional starting image (one or more frames) used to initialize the video. When provided, each frame is resized to the specified width and height and placed at the start of the video. The remaining frames are filled with neutral gray (0.5). Frames beyond `length` are ignored.	IMAGE	No	-
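The following is a minimal sketch of how the frame sequence described for `start_image` could be assembled, written in plain Python rather than tensors. `build_frame_sequence` and the flattened-list `Frame` representation are hypothetical. Resizing is omitted. The sketch also assumes that start frames beyond `length` are dropped.

```python
from typing import List

# Hypothetical representation: one frame, already resized to width x height,
# flattened to a list of height * width * channels floats.
Frame = List[float]


def build_frame_sequence(start_frames: List[Frame], length: int,
                         pixels_per_frame: int, fill: float = 0.5) -> List[Frame]:
    """Place the start frames first and pad the rest of the video with gray."""
    used = start_frames[:length]  # assumption: extra start frames are ignored
    padding = [[fill] * pixels_per_frame for _ in range(length - len(used))]
    return [list(frame) for frame in used] + padding


# Usage: one 2x1 RGB start frame (6 values) in a 5-frame video.
seq = build_frame_sequence([[1.0] * 6], length=5, pixels_per_frame=6)
print(len(seq), seq[0][:3], seq[1][:3])  # prints: 5 [1.0, 1.0, 1.0] [0.5, 0.5, 0.5]
```

The gray fill gives the padded frames a neutral midpoint value, so only the start frames carry image content.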

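Note: When `start_image` is provided, the node VAE-encodes the full `length`-frame sequence (start frames followed by gray padding). It then attaches the encoded latent, together with a mask, to both the positive and negative conditioning. The mask is defined over latent frames. It covers every latent frame except the leading ones that correspond to the start image, so generation builds on the provided image rather than replacing it. When `clip_vision_output` is provided, it is attached to both the positive and negative conditioning as additional vision-based guidance.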
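The following sketch shows the per-latent-frame profile of that mask in plain Python. It rests on two assumptions: 1.0 marks a latent frame the model should generate (covered by the mask) and 0.0 a frame taken from the start image. It also assumes the number of uncovered leading latent frames follows the same `((n-1)//4)+1` formula used for the latent length. `temporal_mask` is a hypothetical helper, not part of ComfyUI.

```python
from typing import List


def temporal_mask(length: int, num_start_frames: int) -> List[float]:
    """Per-latent-frame mask: 0.0 = taken from the start image, 1.0 = generated."""
    latent_frames = ((length - 1) // 4) + 1   # same formula as the latent output
    n = min(num_start_frames, length)         # assumption: extra start frames are ignored
    known = ((n - 1) // 4) + 1                # assumption: leading latent frames kept from the start image
    return [0.0] * known + [1.0] * (latent_frames - known)


# Usage: a single start image in the default 81-frame video.
mask = temporal_mask(81, 1)
print(len(mask), mask[:3])  # prints: 21 [0.0, 1.0, 1.0]
```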

Outputs

Output Name	Description	Data Type
`positive`	Positive conditioning, with the encoded start image, mask, and CLIP vision output attached when the corresponding inputs are provided (otherwise unchanged)	CONDITIONING
`negative`	Negative conditioning, with the encoded start image, mask, and CLIP vision output attached when the corresponding inputs are provided (otherwise unchanged)	CONDITIONING
`latent`	Zero-filled latent tensor for the video, with shape [batch_size, 16, ((length-1)//4)+1, height//8, width//8]	LATENT
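The shape formula can be checked with a little arithmetic. With the defaults (width 832, height 480, length 81, batch_size 1), there are (81 - 1) // 4 + 1 = 21 latent frames, 480 // 8 = 60 latent rows, and 832 // 8 = 104 latent columns. The helper below is a hypothetical sketch that evaluates the documented formula only. It does not allocate a tensor.

```python
from typing import Tuple


def wan_latent_shape(width: int, height: int, length: int,
                     batch_size: int = 1) -> Tuple[int, int, int, int, int]:
    """Shape of the empty latent: [batch, channels, frames, height, width]."""
    # One latent frame for frame 0, plus one per further complete group of 4 frames.
    latent_frames = ((length - 1) // 4) + 1
    return (batch_size, 16, latent_frames, height // 8, width // 8)


print(wan_latent_shape(832, 480, 81))  # prints: (1, 16, 21, 60, 104)
```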

This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute!

Source fingerprint (SHA-256): 9cac4f27f5ec2e0d5247fab78acb00a68eb6317dd747d6f6f46b065240f64a8b