HunyuanVideo15ImageToVideo - ComfyUI Built-in Node Documentation

The HunyuanVideo15ImageToVideo node prepares conditioning and latent space data for video generation based on the HunyuanVideo 1.5 model. It creates an initial latent representation for a video sequence and can optionally integrate a starting image or a CLIP vision output to guide the generation process.

Inputs

Parameter	Description	Data Type	Required	Range
`positive`	The positive conditioning prompts that describe what the video should contain.	CONDITIONING	Yes	-
`negative`	The negative conditioning prompts that describe what the video should avoid.	CONDITIONING	Yes	-
`vae`	The VAE (Variational Autoencoder) model used to encode the starting image into the latent space.	VAE	Yes	-
`width`	The width of the output video frames in pixels. Must be divisible by 16. (default: 848)	INT	No	16 to MAX_RESOLUTION, step: 16
`height`	The height of the output video frames in pixels. Must be divisible by 16. (default: 480)	INT	No	16 to MAX_RESOLUTION, step: 16
`length`	The total number of frames in the video sequence. With a minimum of 1 and a step of 4, valid values are one more than a multiple of 4 (1, 5, 9, ...). (default: 33)	INT	No	1 to MAX_RESOLUTION, step: 4
`batch_size`	The number of video sequences to generate in a single batch. (default: 1)	INT	No	1 to 4096
`start_image`	An optional starting image (or batch of frames) used to initialize the video generation. If provided, it is encoded with the VAE and used to condition the first frames. Only the first `length` frames of the image batch are used.	IMAGE	No	-
`clip_vision_output`	Optional CLIP vision embeddings to provide additional visual conditioning for the generation.	CLIP_VISION_OUTPUT	No	-
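
The size constraints in the table can be checked before wiring the node into a workflow. The following is a minimal sketch, not the node's own validation code: `check_inputs` is a hypothetical helper, and the `length` rule is read off the range column (minimum 1, step 4), which the default of 33 satisfies.

```python
def check_inputs(width: int, height: int, length: int) -> list[str]:
    """Return a list of problems with the requested video dimensions.

    Hypothetical helper for illustration; it mirrors the ranges listed
    in the Inputs table rather than any ComfyUI API.
    """
    problems = []
    # width and height start at 16 and advance in steps of 16
    if width < 16 or width % 16 != 0:
        problems.append(f"width {width} is not a positive multiple of 16")
    if height < 16 or height % 16 != 0:
        problems.append(f"height {height} is not a positive multiple of 16")
    # length starts at 1 and advances in steps of 4: 1, 5, 9, ..., 33, ...
    if length < 1 or (length - 1) % 4 != 0:
        problems.append(f"length {length} is not of the form 4n + 1")
    return problems


if __name__ == "__main__":
    print(check_inputs(848, 480, 33))  # defaults from the table
    print(check_inputs(850, 480, 32))
```

An empty list means the values fit the documented ranges; the upper bound MAX_RESOLUTION is not checked here because its value is not stated on this page.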

Note: When a `start_image` is provided, it is automatically resized to match the specified `width` and `height` using bilinear interpolation. The first `length` frames of the image batch are used. The encoded image is then added to both the positive and negative conditioning as a `concat_latent_image` with a corresponding `concat_mask`. The mask is set to 0.0 for the frames covered by the starting image and 1.0 for the remaining frames.
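
The mask layout described in the note can be illustrated along the time axis alone. This is a rough sketch under stated assumptions, not the node's implementation: `build_concat_mask` is a hypothetical name, the temporal compression factor of 4 is inferred from the step of `length`, and the real mask is a tensor that also carries spatial dimensions.

```python
def build_concat_mask(length: int, start_frames: int,
                      temporal_stride: int = 4) -> list[float]:
    """Per-latent-frame mask: 0.0 where the start image applies, 1.0 elsewhere.

    temporal_stride=4 is an assumption inferred from the step of `length`.
    """
    # Number of latent frames for a video of `length` pixel frames
    latent_frames = (length - 1) // temporal_stride + 1
    # Only the first `length` frames of the image batch are used
    used = min(start_frames, length)
    # Latent frames covered by the encoded start image (none if no image)
    covered = (used - 1) // temporal_stride + 1 if used > 0 else 0
    return [0.0] * covered + [1.0] * (latent_frames - covered)


if __name__ == "__main__":
    # A single start image with the default length of 33
    print(build_concat_mask(33, 1))
```

With a single start image, only the first entry is 0.0, so the remaining frames are left for the model to generate.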

Outputs

Output Name	Description	Data Type
`positive`	The modified positive conditioning, which includes the encoded starting image and the CLIP vision output when those inputs are provided.	CONDITIONING
`negative`	The modified negative conditioning, which includes the encoded starting image and the CLIP vision output when those inputs are provided.	CONDITIONING
`latent`	An empty latent tensor with dimensions configured for the specified batch size, video length, width, and height.	LATENT
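
To give a feel for how the `latent` dimensions relate to the inputs, here is a hedged sketch of the shape arithmetic. The compression factors (16 spatial, 4 temporal) are assumptions inferred from the step sizes in the Inputs table, and the channel count of 32 is a placeholder; `empty_latent_shape` is a hypothetical helper, so check the VAE in use for the actual values.

```python
def empty_latent_shape(width: int, height: int, length: int,
                       batch_size: int = 1,
                       channels: int = 32,        # placeholder, not confirmed by this page
                       spatial_stride: int = 16,  # assumed from the width/height step
                       temporal_stride: int = 4   # assumed from the length step
                       ) -> tuple[int, int, int, int, int]:
    """Return (batch, channels, latent_frames, latent_height, latent_width)."""
    latent_frames = (length - 1) // temporal_stride + 1
    return (batch_size, channels, latent_frames,
            height // spatial_stride, width // spatial_stride)


if __name__ == "__main__":
    # Defaults from the Inputs table: 848 x 480, 33 frames, batch of 1
    print(empty_latent_shape(848, 480, 33))
```

Under these assumptions, the default 848 x 480 video with 33 frames maps to 9 latent frames of 30 x 53.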

This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute!

Source fingerprint (SHA-256): 383b965a2e67c3643a13991ea5969c4d31ce17e48a57a400f89974f64e4b1e04