The HunyuanVideo15ImageToVideo node prepares conditioning and latent space data for video generation based on the HunyuanVideo 1.5 model. It creates an initial latent representation for a video sequence and can optionally integrate a starting image or a CLIP vision output to guide the generation process.

Inputs

| Parameter | Description | Data Type | Required | Range |
|---|---|---|---|---|
| positive | The positive conditioning prompts that describe what the video should contain. | CONDITIONING | Yes | - |
| negative | The negative conditioning prompts that describe what the video should avoid. | CONDITIONING | Yes | - |
| vae | The VAE (Variational Autoencoder) model used to encode the starting image into the latent space. | VAE | Yes | - |
| width | The width of the output video frames in pixels. Must be divisible by 16. (default: 848) | INT | No | 16 to MAX_RESOLUTION, step: 16 |
| height | The height of the output video frames in pixels. Must be divisible by 16. (default: 480) | INT | No | 16 to MAX_RESOLUTION, step: 16 |
| length | The total number of frames in the video sequence. Valid values start at 1 and advance in steps of 4 (1, 5, 9, ..., 33, ...), so they are one more than a multiple of 4. (default: 33) | INT | No | 1 to MAX_RESOLUTION, step: 4 |
| batch_size | The number of video sequences to generate in a single batch. (default: 1) | INT | No | 1 to 4096 |
| start_image | An optional starting image, or batch of images, used to initialize the video generation. If provided, it is encoded and used to condition the first frames. Only the first length frames of the batch are used. | IMAGE | No | - |
| clip_vision_output | Optional CLIP vision embeddings that provide additional visual conditioning for the generation. | CLIP_VISION_OUTPUT | No | - |

Note: When a start_image is provided, it is automatically resized to match the specified width and height using bilinear interpolation. The first length frames of the image batch are used. The encoded image is then added to both the positive and negative conditioning as a concat_latent_image with a corresponding concat_mask. The mask is set to 0.0 for the frames covered by the starting image and 1.0 for the remaining frames.
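
The mask rule in the note above can be illustrated with a minimal sketch. It uses plain Python lists rather than tensors, and the function name build_concat_mask and its parameters are hypothetical, not part of the node's API. How image frames map to latent frames is not stated here, so the frame counts are passed in directly.

```python
def build_concat_mask(total_frames: int, covered_frames: int) -> list[float]:
    """Return one mask value per frame (hypothetical helper, not the node's code).

    0.0 marks frames whose content comes from the starting image,
    1.0 marks frames that are left for the model to generate.
    """
    if total_frames < 1:
        raise ValueError("total_frames must be at least 1")
    # Clamp so the covered region never exceeds the sequence or goes negative.
    covered = max(0, min(covered_frames, total_frames))
    return [0.0] * covered + [1.0] * (total_frames - covered)


# One covered frame at the start of a nine-frame sequence.
mask = build_concat_mask(total_frames=9, covered_frames=1)
print(mask)
```

With no starting image there is no concat_mask at all; calling the helper with covered_frames=0 only shows the degenerate case where every frame is left to the model.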

Outputs

| Output Name | Description | Data Type |
|---|---|---|
| positive | The modified positive conditioning, which may now include the encoded starting image or CLIP vision output. | CONDITIONING |
| negative | The modified negative conditioning, which may now include the encoded starting image or CLIP vision output. | CONDITIONING |
| latent | An empty latent tensor with dimensions configured for the specified batch size, video length, width, and height. | LATENT |

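
As a rough illustration of how the latent dimensions might follow from the inputs, the sketch below validates the parameters and computes a shape tuple. The compression factors (16 spatially, 4 temporally) are assumptions inferred from the step sizes in the inputs table, and the channel count is passed in by the caller because this page does not state it. The helper empty_latent_shape is hypothetical and is not the node's implementation.

```python
def empty_latent_shape(
    batch_size: int,
    length: int,
    height: int,
    width: int,
    channels: int,
    spatial_factor: int = 16,   # assumption: inferred from the width/height step
    temporal_factor: int = 4,   # assumption: inferred from the length step
) -> tuple[int, int, int, int, int]:
    """Return (batch, channels, frames, height, width) for an empty latent."""
    if width % spatial_factor or height % spatial_factor:
        raise ValueError(f"width and height must be divisible by {spatial_factor}")
    if length < 1 or (length - 1) % temporal_factor:
        raise ValueError(f"length must be 1 plus a multiple of {temporal_factor}")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    latent_frames = (length - 1) // temporal_factor + 1
    return (
        batch_size,
        channels,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
    )


# Defaults from the inputs table; channels=32 is a placeholder value.
shape = empty_latent_shape(batch_size=1, length=33, height=480, width=848, channels=32)
print(shape)  # (1, 32, 9, 30, 53)
```

Under these assumptions the default 848 by 480, 33-frame request maps to 9 latent frames on a 53 by 30 grid. If the real factors differ, only the two default arguments need to change.
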
