WanHuMoImageToVideo - ComfyUI Built-in Node Documentation

The WanHuMoImageToVideo node prepares image-to-video generation: it produces a latent representation for the video frames and updates the conditioning inputs, optionally incorporating a reference image and audio embeddings. It outputs the modified positive and negative conditioning together with a latent suitable for video synthesis.

Inputs

Parameter	Description	Data Type	Required	Range
`positive`	Positive conditioning input that guides the video generation toward desired content	CONDITIONING	Yes	-
`negative`	Negative conditioning input that steers the video generation away from unwanted content	CONDITIONING	Yes	-
`vae`	VAE model used for encoding reference images into latent space	VAE	Yes	-
`width`	Width of the output video frames in pixels (default: 832, must be divisible by 16)	INT	Yes	16 to MAX_RESOLUTION
`height`	Height of the output video frames in pixels (default: 480, must be divisible by 16)	INT	Yes	16 to MAX_RESOLUTION
`length`	Number of frames in the generated video sequence (default: 97; length - 1 must be divisible by 4, e.g. 1, 5, 9, ..., 97)	INT	Yes	1 to MAX_RESOLUTION
`batch_size`	Number of video sequences to generate simultaneously (default: 1)	INT	Yes	1 to 4096
`audio_encoder_output`	Optional audio encoding data that can influence video generation based on audio content	AUDIOENCODEROUTPUT	No	-
`ref_image`	Optional reference image used to guide the video generation style and content	IMAGE	No	-
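
The divisibility rules in the table can be checked before values reach the node. The following is a minimal sketch, not part of ComfyUI: the helper names `snap_dimension` and `snap_length` are hypothetical, and rounding down is an arbitrary choice made for illustration.

```python
def snap_dimension(value: int, multiple: int = 16) -> int:
    """Round down to the nearest multiple of `multiple`, never below `multiple`."""
    return max(multiple, (value // multiple) * multiple)


def snap_length(frames: int) -> int:
    """Round down to the nearest frame count of the form 4*k + 1, never below 1."""
    return max(1, ((frames - 1) // 4) * 4 + 1)


if __name__ == "__main__":
    # Hypothetical user-supplied values that do not satisfy the constraints.
    print(snap_dimension(850), snap_dimension(490), snap_length(100))
```

Values that already satisfy the rules, such as the defaults 832, 480, and 97, pass through unchanged.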

Note: When a reference image is provided, it is encoded with the VAE and added to both the positive and negative conditioning. When audio encoder output is provided, it is processed and incorporated into the conditioning data. If neither is provided, zero-filled placeholder tensors are used for both the reference latents and the audio embeddings.
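
The fallback behavior described in the note can be pictured with a small, framework-free sketch. It is not the node's implementation: the key names `reference_latents` and `audio_embedding`, the helper names, and the string used as a placeholder are all hypothetical, and the sketch assumes each optional input falls back independently. It only relies on the common ComfyUI convention that conditioning is a list of `[embedding, options_dict]` pairs.

```python
from copy import copy

# Stand-in for a zero-filled tensor; a real node would build an actual tensor.
ZERO_PLACEHOLDER = "zero-filled tensor"


def build_extras(ref_latent=None, audio_embedding=None):
    """Pick the provided value, or the placeholder when an input is missing."""
    return {
        "reference_latents": ref_latent if ref_latent is not None else ZERO_PLACEHOLDER,
        "audio_embedding": audio_embedding if audio_embedding is not None else ZERO_PLACEHOLDER,
    }


def attach_to_conditioning(conditioning, extras):
    """Return a new conditioning list with `extras` merged into each options dict."""
    updated = []
    for embedding, options in conditioning:
        merged = copy(options)  # shallow copy so the caller's dict is left untouched
        merged.update(extras)
        updated.append([embedding, merged])
    return updated


if __name__ == "__main__":
    positive = [["text-embedding", {"pooled_output": None}]]
    extras = build_extras(ref_latent="encoded reference image")
    print(attach_to_conditioning(positive, extras))
```

Returning a new list rather than mutating the input keeps the original conditioning usable by other nodes in the same workflow.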

Outputs

Output Name	Description	Data Type
`positive`	Modified positive conditioning with reference image and/or audio embeddings incorporated	CONDITIONING
`negative`	Modified negative conditioning with reference image and/or audio embeddings incorporated	CONDITIONING
`latent`	Generated latent representation containing the video sequence data	LATENT
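
To get a feel for the size of the `latent` output, the sketch below estimates its shape from the input parameters. The compression factors are assumptions typical of Wan-style video VAEs (16 latent channels, 8x spatial and 4x temporal compression) and are not stated on this page; the function name `estimate_latent_shape` is hypothetical.

```python
def estimate_latent_shape(width, height, length, batch_size=1,
                          channels=16, spatial_factor=8, temporal_factor=4):
    """Estimate (batch, channels, frames, height, width) of the video latent.

    All three factor defaults are assumptions, not values taken from the node.
    """
    # The first frame maps to its own latent frame; every further group of
    # `temporal_factor` frames adds one more, which is why length - 1 is
    # expected to be divisible by 4.
    latent_frames = (length - 1) // temporal_factor + 1
    return (batch_size, channels, latent_frames,
            height // spatial_factor, width // spatial_factor)


if __name__ == "__main__":
    # Node defaults: 832x480, 97 frames, batch size 1.
    print(estimate_latent_shape(832, 480, 97))
```

Under these assumptions the default settings give 25 latent frames at 60 by 104 latent pixels.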

This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute!

Source fingerprint (SHA-256): 4d28fe2617f25e72745d34bf2ec19aec2df6e89ad49eabe086ad045690f42d1f
