TextEncodeHunyuanVideo_ImageToVideo - ComfyUI Built-in Node Documentation

The TextEncodeHunyuanVideo_ImageToVideo node creates conditioning data for video generation by combining text prompts with image embeddings. It uses a CLIP model to process both the text input and visual information from a CLIP vision output, then generates tokens that blend these two sources according to the specified image interleave setting.

Inputs

Parameter	Description	Data Type	Required	Range
`clip`	The CLIP model used for tokenization and encoding	CLIP	Yes	-
`clip_vision_output`	The visual embeddings from a CLIP vision model that provide image context	CLIP_VISION_OUTPUT	Yes	-
`prompt`	The text description to guide the video generation. Supports multiline input and dynamic prompts. The prompt is formatted using a template that asks the model to describe the video based on the reference image, covering aspects like main content, object details, actions, background, and camera angles.	STRING	Yes	-
`image_interleave`	Controls the balance between the image and the text prompt. Higher values give the text prompt more influence relative to the image. (default: 2)	INT	Yes	1-512
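
One way to picture why a higher `image_interleave` favors the text prompt is to treat the value as a stride over the image tokens: a larger stride keeps fewer image tokens, so the text makes up a larger share of the conditioning. The sketch below is only an illustration under that assumption. It is not taken from the node's source, and `interleave_image_tokens` is a hypothetical helper name.

```python
def interleave_image_tokens(image_tokens, image_interleave=2):
    """Keep every `image_interleave`-th image token.

    Assumption for illustration only: the setting is modeled as a stride.
    The range check mirrors the documented bounds (1-512, default 2).
    """
    if not 1 <= image_interleave <= 512:
        raise ValueError("image_interleave must be between 1 and 512")
    return image_tokens[::image_interleave]


# Eight placeholder image tokens standing in for real embeddings.
image_tokens = [f"img{i}" for i in range(8)]

kept_default = interleave_image_tokens(image_tokens)     # stride 2
kept_sparse = interleave_image_tokens(image_tokens, 4)   # stride 4

print(len(kept_default), len(kept_sparse))
```

With a stride of 1 every image token is kept, which under this reading is the setting where the image has the most influence.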

Outputs

Output Name	Description	Data Type
`CONDITIONING`	The conditioning data that combines text and image information for video generation	CONDITIONING
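
As a rough usage sketch, the inputs above could be wired up in a ComfyUI API-format workflow as shown below. This assumes the usual API layout, where each node has a `class_type` and an `inputs` mapping and links are written as `[node_id, output_index]`. The node IDs `"1"`, `"2"`, and `"3"` and the helper `build_i2v_text_encode_node` are hypothetical, and the upstream CLIP loader and CLIP vision encode nodes are not shown.

```python
import json


def build_i2v_text_encode_node(clip_ref, clip_vision_ref, prompt, image_interleave=2):
    """Build one API-format node entry for TextEncodeHunyuanVideo_ImageToVideo.

    clip_ref and clip_vision_ref are (node_id, output_index) pairs pointing
    at upstream nodes; the IDs used below are placeholders.
    """
    # Enforce the documented range for image_interleave (1-512).
    if not isinstance(image_interleave, int) or not 1 <= image_interleave <= 512:
        raise ValueError("image_interleave must be an integer between 1 and 512")
    return {
        "class_type": "TextEncodeHunyuanVideo_ImageToVideo",
        "inputs": {
            "clip": list(clip_ref),
            "clip_vision_output": list(clip_vision_ref),
            "prompt": prompt,
            "image_interleave": image_interleave,
        },
    }


# Hypothetical fragment: node "1" supplies CLIP, node "2" supplies CLIP_VISION_OUTPUT.
workflow_fragment = {
    "3": build_i2v_text_encode_node(
        ("1", 0), ("2", 0), "A cat walks across a sunlit room."
    ),
}

print(json.dumps(workflow_fragment, indent=2))
```

The node's single `CONDITIONING` output would then be referenced by downstream nodes as `["3", 0]`.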


Source fingerprint (SHA-256): ecc190941e8d355bc6e6e4b5b7938d54a79e70a7ff0049157dab30b720605e6a
