# TextEncodeHunyuanVideo_ImageToVideo - ComfyUI Built-in Node Documentation

> Complete documentation for the TextEncodeHunyuanVideo_ImageToVideo node in ComfyUI. Learn its inputs, outputs, parameters and usage.

The TextEncodeHunyuanVideo\_ImageToVideo node creates conditioning data for video generation by combining a text prompt with image embeddings. It uses the supplied CLIP model to tokenize the prompt together with the visual embeddings from a CLIP vision output, blending the two sources according to the `image_interleave` setting, and then encodes the resulting tokens into conditioning.

## Inputs

| Parameter            | Data Type            | Required | Range | Description                                                                                                                                                                                                                                                                                                    |
| -------------------- | -------------------- | -------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `clip`               | CLIP                 | Yes      | -     | The CLIP model used for tokenization and encoding                                                                                                                                                                                                                                                              |
| `clip_vision_output` | CLIP\_VISION\_OUTPUT | Yes      | -     | The visual embeddings from a CLIP vision model that provide image context                                                                                                                                                                                                                                      |
| `prompt`             | STRING               | Yes      | -     | The text description to guide the video generation. Supports multiline input and dynamic prompts. The prompt is formatted using a template that asks the model to describe the video based on the reference image, covering aspects like main content, object details, actions, background, and camera angles. |
| `image_interleave`   | INT                  | Yes      | 1-512 | How much the image influences things vs the text prompt. Higher number means more influence from the text prompt. (default: 2)                                                                                                                                                                                 |
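
As a rough illustration of how these inputs fit together, the sketch below builds an entry for this node in ComfyUI's API-format workflow JSON, where a link to another node's output is written as `[source_node_id, output_index]`. The helper name `build_i2v_encode_node`, the upstream node IDs `"1"` and `"2"`, and the prompt text are assumptions made for illustration. The range check simply mirrors the 1-512 range listed in the table above.

```python
import json

NODE_CLASS = "TextEncodeHunyuanVideo_ImageToVideo"
INTERLEAVE_MIN, INTERLEAVE_MAX = 1, 512  # range taken from the inputs table


def build_i2v_encode_node(clip_link, clip_vision_link, prompt, image_interleave=2):
    """Return an API-format node entry.

    Hypothetical helper for illustration; it is not part of ComfyUI.
    clip_link and clip_vision_link are (node_id, output_index) pairs.
    """
    if isinstance(image_interleave, bool) or not isinstance(image_interleave, int):
        raise TypeError("image_interleave must be an integer")
    if not INTERLEAVE_MIN <= image_interleave <= INTERLEAVE_MAX:
        raise ValueError(
            f"image_interleave must be in [{INTERLEAVE_MIN}, {INTERLEAVE_MAX}]"
        )
    return {
        "class_type": NODE_CLASS,
        "inputs": {
            "clip": list(clip_link),
            "clip_vision_output": list(clip_vision_link),
            "prompt": prompt,
            "image_interleave": image_interleave,
        },
    }


# Assumed upstream nodes: "1" provides CLIP, "2" provides CLIP_VISION_OUTPUT.
node = build_i2v_encode_node(("1", 0), ("2", 0), "A cat walks across a sunlit room.")
print(json.dumps(node, indent=2))
```

Validating `image_interleave` on the client side is optional, but it can surface an out-of-range value before the workflow is submitted.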

## Outputs

| Output Name    | Data Type    | Description                                                                         |
| -------------- | ------------ | ----------------------------------------------------------------------------------- |
| `CONDITIONING` | CONDITIONING | The conditioning data that combines text and image information for video generation |
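
Because the node has a single output, a downstream node in an API-format workflow would refer to it as output index 0. The sketch below shows that wiring on a partial workflow fragment. The node IDs, the helper names `output_link` and `consumers_of`, and the choice of a `KSampler` entry with a `positive` input as the consumer are illustrative assumptions, and the consumer's other inputs are omitted, so this is not a complete workflow.

```python
def output_link(node_id, output_index=0):
    """API-format reference to a node output: [node_id, output_index]."""
    return [str(node_id), output_index]


def consumers_of(workflow, node_id, output_index=0):
    """List (consumer_id, input_name) pairs that read the given output."""
    target = output_link(node_id, output_index)
    return [
        (consumer_id, input_name)
        for consumer_id, entry in workflow.items()
        for input_name, value in entry["inputs"].items()
        if value == target
    ]


# Partial fragment; node IDs and the consumer node are assumptions.
workflow = {
    "3": {
        "class_type": "TextEncodeHunyuanVideo_ImageToVideo",
        "inputs": {
            "clip": output_link("1"),
            "clip_vision_output": output_link("2"),
            "prompt": "A cat walks across a sunlit room.",
            "image_interleave": 2,
        },
    },
    "4": {
        "class_type": "KSampler",
        # Only the conditioning link is shown; other sampler inputs are omitted.
        "inputs": {"positive": output_link("3")},
    },
}

print(consumers_of(workflow, "3"))
```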
