# HunyuanVideo15ImageToVideo - ComfyUI Built-in Node Documentation

> Complete documentation for the HunyuanVideo15ImageToVideo node in ComfyUI. Learn its inputs, outputs, parameters and usage.

> This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute! [Edit on GitHub](https://github.com/Comfy-Org/embedded-docs/blob/main/comfyui_embedded_docs/docs/HunyuanVideo15ImageToVideo/en.md)

The HunyuanVideo15ImageToVideo node prepares conditioning and latent space data for video generation based on the HunyuanVideo 1.5 model. It creates an initial latent representation for a video sequence and can optionally integrate a starting image or a CLIP vision output to guide the generation process.

## Inputs

| Parameter            | Data Type            | Required | Range                           | Description                                                                                                                                                                             |
| -------------------- | -------------------- | -------- | ------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `positive`           | CONDITIONING         | Yes      | -                               | The positive conditioning prompts that describe what the video should contain.                                                                                                          |
| `negative`           | CONDITIONING         | Yes      | -                               | The negative conditioning prompts that describe what the video should avoid.                                                                                                            |
| `vae`                | VAE                  | Yes      | -                               | The VAE (Variational Autoencoder) model used to encode the starting image into the latent space.                                                                                        |
| `width`              | INT                  | No       | 16 to MAX\_RESOLUTION, step: 16 | The width of the output video frames in pixels. Must be divisible by 16. (default: 848)                                                                                                 |
| `height`             | INT                  | No       | 16 to MAX\_RESOLUTION, step: 16 | The height of the output video frames in pixels. Must be divisible by 16. (default: 480)                                                                                                |
| `length`             | INT                  | No       | 1 to MAX\_RESOLUTION, step: 4   | The total number of frames in the video sequence. Must be one more than a multiple of 4 (1, 5, 9, ...). (default: 33)                                                                  |
| `batch_size`         | INT                  | No       | 1 to 4096                       | The number of video sequences to generate in a single batch. (default: 1)                                                                                                               |
| `start_image`        | IMAGE                | No       | -                               | An optional starting image to initialize the video generation. If provided, it is encoded and used to condition the first frames. Only the first `length` frames of the image are used. |
| `clip_vision_output` | CLIP\_VISION\_OUTPUT | No       | -                               | Optional CLIP vision embeddings to provide additional visual conditioning for the generation.                                                                                           |
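The Range column implies a grid of valid values: `width` and `height` start at 16 and advance in steps of 16, while `length` starts at 1 and advances in steps of 4 (1, 5, 9, ..., which includes the default of 33). The following is a minimal sketch of how a caller might snap arbitrary values onto that grid before wiring them into the node. The helper names `snap_to_grid` and `normalize_inputs` are hypothetical and not part of ComfyUI, and the upper bound `MAX_RESOLUTION` is not modeled here.

```python
from __future__ import annotations


def snap_to_grid(value: int, minimum: int, step: int) -> int:
    """Round value down to the nearest minimum + k * step, never below minimum.

    Hypothetical helper for illustration; not a ComfyUI function.
    """
    if value <= minimum:
        return minimum
    return minimum + ((value - minimum) // step) * step


def normalize_inputs(width: int, height: int, length: int) -> tuple[int, int, int]:
    """Snap width, height, and length onto the grids listed in the inputs table."""
    return (
        snap_to_grid(width, 16, 16),   # width: minimum 16, step 16
        snap_to_grid(height, 16, 16),  # height: minimum 16, step 16
        snap_to_grid(length, 1, 4),    # length: minimum 1, step 4
    )


if __name__ == "__main__":
    print(normalize_inputs(848, 480, 33))  # the documented defaults are already valid
    print(normalize_inputs(850, 487, 32))  # off-grid values are rounded down
```

Rounding down rather than to the nearest value is a design choice for this sketch: it never produces a frame size or frame count larger than what was requested.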

**Note:** When a `start_image` is provided, it is automatically resized to match the specified `width` and `height` using bilinear interpolation. The first `length` frames of the image batch are used. The encoded image is then added to both the `positive` and `negative` conditioning as a `concat_latent_image` with a corresponding `concat_mask`. The mask is set to 0.0 for the frames covered by the starting image and 1.0 for the remaining frames.
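The masking rule in the note can be illustrated with a simplified, frame-level sketch. It assumes one mask value per video frame and uses plain Python lists; the actual node builds the mask as a tensor alongside the encoded latent, which this sketch does not attempt to reproduce. The function name `build_concat_mask` is hypothetical.

```python
from __future__ import annotations


def build_concat_mask(num_start_frames: int, length: int) -> list[float]:
    """Return a per-frame mask: 0.0 where a start frame exists, 1.0 elsewhere.

    Hypothetical, frame-level simplification of the rule described in the note.
    """
    # Only the first `length` frames of the start image batch are used.
    covered = min(max(num_start_frames, 0), length)
    return [0.0] * covered + [1.0] * (length - covered)


if __name__ == "__main__":
    # A single start image in a 5-frame sequence covers only the first frame.
    print(build_concat_mask(num_start_frames=1, length=5))
```

With a single start image, only the first entry is 0.0, so the remaining frames are left for the model to generate.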

## Outputs

| Output Name | Data Type    | Description                                                                                                      |
| ----------- | ------------ | ---------------------------------------------------------------------------------------------------------------- |
| `positive`  | CONDITIONING | The modified positive conditioning, which may now include the encoded starting image or CLIP vision output.      |
| `negative`  | CONDITIONING | The modified negative conditioning, which may now include the encoded starting image or CLIP vision output.      |
| `latent`    | LATENT       | An empty latent tensor with dimensions configured for the specified batch size, video length, width, and height. |
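As a rough illustration of how the `latent` dimensions could follow from the inputs, the sketch below computes a shape tuple. It rests on assumptions that this page does not state: a spatial downsampling factor of 16 and a temporal factor of 4 (both inferred from the step sizes of `width`, `height`, and `length`), and a placeholder channel count of 32. Check the ComfyUI source for the actual values; `empty_latent_shape` is a hypothetical name.

```python
from __future__ import annotations


def empty_latent_shape(
    batch_size: int,
    length: int,
    height: int,
    width: int,
    channels: int = 32,        # assumption: placeholder channel count
    spatial_factor: int = 16,  # assumption: inferred from the width/height step
    temporal_factor: int = 4,  # assumption: inferred from the length step
) -> tuple[int, int, int, int, int]:
    """Return (batch, channels, latent_frames, latent_height, latent_width)."""
    # The first frame gets its own latent frame; each further group of
    # `temporal_factor` frames adds one more.
    latent_frames = (length - 1) // temporal_factor + 1
    return (
        batch_size,
        channels,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
    )


if __name__ == "__main__":
    # Shape for the documented defaults, under the assumptions above.
    print(empty_latent_shape(batch_size=1, length=33, height=480, width=848))
```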

***

**Source fingerprint (SHA-256):** `383b965a2e67c3643a13991ea5969c4d31ce17e48a57a400f89974f64e4b1e04`