Inputs
| Parameter | Description | Data Type | Required | Range |
|---|---|---|---|---|
| clip | The CLIP model used for tokenization and encoding | CLIP | Yes | - |
| clip_vision_output | The visual embeddings from a CLIP vision model that provide image context | CLIP_VISION_OUTPUT | Yes | - |
| prompt | The text description that guides the video generation. Supports multiline input and dynamic prompts. The prompt is formatted with a template that asks the model to describe the video based on the reference image, covering the main content, object details, actions, background, and camera angles. | STRING | Yes | - |
| image_interleave | Controls the balance between the image and the text prompt. Higher values give the text prompt more influence (default: 2). | INT | Yes | 1-512 |
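
The inputs above can be sketched as a plain Python mapping, which may help when assembling a workflow programmatically. This is a minimal illustration, not the node's implementation: the helper `build_encode_inputs` is hypothetical, and the `["node_id", output_index]` link values assume an API-style workflow layout in which upstream outputs are referenced by node ID and output slot. Only the parameter names, the default of 2, and the 1-512 range come from the table.

```python
def build_encode_inputs(clip_link, clip_vision_output_link, prompt, image_interleave=2):
    """Build a hypothetical inputs mapping, enforcing the documented types and range."""
    if not isinstance(prompt, str):
        raise TypeError("prompt must be a STRING")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(image_interleave, bool) or not isinstance(image_interleave, int):
        raise TypeError("image_interleave must be an INT")
    if not 1 <= image_interleave <= 512:
        raise ValueError("image_interleave must be between 1 and 512")
    return {
        "clip": clip_link,                              # assumed link to a CLIP loader output
        "clip_vision_output": clip_vision_output_link,  # assumed link to a CLIP vision encode output
        "prompt": prompt,
        "image_interleave": image_interleave,
    }


# Node IDs "1" and "2" are placeholders for upstream nodes in a workflow.
inputs = build_encode_inputs(
    ["1", 0],
    ["2", 0],
    "A cat walks across a sunlit room.",
)
```

Leaving `image_interleave` unset keeps the documented default; a larger value shifts influence toward the text prompt, as described in the table.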
Outputs
| Output Name | Description | Data Type |
|---|---|---|
| CONDITIONING | The conditioning data that combines text and image information for video generation | CONDITIONING |