Tencent has released open-source video models for text-to-video and image-to-video generation.

Image to Video

Using Hunyuan I2V, you can transform still images into fluid, high-quality videos. Try it with this starting frame.

Drag the video directly into ComfyUI to run the workflow.

Unified Image & Video Architecture

The “Dual-stream to Single-stream” Transformer efficiently fuses text, images, and motion information, enhancing consistency, quality, and alignment across the generated video frames.
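As a rough illustration of that token routing, here is a toy data-flow sketch. The real blocks are attention and MLP layers with cross-modal interaction; the stand-in functions below (names and the two-phase loop counts are invented for illustration) only show how text and video tokens are first processed separately and then fused into one sequence.

```python
# Toy data-flow sketch of a "dual-stream to single-stream" transformer.
# Each "block" is a stand-in function, so only the routing of text vs.
# video tokens is shown, not real attention math.

def dual_stream_block(text_tokens, video_tokens):
    # Each modality gets its own processing; in the real model the two
    # streams interact through attention (omitted here).
    return ([t + "'" for t in text_tokens],
            [v + "'" for v in video_tokens])

def single_stream_block(tokens):
    # One shared block over the concatenated sequence.
    return [t + "*" for t in tokens]

text = ["T0", "T1"]
video = ["V0", "V1", "V2"]

# Dual-stream phase: modalities kept separate.
for _ in range(2):
    text, video = dual_stream_block(text, video)

# Single-stream phase: sequences fused into one.
fused = text + video
for _ in range(2):
    fused = single_stream_block(fused)

print(fused)  # every token carries updates from both phases
```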

Superior Text-Video-Image Alignment

The MLLM text encoder outperforms traditional encoders like CLIP and T5, offering better instruction following, detail capture, and complex reasoning when combined with image inputs.

Efficient Video Compression

A custom 3D VAE compresses videos into a compact latent space, preserving resolution and frame rate while reducing tokens, making Image-to-Video generation more efficient.
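To see why this compression matters, here is a back-of-the-envelope token count. The 4× temporal and 8× spatial compression factors below are assumptions for illustration (common choices for video VAEs), not figures stated on this page.

```python
# Rough sketch: how a 3D VAE shrinks a video into a latent grid.
# Compression ratios are assumed (4x temporal, 8x spatial), not
# confirmed specifics of the Hunyuan VAE.

def latent_shape(frames, height, width, t_ratio=4, s_ratio=8):
    """Return (latent_frames, latent_h, latent_w, token_count)."""
    lt = (frames - 1) // t_ratio + 1   # first frame kept, rest grouped
    lh = height // s_ratio
    lw = width // s_ratio
    return lt, lh, lw, lt * lh * lw

# A 129-frame 720p clip
lt, lh, lw, n = latent_shape(129, 720, 1280)
print(lt, lh, lw, n)
```

Under these assumed ratios, the diffusion model attends over a few hundred thousand latent positions instead of hundreds of millions of pixels, which is what makes video generation tractable.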

Requirements

Download the following models and place them in your `ComfyUI/models/` directory at the locations shown below:

├── clip_vision/
│   └── llava_llama3_vision.safetensors
├── text_encoders/
│   ├── clip_l.safetensors
│   ├── llava_llama3_fp16.safetensors
│   └── llava_llama3_fp8_scaled.safetensors
├── vae/
│   └── hunyuan_video_vae_bf16.safetensors
└── diffusion_models/
    └── hunyuan_video_image_to_video_720p_bf16.safetensors
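To check the layout above, a small shell sketch like this can create the folders and report which files are still missing. The `ComfyUI/models` path is assumed to be relative to wherever you run the script; adjust `MODELS_DIR` for your install.

```shell
#!/bin/sh
# Sketch: create the expected model folders and report whether each
# required file is in place. MODELS_DIR is an assumption; point it at
# your actual ComfyUI models directory.
MODELS_DIR="${MODELS_DIR:-ComfyUI/models}"

for f in \
  clip_vision/llava_llama3_vision.safetensors \
  text_encoders/clip_l.safetensors \
  text_encoders/llava_llama3_fp16.safetensors \
  text_encoders/llava_llama3_fp8_scaled.safetensors \
  vae/hunyuan_video_vae_bf16.safetensors \
  diffusion_models/hunyuan_video_image_to_video_720p_bf16.safetensors
do
  mkdir -p "$MODELS_DIR/$(dirname "$f")"
  if [ -f "$MODELS_DIR/$f" ]; then
    echo "OK      $f"
  else
    echo "MISSING $f"
  fi
done
```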