# WanSoundImageToVideo - ComfyUI Built-in Node Documentation

> Complete documentation for the WanSoundImageToVideo node in ComfyUI. Learn its inputs, outputs, parameters and usage.

The WanSoundImageToVideo node prepares the conditioning and video latent for generating video from an image, with optional audio conditioning. It takes positive and negative conditioning along with a VAE model, and can incorporate a reference image, audio encoder output, a control video, and a motion reference to guide the video generation process.

## Inputs

| Parameter              | Description                                                                                                                                                                                                                             | Data Type          | Required | Range                 |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | -------- | --------------------- |
| `positive`             | Positive conditioning prompts that guide what content should appear in the generated video                                                                                                                                              | CONDITIONING       | Yes      | -                     |
| `negative`             | Negative conditioning prompts that specify what content should be avoided in the generated video                                                                                                                                        | CONDITIONING       | Yes      | -                     |
| `vae`                  | VAE model used for encoding and decoding the video latent representations                                                                                                                                                               | VAE                | Yes      | -                     |
| `width`                | Width of the output video in pixels (default: 832, must be divisible by 16)                                                                                                                                                             | INT                | Yes      | 16 to MAX\_RESOLUTION |
| `height`               | Height of the output video in pixels (default: 480, must be divisible by 16)                                                                                                                                                            | INT                | Yes      | 16 to MAX\_RESOLUTION |
| `length`               | Number of frames in the generated video (default: 77, adjusted in steps of 4; the default has the form 4n + 1, so it is not itself a multiple of 4)                                                                                     | INT                | Yes      | 1 to MAX\_RESOLUTION  |
| `batch_size`           | Number of videos to generate simultaneously (default: 1)                                                                                                                                                                                | INT                | Yes      | 1 to 4096             |
| `audio_encoder_output` | Optional audio encoding that can influence the video generation based on sound characteristics. When provided, the audio features are interpolated and used to condition the video generation.                                          | AUDIOENCODEROUTPUT | No       | -                     |
| `ref_image`            | Optional reference image that provides visual guidance for the video content. The image is resized to match the specified width and height, then encoded into a latent representation.                                                  | IMAGE              | No       | -                     |
| `control_video`        | Optional control video that guides the motion and structure of the generated video. The video is resized and encoded, then used to condition the output. Only the first `length` frames are used.                                       | IMAGE              | No       | -                     |
| `ref_motion`           | Optional motion reference that provides guidance for movement patterns in the video. If the input has more than 73 frames, only the last 73 are used. If fewer than 73 frames are provided, the sequence is padded with neutral frames. | IMAGE              | No       | -                     |
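
The frame handling described for `control_video` and `ref_motion` can be sketched in plain Python. This is an illustration only, not the node's implementation: `fit_ref_motion` and `trim_control_video` are hypothetical helper names, frames are stand-in list items rather than image tensors, and placing the padding before the real frames is an assumption, since the table only states that the sequence is padded.

```python
REF_MOTION_FRAMES = 73  # frame count stated in the table above


def fit_ref_motion(frames, pad_frame, target=REF_MOTION_FRAMES):
    """Return exactly `target` frames (hypothetical helper)."""
    if len(frames) > target:
        # Too many frames: keep only the last `target` frames.
        return list(frames[-target:])
    # Too few frames: fill the gap with a neutral frame.
    # Assumption: padding goes in front so the real frames stay at the end.
    missing = target - len(frames)
    return [pad_frame] * missing + list(frames)


def trim_control_video(frames, length):
    """Keep only the first `length` frames (hypothetical helper)."""
    return list(frames[:length])


if __name__ == "__main__":
    long_clip = list(range(100))   # stand-in for 100 frames
    short_clip = list(range(10))   # stand-in for 10 frames

    print(len(fit_ref_motion(long_clip, pad_frame="neutral")))   # 73
    print(len(fit_ref_motion(short_clip, pad_frame="neutral")))  # 73
    print(len(trim_control_video(long_clip, 77)))                # 77
```

In this sketch a 100-frame motion reference keeps frames 27 through 99, while a 10-frame reference is preceded by 63 neutral frames.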

## Outputs

| Output Name | Description                                                                                                                                                                                                                   | Data Type    |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ |
| `positive`  | Processed positive conditioning that has been modified for video generation, including audio embeddings, reference latents, motion references, and control video conditioning                                                 | CONDITIONING |
| `negative`  | Processed negative conditioning that has been modified for video generation, including audio embeddings (set to zero), reference latents, motion references, and control video conditioning                                   | CONDITIONING |
| `latent`    | Video latent for sampling, which can then be decoded into final video frames. The latent tensor has shape `[batch_size, 16, latent_t, height/8, width/8]`, where `latent_t` is derived from the `length` parameter           | LATENT       |
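
As a rough illustration of the latent shape, the sketch below computes the dimensions from the node inputs. The formula for `latent_t` is an assumption (one latent frame for the first video frame plus one for every four frames after it, a common convention for video VAEs with 4x temporal compression); the table only states that it is derived from `length`. `latent_shape` is a hypothetical helper, not part of ComfyUI.

```python
def latent_shape(width, height, length, batch_size=1):
    """Estimate the latent tensor shape (hypothetical helper)."""
    # Assumption: 4x temporal compression, with the first frame kept on its own.
    latent_t = (length - 1) // 4 + 1
    # 16 channels and 8x spatial downscaling, as given in the Outputs table.
    return (batch_size, 16, latent_t, height // 8, width // 8)


if __name__ == "__main__":
    # Default node settings: 832x480, 77 frames, batch size 1.
    print(latent_shape(832, 480, 77))  # (1, 16, 20, 60, 104)
```

Under that assumption, the default settings give 20 latent frames at a spatial size of 60 by 104.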
