Inputs
| Parameter | Description | Data Type | Required | Range |
|---|---|---|---|---|
| `mode` | The audio input mode. "single_speaker" uses one audio input. "two_speakers" enables inputs for a second speaker and corresponding masks. | COMBO | Yes | `"single_speaker"`, `"two_speakers"` |
| `model` | The base video diffusion model. | MODEL | Yes | - |
| `model_patch` | The model patch containing audio projection layers. | MODELPATCH | Yes | - |
| `positive` | The positive conditioning to guide the generation. | CONDITIONING | Yes | - |
| `negative` | The negative conditioning to guide the generation. | CONDITIONING | Yes | - |
| `vae` | The VAE used to encode images into the latent space and decode them back. | VAE | Yes | - |
| `width` | The width of the output video in pixels. Must be divisible by 16. (default: 832) | INT | No | 16 - MAX_RESOLUTION |
| `height` | The height of the output video in pixels. Must be divisible by 16. (default: 480) | INT | No | 16 - MAX_RESOLUTION |
| `length` | The number of frames to generate. (default: 81) | INT | No | 1 - MAX_RESOLUTION |
| `clip_vision_output` | Optional CLIP vision output for additional conditioning. | CLIPVISIONOUTPUT | No | - |
| `start_image` | An optional starting image to initialize the video sequence. | IMAGE | No | - |
| `audio_encoder_output_1` | The primary audio encoder output containing features for the first speaker. | AUDIOENCODEROUTPUT | Yes | - |
| `motion_frame_count` | Number of previous frames to use as motion context when extending a sequence. (default: 9) | INT | No | 1 - 33 |
| `audio_scale` | A scaling factor applied to the audio conditioning. (default: 1.0) | FLOAT | No | -10.0 - 10.0 |
| `previous_frames` | Optional previous video frames to extend from. | IMAGE | No | - |
| `audio_encoder_output_2` | The second audio encoder output. Required when `mode` is set to "two_speakers". | AUDIOENCODEROUTPUT | No | - |
| `mask_1` | Mask for the first speaker. Required when using two audio inputs. | MASK | No | - |
| `mask_2` | Mask for the second speaker. Required when using two audio inputs. | MASK | No | - |

Parameter constraints:

- When `mode` is set to `"two_speakers"`, the parameters `audio_encoder_output_2`, `mask_1`, and `mask_2` become required.
- If `audio_encoder_output_2` is provided, both `mask_1` and `mask_2` must also be provided.
- If `mask_1` and `mask_2` are provided, `audio_encoder_output_2` must also be provided.
- If `previous_frames` is provided, it must contain at least as many frames as specified by `motion_frame_count`.

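
The constraints above can be checked before the node is queued. The following is a minimal sketch, not the node's own validation code: the function `check_inputs` and its boolean `has_*` flags are hypothetical names introduced here for illustration, and the checks simply restate the rules and ranges listed in this section.

```python
from typing import List, Optional


def check_inputs(
    mode: str,
    width: int,
    height: int,
    motion_frame_count: int,
    has_audio_2: bool = False,
    has_mask_1: bool = False,
    has_mask_2: bool = False,
    previous_frame_count: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: return a list of problems (empty means valid)."""
    problems = []

    if mode not in ("single_speaker", "two_speakers"):
        problems.append("mode must be 'single_speaker' or 'two_speakers'")

    # width and height must be divisible by 16
    if width % 16 != 0 or height % 16 != 0:
        problems.append("width and height must be divisible by 16")

    # documented range for motion_frame_count is 1 - 33
    if not 1 <= motion_frame_count <= 33:
        problems.append("motion_frame_count must be between 1 and 33")

    # two_speakers mode needs the second audio input and both masks
    if mode == "two_speakers" and not (has_audio_2 and has_mask_1 and has_mask_2):
        problems.append("two_speakers requires audio_encoder_output_2, mask_1 and mask_2")

    # a second audio input needs both masks
    if has_audio_2 and not (has_mask_1 and has_mask_2):
        problems.append("audio_encoder_output_2 requires mask_1 and mask_2")

    # both masks need a second audio input
    if has_mask_1 and has_mask_2 and not has_audio_2:
        problems.append("mask_1 and mask_2 require audio_encoder_output_2")

    # previous_frames must cover the motion context
    if previous_frame_count is not None and previous_frame_count < motion_frame_count:
        problems.append("previous_frames has fewer frames than motion_frame_count")

    return problems
```

For example, `check_inputs("single_speaker", 832, 480, 9)` returns an empty list, while passing `previous_frame_count=5` with the default `motion_frame_count` of 9 reports a problem.
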
Outputs
| Output Name | Description | Data Type |
|---|---|---|
| `model` | The patched model with audio conditioning applied. | MODEL |
| `positive` | The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision). | CONDITIONING |
| `negative` | The negative conditioning, potentially modified with additional context. | CONDITIONING |
| `latent` | The generated video sequence in latent space. | LATENT |
| `trim_image` | The number of frames from the start of the motion context that should be trimmed when extending a sequence. | INT |
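
One way the `trim_image` output might be used when stitching an extended sequence is sketched below. This assumes the frames to trim are the leading frames of the newly decoded chunk that overlap the motion context taken from `previous_frames`; that interpretation, the helper name `append_extension`, and the use of plain lists in place of image batches are all assumptions made for illustration, not the node's documented behavior.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def append_extension(
    existing_frames: Sequence[Frame],
    new_frames: Sequence[Frame],
    trim_image: int,
) -> List[Frame]:
    """Hypothetical helper: drop the first `trim_image` frames of the new
    chunk (assumed to overlap the motion context), then append the rest."""
    if trim_image < 0 or trim_image > len(new_frames):
        raise ValueError("trim_image must be between 0 and len(new_frames)")
    return list(existing_frames) + list(new_frames[trim_image:])
```

With `trim_image` equal to 0 the new chunk is appended unchanged, which matches the case where no earlier frames were supplied.
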