# WanInfiniteTalkToVideo - ComfyUI Built-in Node Documentation

> Complete documentation for the WanInfiniteTalkToVideo node in ComfyUI. Learn its inputs, outputs, parameters and usage.

The WanInfiniteTalkToVideo node generates video sequences from audio input. It uses a video diffusion model, conditioned on audio features extracted from one or two speakers, to produce a latent representation of a talking head video. The node can generate a new sequence or extend an existing one using previous frames for motion context.

## Inputs

| Parameter                | Data Type          | Required | Range                                    | Description                                                                                                                                  |
| ------------------------ | ------------------ | -------- | ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `mode`                   | COMBO              | Yes      | `"single_speaker"`<br />`"two_speakers"` | The audio input mode. `"single_speaker"` uses one audio input. `"two_speakers"` enables inputs for a second speaker and corresponding masks. |
| `model`                  | MODEL              | Yes      | -                                        | The base video diffusion model.                                                                                                              |
| `model_patch`            | MODELPATCH         | Yes      | -                                        | The model patch containing audio projection layers.                                                                                          |
| `positive`               | CONDITIONING       | Yes      | -                                        | The positive conditioning to guide the generation.                                                                                           |
| `negative`               | CONDITIONING       | Yes      | -                                        | The negative conditioning to guide the generation.                                                                                           |
| `vae`                    | VAE                | Yes      | -                                        | The VAE used to encode images into the latent space and decode latents back to images.                                                       |
| `width`                  | INT                | No       | 16 - MAX\_RESOLUTION                     | The width of the output video in pixels. Must be divisible by 16. (default: 832)                                                             |
| `height`                 | INT                | No       | 16 - MAX\_RESOLUTION                     | The height of the output video in pixels. Must be divisible by 16. (default: 480)                                                            |
| `length`                 | INT                | No       | 1 - MAX\_RESOLUTION                      | The number of frames to generate. (default: 81)                                                                                              |
| `clip_vision_output`     | CLIPVISIONOUTPUT   | No       | -                                        | Optional CLIP vision output for additional conditioning.                                                                                     |
| `start_image`            | IMAGE              | No       | -                                        | An optional starting image to initialize the video sequence.                                                                                 |
| `audio_encoder_output_1` | AUDIOENCODEROUTPUT | Yes      | -                                        | The primary audio encoder output containing features for the first speaker.                                                                  |
| `motion_frame_count`     | INT                | No       | 1 - 33                                   | Number of previous frames to use as motion context when extending a sequence. (default: 9)                                                   |
| `audio_scale`            | FLOAT              | No       | -10.0 - 10.0                             | A scaling factor applied to the audio conditioning. (default: 1.0)                                                                           |
| `previous_frames`        | IMAGE              | No       | -                                        | Optional previous video frames to extend from.                                                                                               |
| `audio_encoder_output_2` | AUDIOENCODEROUTPUT | No       | -                                        | The second audio encoder output. Required when `mode` is set to `"two_speakers"`.                                                            |
| `mask_1`                 | MASK               | No       | -                                        | Mask for the first speaker. Required when `audio_encoder_output_2` is provided.                                                              |
| `mask_2`                 | MASK               | No       | -                                        | Mask for the second speaker. Required when `audio_encoder_output_2` is provided.                                                             |
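
Because `width` and `height` must be divisible by 16, it can help to snap arbitrary dimensions to a valid value before wiring them into the node. The helper below is a minimal sketch; `snap_to_multiple` is a hypothetical name and not part of ComfyUI, and rounding down is only one reasonable choice.

```python
def snap_to_multiple(value: int, multiple: int = 16) -> int:
    """Round value down to the nearest multiple, never going below one multiple.

    Hypothetical helper for illustration; not a ComfyUI function.
    """
    return max(multiple, (value // multiple) * multiple)


width = snap_to_multiple(850)   # 848
height = snap_to_multiple(480)  # 480 (already divisible by 16)
```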

**Parameter Constraints:**

* When `mode` is set to `"two_speakers"`, the parameters `audio_encoder_output_2`, `mask_1`, and `mask_2` become required.
* If `audio_encoder_output_2` is provided, both `mask_1` and `mask_2` must also be provided.
* If `mask_1` and `mask_2` are provided, `audio_encoder_output_2` must also be provided.
* If `previous_frames` is provided, it must contain at least as many frames as specified by `motion_frame_count`.
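
The constraints above can be expressed as a small pre-flight check. This is a sketch only: `check_infinite_talk_inputs` and its boolean arguments are hypothetical names chosen for illustration, and the node's actual validation logic and error messages may differ.

```python
from typing import List, Optional


def check_infinite_talk_inputs(
    mode: str,
    has_audio_2: bool,
    has_mask_1: bool,
    has_mask_2: bool,
    previous_frame_count: Optional[int] = None,
    motion_frame_count: int = 9,
) -> List[str]:
    """Return a list of constraint violations; an empty list means the inputs are consistent.

    Hypothetical helper mirroring the documented constraints, not the node's own code.
    """
    problems = []
    has_both_masks = has_mask_1 and has_mask_2

    # two_speakers mode requires the second audio input and both masks.
    if mode == "two_speakers" and not (has_audio_2 and has_both_masks):
        problems.append("two_speakers requires audio_encoder_output_2, mask_1 and mask_2")

    # A second audio input must come with both masks.
    if has_audio_2 and not has_both_masks:
        problems.append("audio_encoder_output_2 requires mask_1 and mask_2")

    # Both masks must come with a second audio input.
    if has_both_masks and not has_audio_2:
        problems.append("mask_1 and mask_2 require audio_encoder_output_2")

    # previous_frames must supply at least motion_frame_count frames.
    if previous_frame_count is not None and previous_frame_count < motion_frame_count:
        problems.append("previous_frames has fewer frames than motion_frame_count")

    return problems
```

For example, `check_infinite_talk_inputs("single_speaker", False, False, False)` returns an empty list, while passing `previous_frame_count=5` with the default `motion_frame_count` of 9 reports one violation.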

## Outputs

| Output Name  | Data Type    | Description                                                                                                 |
| ------------ | ------------ | ----------------------------------------------------------------------------------------------------------- |
| `model`      | MODEL        | The patched model with audio conditioning applied.                                                          |
| `positive`   | CONDITIONING | The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision).   |
| `negative`   | CONDITIONING | The negative conditioning, potentially modified with additional context.                                    |
| `latent`     | LATENT       | The generated video sequence in latent space.                                                               |
| `trim_image` | INT          | The number of frames from the start of the motion context that should be trimmed when extending a sequence. |
