# WanInfiniteTalkToVideo - ComfyUI Built-in Node Documentation

> Complete documentation for the WanInfiniteTalkToVideo node in ComfyUI. Learn its inputs, outputs, parameters and usage.

The WanInfiniteTalkToVideo node generates video sequences from audio input. It uses a video diffusion model, conditioned on audio features extracted from one or two speakers, to produce a latent representation of a talking head video. The node can generate a new sequence or extend an existing one using previous frames for motion context.

## Inputs

| Parameter                | Description                                                                                                                                  | Data Type          | Required | Range                                    |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | -------- | ---------------------------------------- |
| `mode`                   | The audio input mode. `"single_speaker"` uses one audio input. `"two_speakers"` enables inputs for a second speaker and corresponding masks. | COMBO              | Yes      | `"single_speaker"`<br />`"two_speakers"` |
| `model`                  | The base video diffusion model.                                                                                                              | MODEL              | Yes      | -                                        |
| `model_patch`            | The model patch containing audio projection layers.                                                                                          | MODELPATCH         | Yes      | -                                        |
| `positive`               | The positive conditioning to guide the generation.                                                                                           | CONDITIONING       | Yes      | -                                        |
| `negative`               | The negative conditioning to guide the generation.                                                                                           | CONDITIONING       | Yes      | -                                        |
| `vae`                    | The VAE used to encode images into the latent space and decode them back.                                                                    | VAE                | Yes      | -                                        |
| `width`                  | The width of the output video in pixels. Must be divisible by 16. (default: 832)                                                             | INT                | No       | 16 - MAX\_RESOLUTION                     |
| `height`                 | The height of the output video in pixels. Must be divisible by 16. (default: 480)                                                            | INT                | No       | 16 - MAX\_RESOLUTION                     |
| `length`                 | The number of frames to generate. (default: 81)                                                                                              | INT                | No       | 1 - MAX\_RESOLUTION                      |
| `clip_vision_output`     | Optional CLIP vision output for additional conditioning.                                                                                     | CLIPVISIONOUTPUT   | No       | -                                        |
| `start_image`            | An optional starting image to initialize the video sequence.                                                                                 | IMAGE              | No       | -                                        |
| `audio_encoder_output_1` | The primary audio encoder output containing features for the first speaker.                                                                  | AUDIOENCODEROUTPUT | Yes      | -                                        |
| `motion_frame_count`     | Number of previous frames to use as motion context when extending a sequence. (default: 9)                                                   | INT                | No       | 1 - 33                                   |
| `audio_scale`            | A scaling factor applied to the audio conditioning. (default: 1.0)                                                                           | FLOAT              | No       | -10.0 - 10.0                             |
| `previous_frames`        | Optional previous video frames to extend from.                                                                                               | IMAGE              | No       | -                                        |
| `audio_encoder_output_2` | The second audio encoder output. Required when `mode` is set to `"two_speakers"`.                                                            | AUDIOENCODEROUTPUT | No       | -                                        |
| `mask_1`                 | Mask for the first speaker, required if using two audio inputs.                                                                              | MASK               | No       | -                                        |
| `mask_2`                 | Mask for the second speaker, required if using two audio inputs.                                                                             | MASK               | No       | -                                        |

**Parameter Constraints:**

* When `mode` is set to `"two_speakers"`, the parameters `audio_encoder_output_2`, `mask_1`, and `mask_2` become required.
* If `audio_encoder_output_2` is provided, both `mask_1` and `mask_2` must also be provided.
* If `mask_1` and `mask_2` are provided, `audio_encoder_output_2` must also be provided.
* If `previous_frames` is provided, it must contain at least as many frames as specified by `motion_frame_count`.
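
The constraints above can be sketched as a small pre-flight check. This is an illustrative helper, not the node's actual validation code: the function name `check_infinite_talk_inputs` and its boolean `has_*` flags are hypothetical, and the node itself may report violations differently.

```python
from typing import List, Optional

MODES = ("single_speaker", "two_speakers")


def check_infinite_talk_inputs(
    mode: str,
    width: int = 832,
    height: int = 480,
    motion_frame_count: int = 9,
    previous_frame_count: Optional[int] = None,
    has_audio_2: bool = False,
    has_mask_1: bool = False,
    has_mask_2: bool = False,
) -> List[str]:
    """Hypothetical helper: return a list of violated constraints (empty if none)."""
    errors: List[str] = []

    if mode not in MODES:
        errors.append(f"mode must be one of {MODES}")

    # width and height start at 16 and must be divisible by 16.
    for name, value in (("width", width), ("height", height)):
        if value < 16 or value % 16 != 0:
            errors.append(f"{name} must be a multiple of 16 and at least 16")

    # motion_frame_count is documented with a range of 1 - 33.
    if not 1 <= motion_frame_count <= 33:
        errors.append("motion_frame_count must be between 1 and 33")

    # Two-speaker mode needs the second audio input and both masks.
    if mode == "two_speakers" and not (has_audio_2 and has_mask_1 and has_mask_2):
        errors.append("two_speakers requires audio_encoder_output_2, mask_1 and mask_2")

    # The second audio input and the two masks must be supplied together.
    if has_audio_2 and not (has_mask_1 and has_mask_2):
        errors.append("audio_encoder_output_2 requires mask_1 and mask_2")
    if has_mask_1 and has_mask_2 and not has_audio_2:
        errors.append("mask_1 and mask_2 require audio_encoder_output_2")

    # previous_frames must cover the requested motion context.
    if previous_frame_count is not None and previous_frame_count < motion_frame_count:
        errors.append("previous_frames must contain at least motion_frame_count frames")

    return errors
```

Calling `check_infinite_talk_inputs("two_speakers")` with no second audio input or masks returns a single message naming the missing inputs, while the default single-speaker call returns an empty list.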

## Outputs

| Output Name  | Description                                                                                                 | Data Type    |
| ------------ | ----------------------------------------------------------------------------------------------------------- | ------------ |
| `model`      | The patched model with audio conditioning applied.                                                          | MODEL        |
| `positive`   | The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision).   | CONDITIONING |
| `negative`   | The negative conditioning, potentially modified with additional context.                                    | CONDITIONING |
| `latent`     | The generated video sequence in latent space.                                                               | LATENT       |
| `trim_image` | The number of frames from the start of the motion context that should be trimmed when extending a sequence. | INT          |
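
As a rough illustration of how `trim_image` might be consumed downstream, the sketch below drops that many leading frames from a decoded segment before appending it to an existing sequence. It assumes the decoded frames are available as a plain Python sequence and that "trimming" means discarding frames from the start; the helper name `trim_motion_context` is hypothetical and is not part of ComfyUI.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def trim_motion_context(frames: Sequence[Frame], trim_image: int) -> List[Frame]:
    """Hypothetical helper: drop the first `trim_image` frames of a decoded segment."""
    if trim_image < 0:
        raise ValueError("trim_image must be non-negative")
    # Assumption: the leading frames overlap the motion context taken from
    # the previous segment, so they are discarded before concatenation.
    return list(frames[trim_image:])
```

For example, with a ten-frame segment and `trim_image` equal to 3, the helper keeps the last seven frames.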

