WanInfiniteTalkToVideo - ComfyUI Built-in Node Documentation

The WanInfiniteTalkToVideo node generates video sequences from audio input. It uses a video diffusion model, conditioned on audio features extracted from one or two speakers, to produce a latent representation of a talking-head video. The node can either generate a new sequence or extend an existing one, using frames from the existing video as motion context.

Inputs

Parameter	Description	Data Type	Required	Range
`mode`	The audio input mode. `"single_speaker"` uses one audio input. `"two_speakers"` enables inputs for a second speaker and corresponding masks.	COMBO	Yes	`"single_speaker"` `"two_speakers"`
`model`	The base video diffusion model.	MODEL	Yes	-
`model_patch`	The model patch containing audio projection layers.	MODELPATCH	Yes	-
`positive`	The positive conditioning to guide the generation.	CONDITIONING	Yes	-
`negative`	The negative conditioning to guide the generation.	CONDITIONING	Yes	-
`vae`	The VAE used to encode images into the latent space and decode latents back into images.	VAE	Yes	-
`width`	The width of the output video in pixels. Must be divisible by 16. (default: 832)	INT	No	16 - MAX_RESOLUTION
`height`	The height of the output video in pixels. Must be divisible by 16. (default: 480)	INT	No	16 - MAX_RESOLUTION
`length`	The number of frames to generate. (default: 81)	INT	No	1 - MAX_RESOLUTION
`clip_vision_output`	Optional CLIP vision output for additional conditioning.	CLIPVISIONOUTPUT	No	-
`start_image`	An optional starting image to initialize the video sequence.	IMAGE	No	-
`audio_encoder_output_1`	The primary audio encoder output containing features for the first speaker.	AUDIOENCODEROUTPUT	Yes	-
`motion_frame_count`	Number of previous frames to use as motion context when extending a sequence. (default: 9)	INT	No	1 - 33
`audio_scale`	A scaling factor applied to the audio conditioning. (default: 1.0)	FLOAT	No	-10.0 - 10.0
`previous_frames`	Optional previous video frames to extend from.	IMAGE	No	-
`audio_encoder_output_2`	The second audio encoder output. Required when `mode` is set to `"two_speakers"`.	AUDIOENCODEROUTPUT	No	-
`mask_1`	Mask for the first speaker. Required when `mode` is set to `"two_speakers"`.	MASK	No	-
`mask_2`	Mask for the second speaker. Required when `mode` is set to `"two_speakers"`.	MASK	No	-

Parameter Constraints:

When `mode` is set to `"two_speakers"`, the parameters `audio_encoder_output_2`, `mask_1`, and `mask_2` become required.
If `audio_encoder_output_2` is provided, both `mask_1` and `mask_2` must also be provided.
If `mask_1` and `mask_2` are provided, `audio_encoder_output_2` must also be provided.
If `previous_frames` is provided, it must contain at least as many frames as specified by `motion_frame_count`.
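
The constraints above can be checked before queuing a workflow. The following is a minimal sketch, not the node's own validation code: the function name `check_inputs`, its boolean `has_*` flags, and `previous_frame_total` are hypothetical stand-ins for whatever is actually connected to the node. The divisibility check on `width` and `height` comes from the Inputs table.

```python
from typing import List, Optional

VALID_MODES = ("single_speaker", "two_speakers")


def check_inputs(
    mode: str,
    width: int = 832,
    height: int = 480,
    motion_frame_count: int = 9,
    previous_frame_total: Optional[int] = None,  # None means previous_frames is not connected
    has_audio_2: bool = False,
    has_mask_1: bool = False,
    has_mask_2: bool = False,
) -> List[str]:
    """Return a list of constraint violations; an empty list means the inputs look valid."""
    problems: List[str] = []

    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")

    # Width and height must each be divisible by 16.
    for name, value in (("width", width), ("height", height)):
        if value % 16 != 0:
            problems.append(f"{name} must be divisible by 16, got {value}")

    if mode == "two_speakers":
        # Two-speaker mode needs the second audio input and both masks.
        optional_inputs = (
            ("audio_encoder_output_2", has_audio_2),
            ("mask_1", has_mask_1),
            ("mask_2", has_mask_2),
        )
        missing = [name for name, present in optional_inputs if not present]
        if missing:
            problems.append("two_speakers mode requires: " + ", ".join(missing))
    else:
        # The second audio input and the pair of masks must be supplied together.
        if has_audio_2 and not (has_mask_1 and has_mask_2):
            problems.append("audio_encoder_output_2 requires both mask_1 and mask_2")
        if has_mask_1 and has_mask_2 and not has_audio_2:
            problems.append("mask_1 and mask_2 require audio_encoder_output_2")

    # previous_frames must hold at least motion_frame_count frames.
    if previous_frame_total is not None and previous_frame_total < motion_frame_count:
        problems.append(
            f"previous_frames has {previous_frame_total} frames, "
            f"but motion_frame_count is {motion_frame_count}"
        )

    return problems
```

Calling `check_inputs("two_speakers", has_audio_2=True, has_mask_1=True)` returns a single message naming `mask_2` as missing, while `check_inputs("single_speaker")` returns an empty list.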

Outputs

Output Name	Description	Data Type
`model`	The patched model with audio conditioning applied.	MODEL
`positive`	The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision).	CONDITIONING
`negative`	The negative conditioning, potentially modified with additional context.	CONDITIONING
`latent`	The generated video sequence in latent space.	LATENT
`trim_image`	The number of frames from the start of the motion context that should be trimmed when extending a sequence.	INT
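
One plausible use of `trim_image` when extending a sequence is to drop that many leading frames from the newly decoded chunk before appending it to the existing video, so the overlapping motion-context frames are not duplicated. The sketch below illustrates only that reading. The helper `append_chunk` is hypothetical, and frames are modeled as a plain Python list rather than an IMAGE tensor.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def append_chunk(
    existing: Sequence[Frame],
    new_chunk: Sequence[Frame],
    trim_image: int,
) -> List[Frame]:
    """Append new_chunk to existing, skipping its first trim_image frames.

    Assumption: the first trim_image frames of new_chunk repeat the motion
    context taken from the end of existing, so they are dropped here.
    """
    if not 0 <= trim_image <= len(new_chunk):
        raise ValueError("trim_image must be between 0 and len(new_chunk)")
    return list(existing) + list(new_chunk[trim_image:])


# Toy example: the new chunk starts with two frames of repeated context.
video = ["f0", "f1", "f2"]
chunk = ["f1", "f2", "f3", "f4"]
extended = append_chunk(video, chunk, trim_image=2)
```

With a `trim_image` of 0, which is what a freshly generated sequence without `previous_frames` would presumably use, the chunk is appended unchanged.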

