The WanInfiniteTalkToVideo node generates video sequences from audio input. It uses a video diffusion model, conditioned on audio features extracted from one or two speakers, to produce a latent representation of a talking head video. The node can generate a new sequence or extend an existing one using previous frames for motion context.

Inputs

| Parameter | Description | Data Type | Required | Range |
|---|---|---|---|---|
| mode | The audio input mode. "single_speaker" uses one audio input. "two_speakers" enables inputs for a second speaker and corresponding masks. | COMBO | Yes | "single_speaker", "two_speakers" |
| model | The base video diffusion model. | MODEL | Yes | - |
| model_patch | The model patch containing audio projection layers. | MODELPATCH | Yes | - |
| positive | The positive conditioning to guide the generation. | CONDITIONING | Yes | - |
| negative | The negative conditioning to guide the generation. | CONDITIONING | Yes | - |
| vae | The VAE used to convert images to and from latent space. | VAE | Yes | - |
| width | The width of the output video in pixels. Must be divisible by 16. (default: 832) | INT | No | 16 - MAX_RESOLUTION |
| height | The height of the output video in pixels. Must be divisible by 16. (default: 480) | INT | No | 16 - MAX_RESOLUTION |
| length | The number of frames to generate. (default: 81) | INT | No | 1 - MAX_RESOLUTION |
| clip_vision_output | Optional CLIP vision output for additional conditioning. | CLIPVISIONOUTPUT | No | - |
| start_image | An optional starting image to initialize the video sequence. | IMAGE | No | - |
| audio_encoder_output_1 | The primary audio encoder output containing features for the first speaker. | AUDIOENCODEROUTPUT | Yes | - |
| motion_frame_count | Number of previous frames to use as motion context when extending a sequence. (default: 9) | INT | No | 1 - 33 |
| audio_scale | A scaling factor applied to the audio conditioning. (default: 1.0) | FLOAT | No | -10.0 - 10.0 |
| previous_frames | Optional previous video frames to extend from. | IMAGE | No | - |
| audio_encoder_output_2 | The second audio encoder output. Required when mode is set to "two_speakers". | AUDIOENCODEROUTPUT | No | - |
| mask_1 | Mask for the first speaker. Required when two audio inputs are used. | MASK | No | - |
| mask_2 | Mask for the second speaker. Required when two audio inputs are used. | MASK | No | - |

Parameter Constraints:
  • When mode is set to "two_speakers", the parameters audio_encoder_output_2, mask_1, and mask_2 become required.
  • If audio_encoder_output_2 is provided, both mask_1 and mask_2 must also be provided.
  • If mask_1 and mask_2 are provided, audio_encoder_output_2 must also be provided.
  • If previous_frames is provided, it must contain at least as many frames as specified by motion_frame_count.
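
The constraints above can be sketched as a small pre-flight check. This is an illustrative helper, not part of the node: the function name `check_inputs` and its boolean and count arguments (`has_audio_2`, `has_mask_1`, `has_mask_2`, `previous_frame_count`) are hypothetical stand-ins for the real tensors and encoder outputs, and the upper bound MAX_RESOLUTION is not checked because its value is not given here.

```python
from typing import List, Optional

MODES = ("single_speaker", "two_speakers")


def check_inputs(
    mode: str,
    width: int = 832,
    height: int = 480,
    length: int = 81,
    motion_frame_count: int = 9,
    audio_scale: float = 1.0,
    has_audio_2: bool = False,           # hypothetical: audio_encoder_output_2 is connected
    has_mask_1: bool = False,            # hypothetical: mask_1 is connected
    has_mask_2: bool = False,            # hypothetical: mask_2 is connected
    previous_frame_count: Optional[int] = None,  # hypothetical: frames in previous_frames
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: List[str] = []

    if mode not in MODES:
        errors.append(f"mode must be one of {MODES}")

    # width and height: at least 16 and divisible by 16
    for name, value in (("width", width), ("height", height)):
        if value < 16 or value % 16 != 0:
            errors.append(f"{name} must be >= 16 and divisible by 16")

    if length < 1:
        errors.append("length must be >= 1")
    if not 1 <= motion_frame_count <= 33:
        errors.append("motion_frame_count must be in 1 - 33")
    if not -10.0 <= audio_scale <= 10.0:
        errors.append("audio_scale must be in -10.0 - 10.0")

    # two_speakers mode requires the second audio input and both masks
    if mode == "two_speakers" and not (has_audio_2 and has_mask_1 and has_mask_2):
        errors.append("two_speakers requires audio_encoder_output_2, mask_1 and mask_2")

    # a second audio input needs both masks, and both masks need a second audio input
    if has_audio_2 and not (has_mask_1 and has_mask_2):
        errors.append("audio_encoder_output_2 requires mask_1 and mask_2")
    if has_mask_1 and has_mask_2 and not has_audio_2:
        errors.append("mask_1 and mask_2 require audio_encoder_output_2")

    # previous_frames must supply enough frames for the motion context
    if previous_frame_count is not None and previous_frame_count < motion_frame_count:
        errors.append("previous_frames must contain at least motion_frame_count frames")

    return errors
```

Calling `check_inputs("single_speaker")` with the defaults returns an empty list, while `check_inputs("two_speakers")` reports the missing second-speaker inputs.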

Outputs

| Output Name | Description | Data Type |
|---|---|---|
| model | The patched model with audio conditioning applied. | MODEL |
| positive | The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision). | CONDITIONING |
| negative | The negative conditioning, potentially modified with additional context. | CONDITIONING |
| latent | The generated video sequence in latent space. | LATENT |
| trim_image | The number of frames from the start of the motion context that should be trimmed when extending a sequence. | INT |
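
To illustrate how motion_frame_count and trim_image might fit together when extending a sequence, here is a minimal sketch using plain Python lists in place of image tensors. It rests on two assumptions that the description above does not confirm: that the motion context is taken from the end of previous_frames, and that the newly decoded segment begins with those overlapping frames, so trim_image frames are dropped from its start before joining. The helper names `take_motion_context` and `drop_overlap` are hypothetical.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def take_motion_context(previous_frames: Sequence[Frame],
                        motion_frame_count: int = 9) -> List[Frame]:
    """Assumed behaviour: use the last motion_frame_count frames as context."""
    if len(previous_frames) < motion_frame_count:
        raise ValueError("previous_frames has fewer frames than motion_frame_count")
    return list(previous_frames[-motion_frame_count:])


def drop_overlap(decoded_frames: Sequence[Frame], trim_image: int) -> List[Frame]:
    """Assumed behaviour: discard the first trim_image frames of a decoded segment."""
    return list(decoded_frames[trim_image:])


# Stand-in data: integers play the role of frames.
previous = list(range(81))                     # an earlier 81-frame clip
context = take_motion_context(previous, 9)     # frames 72..80

# Pretend the next segment decodes to the context followed by new frames.
decoded_next = context + [100, 101, 102]
new_only = drop_overlap(decoded_next, trim_image=9)

# Joining the earlier clip with only the new frames avoids duplicated overlap.
extended = previous + new_only
```

In a real workflow the frames would be image batches rather than lists, but the slicing idea is the same.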

Source fingerprint (SHA-256): 1ef125235ce5adb09972737d0e2863255315c536da718c7af230de1b4a7f53e2