WanDancerVideo - ComfyUI Built-in Node Documentation

The WanDancerVideo node prepares conditioning data and an empty latent tensor for video generation with the WanDancer model. It combines positive and negative conditioning with optional inputs like a starting image, mask, CLIP vision embeddings, and audio features to control the generated video.

Inputs

Parameter	Description	Data Type	Required	Range
`positive`	The positive conditioning to guide video generation.	CONDITIONING	Yes
`negative`	The negative conditioning, describing what the generated video should avoid.	CONDITIONING	Yes
`vae`	The VAE used to encode the start image into the latent space.	VAE	Yes
`width`	The width of the generated video in pixels (default: 480).	INT	Yes	16 to MAX_RESOLUTION (step: 16)
`height`	The height of the generated video in pixels (default: 832).	INT	Yes	16 to MAX_RESOLUTION (step: 16)
`length`	The number of frames in the generated video. Keep this at 149 for WanDancer (default: 149).	INT	Yes	1 to MAX_RESOLUTION (step: 4)
`clip_vision_output`	The CLIP vision embeddings for the first frame.	CLIP_VISION_OUTPUT	No
`clip_vision_output_ref`	The CLIP vision embeddings for the reference image.	CLIP_VISION_OUTPUT	No
`start_image`	The initial image(s) to be encoded. Can be any number of frames, up to the specified `length`.	IMAGE	No
`mask`	Image conditioning mask for the start image(s). White areas are kept, black areas are generated. Use it to regenerate only part of the frame.	MASK	No
`audio_encoder_output`	The output from an audio encoder, providing audio features, fps, and inject scale for audio-conditional generation.	AUDIO_ENCODER_OUTPUT	No
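
The ranges in the table imply that `width` and `height` must be multiples of 16 and that `length` must have the form 4k + 1 (1, 5, 9, ..., 149, ...). The sketch below shows one way to snap arbitrary values onto those grids before building a workflow. The helper names are hypothetical and not part of ComfyUI. It assumes the step grid starts at each parameter's minimum, and that `MAX_RESOLUTION` is 16384 as in ComfyUI's `nodes.py` at the time of writing.

```python
# Assumption: mirrors the MAX_RESOLUTION constant in ComfyUI's nodes.py.
MAX_RESOLUTION = 16384


def snap(value: int, minimum: int, step: int, maximum: int = MAX_RESOLUTION) -> int:
    """Clamp value to [minimum, maximum], then round down to minimum + k * step."""
    value = max(minimum, min(value, maximum))
    return minimum + ((value - minimum) // step) * step


def snap_size(pixels: int) -> int:
    # width / height: 16 to MAX_RESOLUTION in steps of 16
    return snap(pixels, minimum=16, step=16)


def snap_length(frames: int) -> int:
    # length: 1 to MAX_RESOLUTION in steps of 4 (1, 5, 9, ..., 149, ...)
    return snap(frames, minimum=1, step=4)


print(snap_size(500), snap_size(832), snap_length(150))  # 496 832 149
```

Rounding down is a choice made for this sketch; rounding to the nearest valid value would work equally well.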

Note on Parameter Constraints:

The `start_image` and `mask` inputs are both optional and can be used together. When `start_image` is provided, it is encoded with the VAE and concatenated with the latent. If `mask` is also provided, it controls which parts of the start image are kept (white) and which are regenerated (black). If `mask` is omitted, the entire start image is used as a conditioning guide.
The `clip_vision_output` and `clip_vision_output_ref` inputs are optional and can be used together to provide visual context for the first frame and for a reference image, respectively.
The `audio_encoder_output` input is optional and provides the audio features for audio-conditional generation.
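
To illustrate how the optional inputs behave, here is a minimal sketch that builds an API-format workflow entry for this node, where unconnected optional inputs are left out of the `inputs` mapping. The helper `wan_dancer_video_node` and the node IDs (`"6"`, `"7"`, `"39"`, `"52"`, `"60"`) are hypothetical. It also assumes the node's `class_type` is `WanDancerVideo`, matching the title of this page.

```python
import json

# Optional input names taken from the Inputs table above.
OPTIONAL_INPUTS = {
    "clip_vision_output",
    "clip_vision_output_ref",
    "start_image",
    "mask",
    "audio_encoder_output",
}


def wan_dancer_video_node(positive, negative, vae,
                          width=480, height=832, length=149, **optional):
    """Build an API-format node entry; links are [source_node_id, output_index]."""
    unknown = set(optional) - OPTIONAL_INPUTS
    if unknown:
        raise ValueError(f"unknown optional inputs: {sorted(unknown)}")
    inputs = {
        "positive": positive,
        "negative": negative,
        "vae": vae,
        "width": width,
        "height": height,
        "length": length,
    }
    # Unconnected optional inputs are omitted rather than set to null.
    inputs.update({k: v for k, v in optional.items() if v is not None})
    return {"class_type": "WanDancerVideo", "inputs": inputs}


node = wan_dancer_video_node(
    positive=["6", 0],
    negative=["7", 0],
    vae=["39", 0],
    start_image=["52", 0],           # hypothetical image loader node
    audio_encoder_output=["60", 0],  # hypothetical audio encoder node
)
print(json.dumps(node, indent=2))
```

Here `mask` and both CLIP vision inputs are not connected, so they do not appear in the resulting entry.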

Outputs

Output Name	Description	Data Type
`positive`	The positive conditioning with any additional data (concat latent, CLIP vision, audio) attached.	CONDITIONING
`negative`	The negative conditioning with any additional data (concat latent, CLIP vision, audio) attached.	CONDITIONING
`latent`	An empty latent tensor with dimensions matching the specified video length, height, and width.	LATENT
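
This page does not state the exact layout of the `latent` output. As a rough guide, the sketch below assumes the convention used by ComfyUI's other Wan video nodes: 16 latent channels, 8x spatial compression, and 4x temporal compression with the first frame kept separately. The function name and the `batch_size` and `channels` parameters are hypothetical, and the node itself may differ.

```python
def wan_latent_shape(width: int, height: int, length: int,
                     batch_size: int = 1, channels: int = 16) -> tuple:
    """Estimate the latent shape as (batch, channels, frames, height, width).

    Assumption: Wan-style VAE compression (8x spatial, 4x temporal).
    """
    latent_frames = (length - 1) // 4 + 1
    return (batch_size, channels, latent_frames, height // 8, width // 8)


# Defaults from the Inputs table: 480 x 832 pixels, 149 frames.
print(wan_latent_shape(480, 832, 149))  # (1, 16, 38, 104, 60)
```

Under this assumption, the `length` step of 4 means every valid frame count maps to a whole number of latent frames.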

This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute!
