WanSCAILToVideo - ComfyUI Built-in Node Documentation

The WanSCAILToVideo node prepares conditioning and an empty latent for video generation. It takes optional inputs (a reference image, a pose video, CLIP vision output, and frames from a previous chunk) and embeds them into the positive and negative conditioning for a video model. The node outputs the modified conditioning, a blank latent tensor of the specified video dimensions, and a frame offset to pass to the next chunk.

Inputs

Parameter	Description	Data Type	Required	Range
`positive`	The positive conditioning input.	CONDITIONING	Yes	-
`negative`	The negative conditioning input.	CONDITIONING	Yes	-
`vae`	The VAE model used for encoding images and video frames.	VAE	Yes	-
`width`	The width of the output video in pixels (default: 512). Must be divisible by 32.	INT	Yes	32 to MAX_RESOLUTION
`height`	The height of the output video in pixels (default: 896). Must be divisible by 32.	INT	Yes	32 to MAX_RESOLUTION
`length`	The number of frames in the video (default: 81). Set in steps of 4 starting from 1, giving values of the form 4n + 1 such as the default 81.	INT	Yes	1 to MAX_RESOLUTION
`batch_size`	The number of videos to generate in a batch (default: 1).	INT	Yes	1 to 4096
`pose_video`	Video used for pose conditioning. Will be downscaled to half the resolution of the main video.	IMAGE	No	-
`pose_video_mask`	SCAIL-2 only. Colored per-identity SAM3 mask video at the same resolution as pose_video.	IMAGE	No	-
`replacement_mode`	SCAIL-2 only. False = Animation Mode (pose_video_mask should have black background). True = Replacement Mode (pose_video_mask should have white background). Default: False.	BOOLEAN	No	-
`pose_strength`	Strength of the pose latent (default: 1.0).	FLOAT	Yes	0.0 to 10.0
`pose_start`	Start step of the pose conditioning (default: 0.0).	FLOAT	Yes	0.0 to 1.0
`pose_end`	End step of the pose conditioning (default: 1.0).	FLOAT	Yes	0.0 to 1.0
`reference_image`	Reference image. For multiple references, composite them all onto a single image.	IMAGE	No	-
`reference_image_mask`	SCAIL-2 only. Colored reference mask at the same resolution as reference_image.	IMAGE	No	-
`clip_vision_output`	CLIP vision features for conditioning. The model was trained with images stretch-resized to the aspect ratio.	CLIP_VISION_OUTPUT	No	-
`video_frame_offset`	Cumulative output frame index at which this chunk begins. Wire from the previous chunk’s video_frame_offset output (default: 0).	INT	Yes	0 to MAX_RESOLUTION
`previous_frame_count`	Number of tail frames of previous_frames to use as the anchor. SCAIL-2 was trained with 5 (81-frame chunks, 76-frame step) (default: 5).	INT	Yes	1 to MAX_RESOLUTION
`previous_frames`	SCAIL-2 only. Full decoded output of the previous chunk. Only the last previous_frame_count are used as the extension anchor.	IMAGE	No	-
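
The size constraints in the table can be checked before wiring the node. The following is a minimal sketch in plain Python, not the node's own code: `check_dimensions` and `pose_resolution` are hypothetical helper names, and the halving assumes that "half the resolution" means half the width and half the height.

```python
def check_dimensions(width: int, height: int) -> None:
    """Raise ValueError unless both values are multiples of 32 and at least 32."""
    for name, value in (("width", width), ("height", height)):
        if value < 32 or value % 32 != 0:
            raise ValueError(f"{name}={value} must be a multiple of 32 and at least 32")


def pose_resolution(width: int, height: int) -> tuple[int, int]:
    """Return the (width, height) the pose video would have at half resolution.

    Assumption: half resolution means halving each side of the main video.
    """
    check_dimensions(width, height)
    return width // 2, height // 2


print(pose_resolution(512, 896))  # (256, 448)
```

With the default 512 x 896 output, the pose video would be handled at 256 x 448 under this assumption.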

Note: Only the first length frames of pose_video and pose_video_mask are processed, and only the first image in the reference_image batch is used. When reference_image is provided, it is encoded into a latent and embedded into both the positive and negative conditioning. When clip_vision_output is provided, it is likewise applied to both. The pose_video is downscaled to half the resolution of the main video before encoding. When previous_frames is provided, only the last previous_frame_count frames are used as the extension anchor, and video_frame_offset is adjusted accordingly. In Replacement Mode (replacement_mode=True), the reference image is composited onto a black background using the reference image mask as an alpha matte.
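
The chunked extension described in the note can be illustrated with a small planning sketch. It is not the node's implementation. It assumes that each new chunk reuses the last previous_frame_count frames of the previous chunk as its opening frames, which matches the stated 81-frame chunks with a 76-frame step. `tail_frames` and `chunk_starts` are hypothetical helpers, and frames are modeled as a plain Python list.

```python
def tail_frames(previous_frames: list, previous_frame_count: int = 5) -> list:
    """Keep only the last previous_frame_count frames as the extension anchor."""
    if previous_frame_count < 1:
        raise ValueError("previous_frame_count must be at least 1")
    return previous_frames[-previous_frame_count:]


def chunk_starts(total_frames: int, length: int = 81,
                 previous_frame_count: int = 5) -> list[int]:
    """List the first frame index of each chunk needed to cover total_frames.

    Assumption: consecutive chunks overlap by previous_frame_count frames,
    so each chunk advances the timeline by length - previous_frame_count.
    """
    step = length - previous_frame_count
    if step <= 0:
        raise ValueError("previous_frame_count must be smaller than length")
    starts = [0]
    while starts[-1] + length < total_frames:
        starts.append(starts[-1] + step)
    return starts


print(tail_frames(list(range(81))))  # [76, 77, 78, 79, 80]
print(chunk_starts(233))             # [0, 76, 152]
```

Under these assumptions, three 81-frame chunks cover 233 output frames, because the second and third chunks each contribute 76 new frames.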

Outputs

Output Name	Description	Data Type
`positive`	The modified positive conditioning, potentially containing embedded reference image latents, CLIP vision output, pose video latents, driving masks, reference masks, or previous frame latents.	CONDITIONING
`negative`	The modified negative conditioning, potentially containing embedded reference image latents, CLIP vision output, pose video latents, driving masks, reference masks, or previous frame latents.	CONDITIONING
`latent`	An empty latent tensor of shape `[batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8]`. When previous_frames is provided, the latent is partially filled with encoded previous frames and a noise mask is included.	LATENT
`video_frame_offset`	The adjusted offset plus length. Wire this into the next chunk's video_frame_offset input for sequential video generation.	INT
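
The latent shape formula from the table can be evaluated directly to see what size to expect. This is a small sketch in plain Python. `latent_shape` is a hypothetical helper that only reproduces the arithmetic and does not allocate a tensor.

```python
def latent_shape(width: int, height: int, length: int, batch_size: int = 1) -> list[int]:
    """Evaluate [batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8]."""
    return [batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8]


# Node defaults: width=512, height=896, length=81, batch_size=1
print(latent_shape(512, 896, 81))  # [1, 16, 21, 112, 64]
```

The 81 default frames map to 21 latent frames, and each spatial side shrinks by a factor of 8.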

This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute!

Source fingerprint (SHA-256): 30e14959248c46e624e2ce2e3d079cd5aad94c12b66d74d4979ef70143b871e3

