WanFirstLastFrameToVideo - ComfyUI Built-in Node Documentation

The WanFirstLastFrameToVideo node builds video conditioning from a start frame, an end frame, and text prompts. It encodes the first and last frames with the VAE, applies masks to guide the generation process, and incorporates CLIP vision features when they are available. The node outputs positive and negative conditioning, along with an empty latent, so that a video model can generate a coherent sequence between the specified start and end points.

Inputs

Parameter	Description	Data Type	Required	Range
`positive`	Positive text conditioning for guiding the video generation	CONDITIONING	Yes	-
`negative`	Negative text conditioning for guiding the video generation	CONDITIONING	Yes	-
`vae`	VAE model used for encoding images to latent space	VAE	Yes	-
`width`	Output video width (default: 832, step: 16)	INT	Yes	16 to MAX_RESOLUTION
`height`	Output video height (default: 480, step: 16)	INT	Yes	16 to MAX_RESOLUTION
`length`	Number of frames in the video sequence (default: 81, step: 4)	INT	Yes	1 to MAX_RESOLUTION
`batch_size`	Number of videos to generate simultaneously (default: 1)	INT	Yes	1 to 4096
`clip_vision_start_image`	CLIP vision features extracted from the start image	CLIP_VISION_OUTPUT	No	-
`clip_vision_end_image`	CLIP vision features extracted from the end image	CLIP_VISION_OUTPUT	No	-
`start_image`	Starting frame image for the video sequence	IMAGE	No	-
`end_image`	Ending frame image for the video sequence	IMAGE	No	-
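
As a rough sketch, the table above corresponds to an input declaration along the following lines, written in ComfyUI's dictionary-based `INPUT_TYPES` convention. The class name `FirstLastFrameInputsSketch` and the `MAX_RESOLUTION` value are assumptions made for illustration; this is not the node's actual source.

```python
# Placeholder value (assumption): the real limit is defined by ComfyUI itself.
MAX_RESOLUTION = 16384


class FirstLastFrameInputsSketch:
    """Hypothetical stand-in that mirrors the inputs table."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "positive": ("CONDITIONING",),
                "negative": ("CONDITIONING",),
                "vae": ("VAE",),
                "width": ("INT", {"default": 832, "min": 16, "max": MAX_RESOLUTION, "step": 16}),
                "height": ("INT", {"default": 480, "min": 16, "max": MAX_RESOLUTION, "step": 16}),
                "length": ("INT", {"default": 81, "min": 1, "max": MAX_RESOLUTION, "step": 4}),
                "batch_size": ("INT", {"default": 1, "min": 1, "max": 4096}),
            },
            "optional": {
                "clip_vision_start_image": ("CLIP_VISION_OUTPUT",),
                "clip_vision_end_image": ("CLIP_VISION_OUTPUT",),
                "start_image": ("IMAGE",),
                "end_image": ("IMAGE",),
            },
        }


spec = FirstLastFrameInputsSketch.INPUT_TYPES()
print(sorted(spec["required"]))  # names of the mandatory inputs
print(sorted(spec["optional"]))  # names of the inputs that may be left unconnected
```

Splitting the inputs into `required` and `optional` groups matches the Required column: the four image-related inputs can be left unconnected.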

Note: When both `start_image` and `end_image` are provided, the node creates a video sequence that transitions between these two frames. The `clip_vision_start_image` and `clip_vision_end_image` inputs are optional; when provided, their CLIP vision features are concatenated and applied to both the positive and negative conditioning. Before processing, `start_image` is cropped to its first `length` frames and `end_image` is cropped to its last `length` frames.
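
The cropping rule in the note can be illustrated with ordinary Python slicing. This is a minimal sketch that uses lists of frame indices in place of image tensors; the helper name `crop_frames` is hypothetical and not part of ComfyUI.

```python
def crop_frames(start_frames, end_frames, length):
    """Keep at most `length` frames from each input (hypothetical helper).

    The start clip keeps its first `length` frames and the end clip keeps
    its last `length` frames. Either input may be None, since both images
    are optional.
    """
    start = start_frames[:length] if start_frames is not None else None
    end = end_frames[-length:] if end_frames is not None else None
    return start, end


# Ten-frame clips cropped to a four-frame video.
start, end = crop_frames(list(range(10)), list(range(100, 110)), length=4)
print(start)  # [0, 1, 2, 3]
print(end)    # [106, 107, 108, 109]
```

An input that already has `length` frames or fewer, such as a single still image, passes through unchanged, because slicing never pads.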

Outputs

Output Name	Description	Data Type
`positive`	Positive conditioning with applied video frame encoding and CLIP vision features	CONDITIONING
`negative`	Negative conditioning with applied video frame encoding and CLIP vision features	CONDITIONING
`latent`	Empty latent tensor with dimensions matching the specified video parameters	LATENT
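
This page does not state the exact shape of the empty latent. As an illustration only, the sketch below assumes the layout commonly associated with Wan-style video VAEs: 16 latent channels, 8x spatial compression, and 4x temporal compression with the first frame kept on its own. These factors and the function name `empty_latent_shape` are assumptions, not values confirmed by this page.

```python
def empty_latent_shape(width, height, length, batch_size,
                       channels=16, spatial_factor=8, temporal_factor=4):
    """Return an assumed (batch, channels, frames, height, width) latent shape.

    All compression factors are illustrative assumptions, not documented values.
    """
    latent_frames = (length - 1) // temporal_factor + 1
    return (
        batch_size,
        channels,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
    )


# Node defaults from the inputs table: 832x480, 81 frames, batch size 1.
print(empty_latent_shape(832, 480, 81, 1))  # (1, 16, 21, 60, 104)
```

Under these assumptions, frame counts of the form 4k + 1 (such as the default 81) map onto whole latent frames, which would be consistent with the step of 4 on `length`.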

