The CLIP Vision Encode node is an image encoding node in ComfyUI that converts an input image into visual feature vectors using a CLIP Vision model. The resulting features act as a bridge between image and text understanding and are widely used in AI image generation and processing workflows.

Node Functionality

  • Image feature extraction: Converts an input image into high-dimensional feature vectors (see the sketch after this list)
  • Multimodal bridging: Provides a shared feature space for joint processing of images and text
  • Conditional generation: Supplies visual conditions for image-based conditional generation
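
Although ComfyUI drives this through its node graph, the underlying idea can be illustrated in a few lines of Python. The sketch below uses the Hugging Face transformers library as a stand-in for ComfyUI's bundled CLIP vision code; the checkpoint name and image path are assumptions, but the outputs mirror the kinds of features this node produces.

```python
# Illustrative sketch only: ComfyUI ships its own CLIP vision implementation.
# The Hugging Face `transformers` library is used here to show the concept;
# the checkpoint name and "example.png" are assumptions.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

print(outputs.image_embeds.shape)        # pooled, projected image embedding, e.g. [1, 768]
print(outputs.last_hidden_state.shape)   # per-patch features, e.g. [1, 257, 1024]
print(outputs.hidden_states[-2].shape)   # penultimate hidden state
```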

Inputs

| Parameter Name | Data Type | Description |
| --- | --- | --- |
| clip_vision | CLIP_VISION | CLIP Vision model, usually loaded via the CLIPVisionLoader node |
| image | IMAGE | The input image to be encoded |
| crop | Dropdown | Image cropping method; options: center (center crop), none (no crop) |
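
The crop option controls how the input image is fitted to the square resolution the CLIP vision model expects. The sketch below is a rough, assumed approximation of the two modes using PIL; the actual target resolution and interpolation depend on the loaded CLIP vision model, and 224x224 is only a typical value.

```python
# Rough illustration of the two crop modes (a sketch; ComfyUI's exact resize
# resolution and interpolation may differ -- 224x224 is a typical CLIP input size).
from PIL import Image

TARGET = 224  # assumed CLIP vision input resolution

def preprocess(img: Image.Image, crop: str = "center") -> Image.Image:
    if crop == "center":
        # Scale the short side to TARGET, then cut a TARGET x TARGET square
        # from the middle, discarding the overhanging edges.
        scale = TARGET / min(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)))
        left = (img.width - TARGET) // 2
        top = (img.height - TARGET) // 2
        return img.crop((left, top, left + TARGET, top + TARGET))
    # crop == "none": squash the whole image to TARGET x TARGET,
    # keeping all content but distorting the aspect ratio.
    return img.resize((TARGET, TARGET))

square = preprocess(Image.open("example.png").convert("RGB"), crop="center")
```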

Outputs

| Output Name | Data Type | Description |
| --- | --- | --- |
| CLIP_VISION_OUTPUT | CLIP_VISION_OUTPUT | Encoded visual features |

This output object contains:

  • last_hidden_state: The hidden state of the final transformer layer
  • image_embeds: The pooled, projected image embedding vector
  • penultimate_hidden_states: The hidden state of the penultimate transformer layer
  • mm_projected: Multimodal projection result (if available)
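
Downstream nodes receive this object and read whichever fields they need. The following hypothetical custom node is a minimal sketch (the class name, category, and registration are illustrative only) showing how those attributes could be inspected, assuming they are exposed under the names listed above.

```python
# Hypothetical custom node that consumes a CLIP_VISION_OUTPUT object and prints
# the fields listed above. Class/category names are illustrative; the attribute
# names mirror the documented output fields.
class InspectClipVisionOutput:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"clip_vision_output": ("CLIP_VISION_OUTPUT",)}}

    RETURN_TYPES = ()
    OUTPUT_NODE = True
    FUNCTION = "inspect"
    CATEGORY = "utils"

    def inspect(self, clip_vision_output):
        print("image_embeds:", tuple(clip_vision_output.image_embeds.shape))
        print("last_hidden_state:", tuple(clip_vision_output.last_hidden_state.shape))
        print("penultimate_hidden_states:",
              tuple(clip_vision_output.penultimate_hidden_states.shape))
        mm = getattr(clip_vision_output, "mm_projected", None)  # may be absent
        print("mm_projected:", None if mm is None else tuple(mm.shape))
        return ()

NODE_CLASS_MAPPINGS = {"InspectClipVisionOutput": InspectClipVisionOutput}
```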