The CLIP Vision Encode node is an image encoding node in ComfyUI. It converts input images into visual feature vectors using a CLIP Vision model. Because it connects image and text understanding, it is used in many AI image generation and processing workflows.

Node Functionality
  • Image feature extraction: Converts input images into high-dimensional feature vectors
  • Multimodal bridging: Provides a foundation for joint processing of images and text
  • Conditional generation: Provides visual conditions for image-based conditional generation

Inputs

| Parameter Name | Description | Data Type |
| --- | --- | --- |
| clip_vision | CLIP vision model, usually loaded via the CLIPVisionLoader node | CLIP_VISION |
| image | The input image to be encoded | IMAGE |
| crop | Image cropping method. Options: center (center crop), none (no crop) | Dropdown |
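
The difference between the two crop options can be illustrated with a small geometry sketch. This is not ComfyUI's preprocessing code: it assumes that center keeps the largest centered square of the image and that none keeps the full frame (which would then have to be squeezed to the model's input size). The function names `center_crop_box` and `crop_box` are hypothetical.

```python
from __future__ import annotations


def center_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def crop_box(width: int, height: int, crop: str = "center") -> tuple[int, int, int, int]:
    """Return the region of the image kept for a given crop option.

    Assumption: "center" keeps a centered square, "none" keeps everything.
    """
    if crop == "center":
        return center_crop_box(width, height)
    if crop == "none":
        return 0, 0, width, height
    raise ValueError(f"unknown crop option: {crop!r}")


# A 640x480 landscape image loses 80 pixels on each side with a center crop.
print(crop_box(640, 480, "center"))
print(crop_box(640, 480, "none"))
```

Under these assumptions, center preserves the aspect ratio of the subject at the cost of discarding the image borders, while none keeps all content but may distort it when the image is not square.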

Outputs

| Output Name | Description | Data Type |
| --- | --- | --- |
| CLIP_VISION_OUTPUT | Encoded visual features | CLIP_VISION_OUTPUT |

This output object contains:
  • last_hidden_state: Hidden states from the final layer of the vision model
  • image_embeds: Image embedding vector
  • penultimate_hidden_states: Hidden states from the second-to-last layer
  • mm_projected: Multimodal projection result (if available)
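
The inputs and output described above could be wired together roughly as follows. This is a minimal sketch that builds a graph as a Python dict in the style of ComfyUI's API workflow format. The `LoadImage` node, the `clip_name` and `image` input names, the class name `CLIPVisionEncode`, and both file names are assumptions made for illustration and should be checked against your ComfyUI installation. `build_clip_vision_workflow` is a hypothetical helper.

```python
import json


def build_clip_vision_workflow(clip_name: str, image_name: str, crop: str = "center") -> dict:
    """Build a partial graph: loader + image -> CLIP Vision Encode.

    Links are written as [source_node_id, output_index].
    """
    if crop not in ("center", "none"):
        raise ValueError(f"unknown crop option: {crop!r}")
    return {
        # Assumed loader node that provides the CLIP_VISION model.
        "1": {"class_type": "CLIPVisionLoader", "inputs": {"clip_name": clip_name}},
        # Assumed image source node that provides the IMAGE input.
        "2": {"class_type": "LoadImage", "inputs": {"image": image_name}},
        # The encode node takes the three inputs listed in the Inputs table.
        "3": {
            "class_type": "CLIPVisionEncode",
            "inputs": {"clip_vision": ["1", 0], "image": ["2", 0], "crop": crop},
        },
    }


# Hypothetical file names, used only as placeholders.
workflow = build_clip_vision_workflow("clip_vision_model.safetensors", "example.png")
print(json.dumps(workflow, indent=2))
```

The graph is intentionally incomplete: the CLIP_VISION_OUTPUT of node "3" still needs to be connected to a downstream node that consumes it before the workflow produces a result.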
This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute!