The ClipVisionEncode node is used to encode input images into visual feature vectors through the CLIP Vision model.
The CLIP Vision Encode node is an image encoding node in ComfyUI that converts input images into visual feature vectors through the CLIP Vision model. It bridges image and text understanding and is used in many AI image generation and processing workflows.
Node Functionality
Parameter Name | Data Type | Description |
---|---|---|
clip_vision | CLIP_VISION | CLIP vision model, usually loaded via the CLIPVisionLoader node |
image | IMAGE | The input image to be encoded |
crop | Dropdown | Image cropping method, options: center (center crop), none (no crop) |
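
As a rough illustration of how these three inputs fit together, the sketch below builds a workflow fragment as a Python dictionary, assuming the API-format JSON layout that ComfyUI uses for exported workflows (node id mapped to `class_type` and `inputs`, with links written as `[source_node_id, output_index]`). The `class_type` string `CLIPVisionEncode`, the node ids, and both file names are assumptions chosen for the example, not values taken from this page.

```python
import json

# Sketch of an API-format workflow fragment (assumed layout).
# Node ids "1", "2", "3" and both file names are hypothetical placeholders.
workflow = {
    "1": {
        "class_type": "CLIPVisionLoader",
        "inputs": {"clip_name": "clip_vision_model.safetensors"},
    },
    "2": {
        "class_type": "LoadImage",
        "inputs": {"image": "example.png"},
    },
    "3": {
        # Assumed class name for the CLIP Vision Encode node.
        "class_type": "CLIPVisionEncode",
        "inputs": {
            "clip_vision": ["1", 0],  # output 0 of the loader node
            "image": ["2", 0],        # output 0 of the image node
            "crop": "center",         # or "none" to skip cropping
        },
    },
}

print(json.dumps(workflow, indent=2))
```

Switching `crop` to `"none"` is the only change needed to pass the image through without center cropping.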
Output Name | Data Type | Description |
---|---|---|
CLIP_VISION_OUTPUT | CLIP_VISION_OUTPUT | Encoded visual features |
This output object contains:
- last_hidden_state: The last hidden state
- image_embeds: Image embedding vector
- penultimate_hidden_states: The penultimate hidden state
- mm_projected: Multimodal projection result (if available)
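
Because `mm_projected` is only present for some models, downstream code may want to check which fields are populated before using them. The following is a minimal sketch of that check. `VisionOutputSketch` and `available_fields` are hypothetical names invented for this example; the class is a stand-in that only mirrors the four field names listed above and is not the object ComfyUI actually returns.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

FIELD_NAMES = (
    "last_hidden_state",
    "image_embeds",
    "penultimate_hidden_states",
    "mm_projected",
)


@dataclass
class VisionOutputSketch:
    """Hypothetical stand-in that mirrors the documented field names."""
    last_hidden_state: Any
    image_embeds: Any
    penultimate_hidden_states: Any
    mm_projected: Optional[Any] = None  # only set when the model provides it


def available_fields(output: Any) -> List[str]:
    """Return the documented field names that are present and not None."""
    return [
        name for name in FIELD_NAMES
        if getattr(output, name, None) is not None
    ]


# Placeholder lists stand in for real tensors.
example = VisionOutputSketch(
    last_hidden_state=[[0.1, 0.2]],
    image_embeds=[0.3, 0.4],
    penultimate_hidden_states=[[0.5, 0.6]],
)
print(available_fields(example))
```

Using `getattr` with a default keeps the check safe for objects that omit the optional field entirely.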