SAM3 Detect Node
Overview
The SAM3 Detect node performs open-vocabulary detection and segmentation using text descriptions, bounding boxes, or point prompts. It can identify and segment objects in an image based on what you describe in text, where you draw boxes, or where you click points.
Inputs
| Parameter | Data Type | Required | Range | Description |
|---|---|---|---|---|
| `model` | MODEL | Yes | - | The SAM3 model to use for detection and segmentation |
| `image` | IMAGE | Yes | - | The input image to process |
| `conditioning` | CONDITIONING | No | - | Text conditioning from CLIPTextEncode. Required when using text prompts for detection |
| `bboxes` | BOUNDING_BOX | No | - | Bounding boxes to segment within. Can be a single box (applied to all frames), a list of boxes (applied to all frames), or a list of lists (per-frame boxes). When provided without text conditioning, the node segments inside each box |
| `positive_coords` | STRING | No | - | Positive point prompts as a JSON string in the format `[{"x": int, "y": int}, ...]`, using pixel coordinates. These are points you want to include in the segmentation |
| `negative_coords` | STRING | No | - | Negative point prompts as a JSON string in the format `[{"x": int, "y": int}, ...]`, using pixel coordinates. These are points you want to exclude from the segmentation |
| `threshold` | FLOAT | No | 0.0 to 1.0 | Confidence threshold for text-based detections. Only detections with scores above this value are kept (default: 0.5) |
| `refine_iterations` | INT | No | 0 to 5 | Number of SAM decoder refinement passes. Higher values can improve mask quality. Set to 0 to use raw detector masks without refinement (default: 2) |
| `individual_masks` | BOOLEAN | No | True/False | When enabled, outputs separate masks for each detected object instead of combining them into a single mask (default: False) |
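The three accepted `bboxes` shapes can be illustrated with a small normalization helper. This is only a sketch: the dict-based box layout (`x`, `y`, `width`, `height`) is assumed from the outputs description, and `normalize_bboxes` is a hypothetical helper, not part of the node's API. The node's actual internal handling may differ.

```python
from typing import Any

# Assumed box layout (not confirmed by the node's source):
# {"x": ..., "y": ..., "width": ..., "height": ...}


def normalize_bboxes(bboxes: Any, num_frames: int) -> list:
    """Expand any of the three accepted shapes into one box list per frame."""
    # Single box: reuse it on every frame.
    if isinstance(bboxes, dict):
        return [[bboxes] for _ in range(num_frames)]
    # Flat list of boxes: reuse the whole list on every frame.
    if all(isinstance(b, dict) for b in bboxes):
        return [list(bboxes) for _ in range(num_frames)]
    # List of lists: one inner list per frame, so the lengths must match.
    if len(bboxes) != num_frames:
        raise ValueError("expected one box list per frame")
    return [list(frame_boxes) for frame_boxes in bboxes]


box_a = {"x": 10, "y": 20, "width": 100, "height": 50}
box_b = {"x": 200, "y": 40, "width": 60, "height": 60}

single = normalize_bboxes(box_a, num_frames=2)                  # one box, all frames
shared = normalize_bboxes([box_a, box_b], num_frames=2)         # same list, all frames
per_frame = normalize_bboxes([[box_a], [box_b]], num_frames=2)  # different boxes per frame
```

In this sketch, the first two shapes both produce identical box lists for every frame; only the list-of-lists form lets boxes vary between frames.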
Parameter Constraints and Notes
- Text prompts: To use text-based detection, you must provide the `conditioning` input. When text conditioning is provided, the node runs text-guided detection on the image.
- Box prompts: When `bboxes` are provided without text conditioning, the node segments the area inside each bounding box.
- Point prompts: When `positive_coords` or `negative_coords` are provided, the node uses point-based segmentation. Points are scaled to the model’s internal resolution automatically.
- Multiple prompt types: You can combine different prompt types. For example, you can provide both text conditioning and bounding boxes to restrict text detection to specific areas.
- Batch processing: The node supports batched images. When processing multiple frames, bounding boxes can be provided per-frame using a list of lists format.
- JSON format for points: Point coordinates must be provided as valid JSON strings in the format `[{"x": 100, "y": 200}, {"x": 150, "y": 250}]`.
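The point-prompt JSON format can be produced and checked with Python's standard `json` module. The sketch below is illustrative: `points_to_json` and `parse_points` are hypothetical helper names, and the validation shown is an assumption about what counts as well-formed input, not the node's own parser.

```python
import json


def points_to_json(points):
    """Serialize (x, y) pixel pairs into the JSON string format for point prompts."""
    return json.dumps([{"x": int(x), "y": int(y)} for x, y in points])


def parse_points(text):
    """Parse a point-prompt JSON string back into (x, y) tuples, rejecting bad shapes."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("point prompts must be a JSON list")
    parsed = []
    for item in data:
        if not isinstance(item, dict) or "x" not in item or "y" not in item:
            raise ValueError("each point needs 'x' and 'y' keys")
        parsed.append((int(item["x"]), int(item["y"])))
    return parsed


positive = points_to_json([(100, 200), (150, 250)])
round_trip = parse_points(positive)
```

The string in `positive` can be pasted into `positive_coords`; the same helper works for `negative_coords`.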
Outputs
| Output Name | Data Type | Description |
|---|---|---|
| `masks` | MASK | Segmentation masks. When `individual_masks` is False (default), returns a single combined mask per frame. When True, returns individual masks for each detected object |
| `bboxes` | BOUNDING_BOX | Detected bounding boxes with coordinates and confidence scores. Each box includes x, y, width, height, and score values |
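To make the relationship between the two outputs more concrete, here is a minimal pure-Python sketch of score filtering and mask combination. It rests on several assumptions: boxes are dicts with a `score` key, "above the threshold" means strictly greater, and the combined mask is an element-wise maximum (a union) of the individual masks. The node's actual implementation operates on tensors and may combine masks differently.

```python
def filter_by_score(boxes, threshold=0.5):
    """Keep only boxes whose score is strictly above the threshold (assumed semantics)."""
    return [box for box in boxes if box["score"] > threshold]


def combine_masks(masks):
    """Merge per-object masks into one mask by taking the element-wise maximum."""
    if not masks:
        return []
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [max(mask[row][col] for mask in masks) for col in range(width)]
        for row in range(height)
    ]


# Hypothetical detections for a single frame.
boxes = [
    {"x": 0, "y": 0, "width": 1, "height": 1, "score": 0.9},
    {"x": 1, "y": 1, "width": 2, "height": 1, "score": 0.5},
    {"x": 2, "y": 0, "width": 1, "height": 1, "score": 0.3},
]
kept = filter_by_score(boxes, threshold=0.5)

# Two 2x3 per-object masks, as they might look with individual_masks enabled.
mask_a = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mask_b = [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
combined = combine_masks([mask_a, mask_b])
```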