This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute.

SAM3 Detect Node

Overview

The SAM3 Detect node performs open-vocabulary detection and segmentation using text descriptions, bounding boxes, or point prompts. It can identify and segment objects in an image based on what you describe in text, where you draw boxes, or where you click points.

Inputs

| Parameter | Data Type | Required | Range | Description |
|---|---|---|---|---|
| model | MODEL | Yes | - | The SAM3 model to use for detection and segmentation |
| image | IMAGE | Yes | - | The input image to process |
| conditioning | CONDITIONING | No | - | Text conditioning from CLIPTextEncode. Required when using text prompts for detection |
| bboxes | BOUNDING_BOX | No | - | Bounding boxes to segment within. Can be a single box (applied to all frames), a list of boxes (applied to all frames), or a list of lists (per-frame boxes). When provided without text conditioning, the node segments inside each box |
| positive_coords | STRING | No | - | Positive point prompts as JSON format [{"x": int, "y": int}, ...] using pixel coordinates. These are points you want to include in the segmentation |
| negative_coords | STRING | No | - | Negative point prompts as JSON format [{"x": int, "y": int}, ...] using pixel coordinates. These are points you want to exclude from the segmentation |
| threshold | FLOAT | No | 0.0 to 1.0 | Confidence threshold for text-based detections. Only detections with scores above this value are kept (default: 0.5) |
| refine_iterations | INT | No | 0 to 5 | Number of SAM decoder refinement passes. Higher values can improve mask quality. Set to 0 to use raw detector masks without refinement (default: 2) |
| individual_masks | BOOLEAN | No | True/False | When enabled, outputs separate masks for each detected object instead of combining them into a single mask (default: False) |
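
The three accepted shapes for bboxes (a single box, a flat list, or a list of lists) can be easier to reason about once they are expanded to one list of boxes per frame. The sketch below is only an illustration of that idea, not the node's actual code. It assumes each box is a dict with x, y, width, and height keys, mirroring the fields listed under Outputs; the helper name normalize_bboxes is hypothetical.

```python
from typing import Dict, List, Sequence, Union

# Assumed box layout (not confirmed by this page for inputs):
# a dict with "x", "y", "width", "height" keys.
Box = Dict[str, float]
BoxInput = Union[Box, Sequence[Box], Sequence[Sequence[Box]]]


def normalize_bboxes(bboxes: BoxInput, num_frames: int) -> List[List[Box]]:
    """Expand any of the three accepted shapes into one box list per frame."""
    if isinstance(bboxes, dict):
        # Single box: reuse it for every frame.
        return [[bboxes] for _ in range(num_frames)]
    if all(isinstance(b, dict) for b in bboxes):
        # Flat list of boxes: the same boxes apply to every frame.
        return [list(bboxes) for _ in range(num_frames)]
    # List of lists: one inner list per frame, so the lengths must match.
    if len(bboxes) != num_frames:
        raise ValueError(
            f"expected {num_frames} per-frame box lists, got {len(bboxes)}"
        )
    return [list(frame_boxes) for frame_boxes in bboxes]


box = {"x": 10, "y": 20, "width": 30, "height": 40}
per_frame = normalize_bboxes([[box], []], num_frames=2)  # frame 2 has no boxes
```

Expanding everything to the per-frame form first means downstream code only has to handle one shape.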

Parameter Constraints and Notes

  • Text prompts: To use text-based detection, you must provide conditioning input. When text conditioning is provided, the node runs text-guided detection on the image.
  • Box prompts: When bboxes are provided without text conditioning, the node segments the area inside each bounding box.
  • Point prompts: When positive_coords or negative_coords are provided, the node uses point-based segmentation. Points are scaled to the model’s internal resolution automatically.
  • Multiple prompt types: You can combine different prompt types. For example, you can provide both text conditioning and bounding boxes to restrict text detection to specific areas.
  • Batch processing: The node supports batched images. When processing multiple frames, bounding boxes can be provided per-frame using a list of lists format.
  • JSON format for points: Point coordinates must be provided as valid JSON strings in the format [{"x": 100, "y": 200}, {"x": 150, "y": 250}].
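
Because positive_coords and negative_coords are plain strings, it is usually safer to generate them with a JSON library than to type them by hand. The following is a minimal sketch using Python's standard json module. The helper names make_point_prompt and parse_point_prompt are hypothetical and are not part of ComfyUI; the validation shown is an assumption about what a well-formed value looks like, based on the format described above.

```python
import json
from typing import Iterable, List, Tuple


def make_point_prompt(points: Iterable[Tuple[int, int]]) -> str:
    """Serialize (x, y) pixel pairs into the JSON string format for point prompts."""
    return json.dumps([{"x": int(x), "y": int(y)} for x, y in points])


def parse_point_prompt(text: str) -> List[Tuple[int, int]]:
    """Check that a string is a JSON list of {"x", "y"} objects and return the pairs."""
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, list):
        raise ValueError("point prompt must be a JSON list")
    points = []
    for item in data:
        if not isinstance(item, dict) or "x" not in item or "y" not in item:
            raise ValueError(f"each point needs 'x' and 'y' keys: {item!r}")
        points.append((int(item["x"]), int(item["y"])))
    return points


# Points to include in the segmentation, and one point to exclude.
positive_coords = make_point_prompt([(100, 200), (150, 250)])
negative_coords = make_point_prompt([(400, 50)])
```

The resulting strings can be pasted into the node's positive_coords and negative_coords fields. Coordinates are in pixels of the input image, since the node handles scaling to the model's internal resolution.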

Outputs

| Output Name | Data Type | Description |
|---|---|---|
| masks | MASK | Segmentation masks. When individual_masks is False (default), returns a single combined mask per frame. When True, returns individual masks for each detected object |
| bboxes | BOUNDING_BOX | Detected bounding boxes with coordinates and confidence scores. Each box includes x, y, width, height, and score values |
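
As a rough illustration of working with the bboxes output, the sketch below filters boxes by score and converts them to corner coordinates. It assumes each box is a dict with x, y, width, height, and score keys, and that x and y mark the top-left corner; neither the container type nor the corner convention is confirmed by this page, so check the actual output before relying on it. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed layout: {"x", "y", "width", "height", "score"} per detected box.
Box = Dict[str, float]


def keep_confident(boxes: List[Box], min_score: float = 0.5) -> List[Box]:
    """Keep only boxes whose score is strictly above min_score."""
    return [b for b in boxes if b["score"] > min_score]


def to_corners(box: Box) -> Tuple[float, float, float, float]:
    """Convert x, y, width, height to (x1, y1, x2, y2), assuming x, y is the top-left corner."""
    return (
        box["x"],
        box["y"],
        box["x"] + box["width"],
        box["y"] + box["height"],
    )


# Made-up example values, for illustration only.
detections = [
    {"x": 10, "y": 20, "width": 100, "height": 50, "score": 0.9},
    {"x": 200, "y": 40, "width": 30, "height": 30, "score": 0.4},
]
confident = keep_confident(detections, min_score=0.5)
corners = [to_corners(b) for b in confident]
```

The strict comparison mirrors the threshold input, which keeps only detections with scores above the given value.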
