# SAM3_Detect - ComfyUI Built-in Node Documentation

> Complete documentation for the SAM3_Detect node in ComfyUI. Learn its inputs, outputs, parameters and usage.

# SAM3 Detect Node

## Overview

The SAM3 Detect node performs open-vocabulary detection and segmentation using text descriptions, bounding boxes, or point prompts. It can identify and segment objects in an image based on what you describe in text, where you draw boxes, or where you click points.

## Inputs

| Parameter           | Data Type     | Required | Range      | Description                                                                                                                                                                                                                              |
| ------------------- | ------------- | -------- | ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model`             | MODEL         | Yes      | -          | The SAM3 model to use for detection and segmentation                                                                                                                                                                                     |
| `image`             | IMAGE         | Yes      | -          | The input image to process                                                                                                                                                                                                               |
| `conditioning`      | CONDITIONING  | No       | -          | Text conditioning from CLIPTextEncode. Required when using text prompts for detection                                                                                                                                                    |
| `bboxes`            | BOUNDING\_BOX | No       | -          | Bounding boxes to segment within. Can be a single box (applied to all frames), a list of boxes (applied to all frames), or a list of lists (per-frame boxes). When provided without text conditioning, the node segments inside each box |
| `positive_coords`   | STRING        | No       | -          | Positive point prompts as a JSON string of the form `[{"x": int, "y": int}, ...]`, in pixel coordinates. These points mark areas to include in the segmentation                                                                         |
| `negative_coords`   | STRING        | No       | -          | Negative point prompts as a JSON string of the form `[{"x": int, "y": int}, ...]`, in pixel coordinates. These points mark areas to exclude from the segmentation                                                                       |
| `threshold`         | FLOAT         | No       | 0.0 to 1.0 | Confidence threshold for text-based detections. Only detections with scores above this value are kept (default: 0.5)                                                                                                                     |
| `refine_iterations` | INT           | No       | 0 to 5     | Number of SAM decoder refinement passes. Higher values can improve mask quality. Set to 0 to use raw detector masks without refinement (default: 2)                                                                                      |
| `individual_masks`  | BOOLEAN       | No       | True/False | When enabled, outputs separate masks for each detected object instead of combining them into a single mask (default: False)                                                                                                              |
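
The three accepted shapes of `bboxes` (single box, list of boxes, list of lists) can be illustrated with a small normalizing helper. This is only a sketch: `normalize_bboxes` is a hypothetical function, not part of ComfyUI, and representing each box as a dict with `x`, `y`, `width`, and `height` keys is an assumption borrowed from the fields listed for the `bboxes` output, not a confirmed input format.

```python
from typing import Dict, List, Union

# Assumed box layout; the real BOUNDING_BOX type may differ.
Box = Dict[str, float]
BoxInput = Union[Box, List[Box], List[List[Box]]]


def normalize_bboxes(bboxes: BoxInput, num_frames: int) -> List[List[Box]]:
    """Expand any of the three shapes into one list of boxes per frame."""
    # Single box: reuse it on every frame.
    if isinstance(bboxes, dict):
        return [[bboxes] for _ in range(num_frames)]
    # No boxes at all: every frame gets an empty list.
    if not bboxes:
        return [[] for _ in range(num_frames)]
    # Flat list of boxes: reuse the whole list on every frame.
    if all(isinstance(b, dict) for b in bboxes):
        return [list(bboxes) for _ in range(num_frames)]
    # List of lists: one entry per frame is expected.
    if len(bboxes) != num_frames:
        raise ValueError("per-frame bboxes must have one list per frame")
    return [list(frame_boxes) for frame_boxes in bboxes]


box = {"x": 10, "y": 20, "width": 50, "height": 40}
per_frame = normalize_bboxes(box, num_frames=2)
```

Here `per_frame` holds the same single box for both frames, which mirrors the "applied to all frames" behavior described in the table.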

### Parameter Constraints and Notes

* **Text prompts**: Text-based detection requires the `conditioning` input. When it is connected, the node runs text-guided detection on the image.
* **Box prompts**: When `bboxes` are provided without text conditioning, the node segments the area inside each bounding box.
* **Point prompts**: When `positive_coords` or `negative_coords` is provided, the node uses point-based segmentation. Points are given in pixel coordinates and are scaled to the model's internal resolution automatically.
* **Multiple prompt types**: You can combine different prompt types. For example, you can provide both text conditioning and bounding boxes to restrict text detection to specific areas.
* **Batch processing**: The node supports batched images. When processing multiple frames, bounding boxes can be provided per-frame using a list of lists format.
* **JSON format for points**: Point coordinates must be provided as valid JSON strings in the format `[{"x": 100, "y": 200}, {"x": 150, "y": 250}]`.
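
Because the point inputs are plain strings, it can be convenient to build them with Python's standard `json` module instead of typing them by hand. The helper name `make_point_prompt` and the sample coordinates below are hypothetical; only the resulting string format comes from the description above.

```python
import json
from typing import Iterable, Tuple


def make_point_prompt(points: Iterable[Tuple[int, int]]) -> str:
    """Serialize (x, y) pixel coordinates into the JSON string format."""
    return json.dumps([{"x": int(x), "y": int(y)} for x, y in points])


# Points on the object to segment.
positive_coords = make_point_prompt([(100, 200), (150, 250)])
# Points on regions that should stay out of the mask.
negative_coords = make_point_prompt([(400, 50)])

print(positive_coords)  # [{"x": 100, "y": 200}, {"x": 150, "y": 250}]
```

The resulting strings can be pasted into the `positive_coords` and `negative_coords` fields of the node.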

## Outputs

| Output Name | Data Type     | Description                                                                                                                                                            |
| ----------- | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `masks`     | MASK          | Segmentation masks. When `individual_masks` is False (default), returns a single combined mask per frame. When True, returns individual masks for each detected object |
| `bboxes`    | BOUNDING\_BOX | Detected bounding boxes with coordinates and confidence scores. Each box includes `x`, `y`, `width`, `height`, and `score` values                                      |
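
As a rough illustration of how the `threshold` input relates to the `score` field of each output box, the sketch below filters a list of boxes in plain Python. `keep_confident` and the sample data are hypothetical, the dict layout is assumed from the fields listed above, and the strict `>` comparison follows the wording "above this value"; the node's actual comparison is not confirmed here.

```python
from typing import Dict, List

# Assumed layout of one output box: x, y, width, height, score.
Box = Dict[str, float]


def keep_confident(boxes: List[Box], threshold: float = 0.5) -> List[Box]:
    """Keep only boxes whose score is strictly above the threshold."""
    return [box for box in boxes if box["score"] > threshold]


# Made-up detections for illustration only.
detections = [
    {"x": 12, "y": 30, "width": 80, "height": 60, "score": 0.91},
    {"x": 200, "y": 45, "width": 40, "height": 40, "score": 0.50},
    {"x": 5, "y": 5, "width": 20, "height": 25, "score": 0.18},
]

kept = keep_confident(detections)
for box in kept:
    print(box["x"], box["y"], box["width"], box["height"], box["score"])
```

With the default threshold of 0.5, only the first detection in this sample survives, since a score of exactly 0.5 is not above the threshold under the assumption stated here.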

