Overview
SAM 3 (Segment Anything with Concepts) is a unified foundation model released by Meta for promptable segmentation in images and videos. It extends prior "Segment Anything" work by enabling open-vocabulary concept segmentation: given a short text phrase or exemplar(s), SAM 3 can exhaustively detect and segment all instances that match the concept across an image or video sequence.
Key features
- Open-vocabulary segmentation: accepts short noun-phrase (concept) prompts, covering a far larger concept vocabulary than prior open-vocabulary benchmarks.
- Multi-modal prompts: accepts text prompts and visual prompts (points, boxes, masks, exemplars).
- Presence token: a novel architectural element that separates recognition ("is the concept present at all?") from localization ("where is each instance?"), improving discrimination between closely related prompts (e.g., "player in white" vs "player in red"); a conceptual sketch follows this list.
- Decoupled detector–tracker design: a DETR-based detector and a tracker that inherits SAM 2's transformer encoder–decoder share a vision encoder but are otherwise decoupled, reducing task interference and allowing each component to scale with data.
- Image and video support: can perform interactive image segmentation, exhaustive concept segmentation, and video segmentation/tracking.
- Large annotated concept dataset (SA-Co): trained and evaluated on massive automatically-annotated concept data (millions of unique concepts) and new SA-Co benchmarks for images and video.
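To make the presence-token idea concrete, here is a conceptual PyTorch sketch (not the repository's implementation): a dedicated presence token yields a global "is the concept present?" score that gates the per-query detection scores, so recognition and localization are scored separately.

```python
# Conceptual sketch of presence-gated scoring (illustrative only; the real
# SAM 3 detector head differs in details such as layer sizes and losses).
import torch
import torch.nn as nn

class PresenceGatedScorer(nn.Module):
    """Combine a global 'is the concept present?' score with per-query scores."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.presence_head = nn.Linear(dim, 1)  # reads a dedicated presence token
        self.query_head = nn.Linear(dim, 1)     # reads each detection query

    def forward(self, presence_token: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        # presence_token: (batch, dim), queries: (batch, num_queries, dim)
        p_present = torch.sigmoid(self.presence_head(presence_token))  # (batch, 1)
        p_match = torch.sigmoid(self.query_head(queries)).squeeze(-1)  # (batch, num_queries)
        # A query counts as a detection only if the concept is judged present
        # globally AND the query localizes a matching instance.
        return p_present * p_match

scores = PresenceGatedScorer()(torch.randn(2, 256), torch.randn(2, 100, 256))
print(scores.shape)  # torch.Size([2, 100])
```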
Architecture & scale
- Shared vision encoder with two heads: a DETR-style detector conditioned on text, geometry, and exemplars; and a tracker decoder adapted from SAM 2 for temporal continuity and interactive refinement (a structural sketch follows this list).
- Model size: ~848 million parameters.
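The decoupled layout can be pictured as a shared backbone feeding two heads. The skeleton below is purely illustrative; the class and method names are assumptions, not the repository's actual modules.

```python
# Illustrative skeleton of the shared-encoder / dual-head layout.
# Class and method names here are hypothetical, not the repo's actual API.
import torch
import torch.nn as nn

class SharedEncoderTwoHeads(nn.Module):
    def __init__(self, encoder: nn.Module, detector: nn.Module, tracker: nn.Module):
        super().__init__()
        self.encoder = encoder    # shared vision backbone (image/frame features)
        self.detector = detector  # DETR-style head conditioned on text/geometry/exemplars
        self.tracker = tracker    # SAM 2-style head for temporal propagation/refinement

    def detect(self, image: torch.Tensor, prompt_embeddings: torch.Tensor):
        feats = self.encoder(image)
        return self.detector(feats, prompt_embeddings)

    def track(self, frame: torch.Tensor, memory):
        feats = self.encoder(frame)
        return self.tracker(feats, memory)
```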
Data & benchmarks
SAM 3 is driven by a large data engine that automatically annotated millions of unique concepts, producing the SA-Co dataset and benchmarks (including SA-Co/Gold, SA-Co/Silver, and SA-Co/VEval for video). The repository reports performance on SA-Co and other benchmarks (LVIS, COCO variants), showing substantially better open-vocabulary concept segmentation than prior methods.
Repository & usage
- The GitHub repo provides code for building image and video models, processors for prompting, example notebooks (image/video predictors, batched inference, agent usage), evaluation scripts for SA-Co, and instructions for finetuning.
- Requirements: Python 3.12+, PyTorch 2.7+, CUDA 12.6+ (per the README); example install commands and conda environment setup are provided.
- Checkpoints: trained checkpoints are hosted on Hugging Face and require requesting access; the README documents the authentication and download steps (a minimal download sketch follows this list).
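As a starting point, the snippet below sketches an environment check against the stated requirements and a gated-checkpoint download via huggingface_hub; the repo ID and filename are placeholders, so substitute the values given in the README.

```python
# Minimal environment check and gated-checkpoint download sketch.
# The repo_id and filename below are placeholders -- use the identifiers
# documented in the SAM 3 README after checkpoint access has been granted.
import sys

import torch
from huggingface_hub import hf_hub_download, login

assert sys.version_info >= (3, 12), "SAM 3 expects Python 3.12+"
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

login()  # interactive prompt for a Hugging Face token with checkpoint access
ckpt_path = hf_hub_download(repo_id="<sam3-repo-id>", filename="<checkpoint-file>")
print("Checkpoint saved to:", ckpt_path)
```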
Typical workflows
- Interactive segmentation: start an image or video session, provide text or visual prompts, and refine the result with points/boxes (see the sketch after this list).
- Batched inference: run batched image inference for large-scale processing.
- Fine-tuning: the repository includes scripts and options for training/finetuning on custom data.
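For illustration, the sketch below walks through the interactive, text-prompted image workflow described above. The predictor interface (build_sam3_image_predictor, set_image, predict_with_text, refine) is an assumption for this sketch; the repository's notebooks show the real API.

```python
# Hypothetical text-prompted workflow; build_sam3_image_predictor, set_image,
# predict_with_text, and refine are assumed names, not the repository's
# confirmed API -- consult the example notebooks for actual usage.
import numpy as np
from PIL import Image

from sam3 import build_sam3_image_predictor  # assumed entry point

predictor = build_sam3_image_predictor(checkpoint="<checkpoint-file>")

image = np.array(Image.open("street.jpg").convert("RGB"))
predictor.set_image(image)  # start an image session

# Exhaustive concept segmentation: one short noun phrase in, all matching instances out.
masks, scores = predictor.predict_with_text("yellow taxi")

# Interactive refinement: a positive click (label 1) on a missed instance.
masks, scores = predictor.refine(points=[(412, 230)], labels=[1])
print(f"{len(masks)} instances segmented")
```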
License & citation
- The project is distributed under the SAM License (see LICENSE file in repo).
- The README contains a recommended BibTeX citation for the SAM 3 paper and links to Meta's project page, demo, and blog announcing SAM 3.
Who should use it
SAM 3 will be useful to researchers and practitioners working on segmentation, open-vocabulary vision models, vision–language interaction, video object tracking/segmentation, and tools that need robust concept-level segmentation. The repository balances research artifacts (benchmarks, evaluation) with engineering code (inference and deployment examples).
