SAM 3: Segment Anything with Concepts

Promptable image and video segmentation that finds, segments, and tracks every instance of an open‑vocabulary concept given a short text phrase or exemplars. Distinguishing features: an automated data engine covering more than 4 million concepts, a presence token for finer text discrimination, and a detector–tracker design. Checkpoints are hosted on Hugging Face (access controlled); a CUDA‑capable GPU is required.

Introduction

Open‑vocabulary segmentation often fails because benchmarks and datasets cover only thousands of labels. SAM 3 lifts that constraint: by automatically annotating millions of unique concepts and pairing a detector–tracker architecture with a presence token, it extends promptable segmentation to hundreds of thousands of noun phrases and to practical video tracking scenarios, all within a single, reusable model.

What Sets It Apart
  • Automated large‑scale concept engine: SAM 3's data pipeline produced more than 4 million unique concepts, along with the SA‑Co benchmarks (270K concepts). So what? The model is trained to generalize to far more natural‑language prompts than prior segmentation models, with less reliance on manual label curation.
  • Presence token for text discrimination: the model adds a presence token to distinguish closely related prompts (e.g., “player in white” vs “player in red”). So what? This improves precision for fine‑grained, open‑vocabulary queries where conventional text conditioning confuses similar phrases.
  • Decoupled detector–tracker with shared encoder: a DETR‑style detector conditioned on text and exemplars sits alongside a SAM‑2‑style tracker that inherits the encoder. So what? This minimizes task interference (detection vs. temporal tracking), scales more cleanly with data, and supports efficient video sessions and joint multi‑object tracking (SAM 3.1 introduces a shared‑memory multiplexing variant).
  • Practical release & ecosystem: model code and examples (image/video notebooks) are on GitHub; checkpoints are hosted on Hugging Face (access controlled). So what? Researchers and engineers can reproduce evaluation and integrate SAM 3 into pipelines, but checkpoint access and GPU requirements are gating factors.
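
The presence‑token idea above can be illustrated with a toy scoring sketch. This is not SAM 3's implementation: it assumes, purely for illustration, that the model emits one global presence logit per prompt and one logit per candidate instance, and that the final confidence is the product of the two probabilities. The names gate_scores, presence_logit, and query_logits are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_scores(
    presence_logit: float,
    query_logits: List[float],
    threshold: float = 0.5,
) -> List[Tuple[int, float]]:
    """Combine a global 'is the concept present?' score with per-instance scores.

    Assumption (illustrative only): final score = P(present) * P(instance matches).
    Returns (instance_index, score) pairs whose score reaches the threshold.
    """
    presence = sigmoid(presence_logit)
    kept = []
    for index, logit in enumerate(query_logits):
        score = presence * sigmoid(logit)
        if score >= threshold:
            kept.append((index, score))
    return kept


# Prompt judged present: the confident candidate survives, the weak one is dropped.
print(gate_scores(4.0, [3.0, -3.0]))

# Prompt judged absent: even confident candidates are suppressed.
print(gate_scores(-4.0, [3.0, 2.0]))
```

Under this assumed scheme, a prompt such as “player in red” can be rejected for a whole frame when no red player is visible, even if individual candidates look like plausible players.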

Who It's For and Trade‑offs

Great fit if you need broad, promptable segmentation across image and video (open‑vocabulary queries, large phrase coverage), want a single model for both instance segmentation and temporal tracking, or plan to build vision components for multimodal agents or visual search. Look elsewhere if you require lightweight on‑device inference (SAM 3 expects CUDA‑capable GPUs and a recent PyTorch), need fully open, unrestricted checkpoint downloads (the checkpoints are access controlled on Hugging Face), or only need a small closed set of categories, where task‑specific detectors would be cheaper and simpler.
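
A quick way to check the GPU and PyTorch prerequisite before requesting checkpoints is a small preflight script. The sketch below is not part of the SAM 3 repository; gpu_preflight is a hypothetical helper that only reports whether PyTorch is importable and whether it can see a CUDA device. It does not verify the specific versions SAM 3 needs, which this page does not state.

```python
import importlib.util


def gpu_preflight() -> dict:
    """Report whether PyTorch is installed and whether CUDA is usable.

    Hypothetical helper for illustration; it makes no claim about the
    exact PyTorch or CUDA versions a given model release requires.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_available": False,
    }
    # Avoid an ImportError on machines without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = str(torch.__version__)
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    print(gpu_preflight())
```

If cuda_available comes back False, the lightweight or task‑specific alternatives mentioned above are likely the more practical route.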

Where It Fits

Compared with SAM 2 and common detectors (OWLv2, DINO variants), SAM 3 trades model simplicity for vocabulary breadth and dataset scale: it targets tasks where prompt flexibility and concept coverage matter more than minimal compute. The repository includes evaluation on the SA‑Co image and video benchmarks, example notebooks, and guidance for using the improved SAM 3.1 checkpoints (released 2026‑03‑27).

Information

  • Website: github.com
  • Authors: Meta Superintelligence Labs, Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer
  • Published date: 2025/07/17

Categories