AIAny - SAM 3: Segment Anything with Concepts

Overview

SAM 3 (Segment Anything with Concepts) is a unified foundation model released by Meta for promptable segmentation in images and videos. It extends prior "Segment Anything" work with open-vocabulary concept segmentation: given a short text phrase or one or more exemplars, SAM 3 detects and segments every instance that matches the concept across an image or video sequence.

Key features

Open-vocabulary segmentation: handles a far larger set of text prompts (noun phrases/concepts) than prior benchmarks cover.
Multi-modal prompts: accepts text prompts and visual prompts (points, boxes, masks, exemplars).
Presence token: a novel architectural element that improves discrimination between closely related prompts (e.g., "player in white" vs "player in red").
Decoupled detector–tracker design: the DETR-based detector and the tracker (which inherits the SAM 2 transformer encoder-decoder) share a vision encoder but are otherwise decoupled, which reduces task interference and helps the model scale with data.
Image and video support: can perform interactive image segmentation, exhaustive concept segmentation, and video segmentation/tracking.
Large annotated concept dataset (SA-Co): trained on a large, automatically annotated corpus covering millions of unique concepts, and evaluated on the new SA-Co benchmarks for images and video.
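
One way to picture the presence token is as a global gate on per-candidate scores. The sketch below assumes, following the description in the SAM 3 paper, that a single presence score ("is this concept in the image at all?") is multiplied into each candidate's own match score. The function name, the threshold, and the numbers are hypothetical and are not part of the repository's API.

```python
def combine_scores(presence_score, query_scores, threshold=0.5):
    """Gate per-candidate scores by a global presence score.

    presence_score: probability that the prompted concept appears
        anywhere in the image (hypothetical input).
    query_scores: per-candidate probabilities that the candidate
        matches the concept, assuming the concept is present.
    Returns (index, final_score) pairs whose final score reaches
    the threshold.
    """
    if not 0.0 <= presence_score <= 1.0:
        raise ValueError("presence_score must be in [0, 1]")
    kept = []
    for index, score in enumerate(query_scores):
        final = presence_score * score  # recognition x localization
        if final >= threshold:
            kept.append((index, final))
    return kept


if __name__ == "__main__":
    # Concept likely present: the strong candidate survives.
    print(combine_scores(0.9, [0.8, 0.3]))
    # Concept likely absent: even confident candidates are suppressed.
    print(combine_scores(0.1, [0.95, 0.9]))
```

Under this assumption, a prompt such as "player in red" on an image with only white-shirted players yields a low presence score, so well-localized players are still rejected.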

Architecture & scale

Shared vision encoder with two heads: a DETR-style detector conditioned on text, geometry, and exemplars; and a tracker decoder adapted from SAM 2 for temporal continuity and interactive refinement.
Model size: ~848 million parameters.
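
As a rough guide to what ~848 million parameters means in practice, the arithmetic below estimates the memory needed for the weights alone. It is a back-of-the-envelope sketch: the helper name is hypothetical, and activations, optimizer state, and framework overhead are not counted.

```python
# Bytes used to store one parameter in common floating-point formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype="float32"):
    """Return the memory needed to hold the weights alone, in GiB."""
    if num_params < 0:
        raise ValueError("num_params must be non-negative")
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gib(848_000_000, dtype)
        print(f"{dtype}: {size:.2f} GiB")
```

This works out to roughly 3.2 GiB in float32 and roughly 1.6 GiB in a 16-bit format, before any inference-time memory is added.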

Data & benchmarks

SAM 3 is driven by a large data engine that automatically annotated millions of unique concepts, producing the SA-Co dataset (including SA-Co/Gold, SA-Co/Silver, and SA-Co/VEval for video). The repository reports results on SA-Co and other benchmarks (LVIS, COCO variants) that show substantially better concept segmentation than prior open-vocabulary methods.

Repository & usage

The GitHub repo provides code for building image and video models, processors for prompting, example notebooks (image/video predictors, batched inference, agent usage), evaluation scripts for SA-Co, and instructions for finetuning.
Requirements: Python 3.12+, PyTorch 2.7+, CUDA 12.6+ (per README). The README provides example install commands, including conda environment creation.
Checkpoints: trained checkpoints are hosted on Hugging Face and require requesting access; the README documents authentication and download steps.
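
Before installing, it can help to confirm that an environment meets the stated minimums. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it relies only on `sys.version_info`, `torch.__version__`, and `torch.version.cuda`.

```python
import re
import sys

# Minimum versions as listed in the requirements above.
MINIMUMS = {"python": "3.12", "torch": "2.7", "cuda": "12.6"}


def parse_version(text):
    """Extract leading numeric components: '2.7.1+cu126' -> (2, 7, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(found, required):
    """Compare versions numerically, so that 3.9 < 3.12."""
    found_parts = parse_version(found)
    required_parts = parse_version(required)
    width = max(len(found_parts), len(required_parts))
    found_parts += (0,) * (width - len(found_parts))
    required_parts += (0,) * (width - len(required_parts))
    return found_parts >= required_parts


def check_environment():
    """Return a dict mapping each requirement to True or False."""
    python_version = "{}.{}".format(*sys.version_info[:2])
    report = {"python": meets_minimum(python_version, MINIMUMS["python"])}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, MINIMUMS["torch"])
    cuda_version = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda_version is not None and meets_minimum(
        cuda_version, MINIMUMS["cuda"]
    )
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'below minimum or missing'}")
```

Versions are compared as integer tuples rather than strings, since a plain string comparison would rank "3.9" above "3.12".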

Typical workflows

Interactive segmentation: start an image or video session, provide text or visual prompts, then refine the result with points or boxes.
Batched inference: run batched image inference for large-scale processing.
Fine-tuning: repository includes scripts and options for training/finetuning on custom data.
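
The batched workflow above can be sketched as a simple chunking loop. This is an illustration only: `segment_batch` is a hypothetical callable standing in for whatever batched predictor the repository's notebooks construct, and its signature is an assumption.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_batched(image_paths, prompt, segment_batch, batch_size=8):
    """Apply segment_batch to image_paths in chunks.

    segment_batch(paths, prompt) is a placeholder (assumed signature)
    that must return one result per input path, in order.
    Returns a dict mapping each path to its result.
    """
    results = {}
    for batch in batched(image_paths, batch_size):
        outputs = segment_batch(batch, prompt)
        if len(outputs) != len(batch):
            raise ValueError("segment_batch must return one result per image")
        results.update(zip(batch, outputs))
    return results


if __name__ == "__main__":
    # Stand-in predictor that only echoes its inputs.
    def fake_segment_batch(paths, prompt):
        return [f"masks for '{prompt}' in {path}" for path in paths]

    paths = [f"img_{i}.jpg" for i in range(5)]
    for path, result in run_batched(paths, "yellow bus", fake_segment_batch, 2).items():
        print(path, "->", result)
```

Keeping the predictor behind a plain callable means the same loop works whether the real model runs on one GPU or is swapped for a stub during testing.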

License & citation

The project is distributed under the SAM License (see LICENSE file in repo).
The README contains a recommended BibTeX citation for the SAM 3 paper and links to Meta's project page, demo, and blog announcing SAM 3.

Who should use it

Researchers and practitioners working on segmentation, open-vocabulary vision models, vision–language interaction, video object tracking/segmentation, and tools that need robust concept-level segmentation will find SAM 3 useful. The repository balances research artifacts (benchmarks, evaluation) with engineering code (inference, deployment examples).
