
DINOv3

DINOv3 is Meta AI Research's reference PyTorch implementation and model collection for a family of self-supervised vision foundation models. It provides high-resolution dense patch features, multiple pretrained backbones (ViT and ConvNeXt variants), pretrained heads for classification/detection/segmentation/depth, and integration examples for PyTorch Hub and Hugging Face. The repo contains training and evaluation scripts, notebooks, and instructions to obtain model weights.

Introduction

Overview

DINOv3 is a reference implementation and model release from Meta AI Research (FAIR) for a family of self-supervised vision foundation models producing high-quality dense patch-level features. The project focuses on versatile vision backbones (ViT and ConvNeXt variants) pretrained on large datasets and adapted to a variety of downstream tasks with little or no fine-tuning.

Key features
  • Backbones: multiple ViT sizes (including distilled variants and a very large ViT-7B) and ConvNeXt variants, with pretrained weights for different pretraining corpora (e.g., LVD-1689M for web images, SAT-493M for satellite imagery).
  • Dense features: the models produce high-resolution, patch-wise embeddings suitable for dense tasks such as segmentation, dense matching, and tracking.
  • Pretrained heads: released heads and examples for image classification, detection (COCO), segmentation (ADE20K), and depth estimation (SYNTHMIX/NYUv2), plus zero-shot setups (dino.txt).
  • Integration: explicit support and usage examples for PyTorch Hub, Hugging Face Transformers/Hub, and third-party libraries (timm). The README documents pipelines for feature extraction via Transformers and loading via torch.hub; a hedged Transformers sketch follows this list.
  • Notebooks and demos: several example notebooks (PCA visualization, foreground segmentation, dense/sparse matching, segmentation tracking, dino.txt zero-shot segmentation) with Colab links to help users get started.
  • Training & evaluation: full training and evaluation scripts, multi-stage recipes for large-scale models (including pretraining, gram anchoring, high-resolution adaptation for ViT-7B), and instructions for reproducing paper results.
  • Licensing & access: code and model weights released under the repository license (DINOv3 License). Some model weights require requesting access and downloading via provided URLs; the README advises using command-line tools like wget for the downloads.
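
The Hugging Face integration mentioned above can be exercised with the standard Transformers feature-extraction flow. The sketch below is a minimal example under assumptions: the Hub model id ("facebook/dinov3-vitb16-pretrain-lvd1689m") is hypothetical, so check the Hugging Face Hub for the exact released checkpoint names.

```python
# Hedged sketch: a DINOv3 backbone as a feature extractor via Hugging Face Transformers.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/dinov3-vitb16-pretrain-lvd1689m"   # assumed model id; verify on the Hub
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB")        # any RGB image
inputs = processor(images=image, return_tensors="pt")   # resize/normalize to the model's defaults
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds the CLS/register tokens followed by the patch tokens;
# the patch tokens are the dense features used for segmentation/matching/tracking.
tokens = outputs.last_hidden_state
print(tokens.shape)
```
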
Typical use cases
  • Extracting high-quality patch features for dense vision tasks (segmentation, matching, tracking).
  • Using pretrained backbones as drop-in feature extractors for downstream classifiers, detectors, or segmentation heads (see the torch.hub sketch after this list).
  • Research and development that requires large self-supervised vision models and reproduction of published experiments.
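
For the drop-in use case above, a minimal torch.hub loading sketch is shown below. The entry-point name and checkpoint path are assumptions; the actual names and download URLs are listed in the repository README, and some checkpoints require requesting access first.

```python
# Hedged sketch: loading a DINOv3 backbone through torch.hub as a frozen feature extractor.
import torch

backbone = torch.hub.load(
    "facebookresearch/dinov3",              # repository on GitHub
    "dinov3_vitb16",                        # assumed hub entry-point name; see hubconf.py
    weights="/path/to/dinov3_vitb16.pth",   # assumed kwarg pointing at a downloaded checkpoint
)
backbone.eval()

# One 224x224 RGB image; with a 16x16 patch size this corresponds to a 14x14 patch grid.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = backbone(x)  # image-level embedding usable as input to a downstream head

print(features.shape)
```
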
Practical notes
  • The repo expects modern PyTorch (README indicates PyTorch >= 2.7.1) and is tested in Linux environments; CUDA-enabled installations are recommended for performance.
  • Hugging Face and timm support are noted in the repository, enabling convenient model loading and inference pipelines; a hedged timm sketch follows these notes.
  • Pretrained weights are organized by backbone and pretraining dataset; some large checkpoints (e.g., ViT-7B) and classifier/detector/segmentor heads are provided as separate downloads.
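
Assuming the backbones are exposed through timm (as the notes above indicate), loading could look like the following sketch; the model name string is hypothetical and should be checked against timm's model registry.

```python
# Hedged sketch: pulling a DINOv3 backbone through timm for feature extraction.
import timm
import torch

# num_classes=0 strips the classification head so the model returns pooled features.
model = timm.create_model(
    "vit_base_patch16_dinov3.lvd1689m",  # hypothetical model name; verify with timm.list_models()
    pretrained=True,
    num_classes=0,
)
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    embedding = model(x)  # pooled image-level embedding
print(embedding.shape)
```
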
References
  • Associated paper: arXiv:2508.10104 (DINOv3).
  • Official project page / blog: Meta AI DINOv3 resources.

Information

  • Website: github.com
  • Authors: Meta AI Research (FAIR), Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
  • Published date: 2025/08/07
