Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin is an open-source document image parsing project from ByteDance that uses heterogeneous anchor prompting and a document-type-aware two-stage architecture. It handles both digital-born and photographed documents, offering page-level and element-level parsing (text, tables, formulas, code). Dolphin-v2 (3B) improves accuracy and adds multi-page PDF support, deployment recipes (vLLM, TensorRT-LLM), and Hugging Face model hosting. The repository includes code, demos, pretrained models, and a BibTeX citation; license: MIT.

Introduction

Dolphin is an open-source project and accompanying research effort from ByteDance that tackles document image parsing across diverse document types. It centers on a document-type-aware two-stage architecture and a novel heterogeneous anchor prompting mechanism to robustly parse complex page content.

Core ideas
  • Two-stage, document-type-aware pipeline:

    1. Document type classification (digital vs photographed) plus layout analysis with reading-order prediction.
    2. Hybrid parsing: for photographed documents use holistic parsing; for digital documents use parallel element-wise parsing to exploit structural regularities.
  • Heterogeneous Anchor Prompting: uses anchor prompts tailored to each element category (text paragraphs, tables, formulas, figures, code blocks) to guide a single VLM-based parser toward structured outputs efficiently; a minimal sketch of the full flow follows this list.
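
To ground the two-stage flow, here is a minimal sketch in Python. Everything in it is illustrative: the method names (classify, layout, parse), the element attributes (crop, category), and the prompt strings are hypothetical stand-ins, not the repository's actual API.

```python
# Illustrative sketch of the document-type-aware two-stage pipeline.
# All names below are hypothetical stand-ins, not the repository's API.
from concurrent.futures import ThreadPoolExecutor

# Heterogeneous anchor prompts: one prompt per element category.
ANCHOR_PROMPTS = {
    "text":    "Read the text in this region.",
    "table":   "Parse this table into HTML.",
    "formula": "Transcribe this formula as LaTeX.",
    "code":    "Transcribe this code block verbatim.",
    "figure":  "Describe this figure.",
}

def parse_page(image, model):
    # Stage 1: classify the document type and predict layout + reading order.
    doc_type = model.classify(image)   # "digital" or "photographed"
    elements = model.layout(image)     # regions, already in reading order

    # Stage 2: hybrid parsing.
    if doc_type == "photographed":
        # Photographed pages: one holistic pass over the whole page.
        return model.parse(image, prompt="Parse the entire page.")

    # Digital pages: decode elements in parallel, each guided by the
    # anchor prompt matching its category.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(
            lambda el: model.parse(el.crop, prompt=ANCHOR_PROMPTS[el.category]),
            elements,
        ))
```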

Key features
  • Supports page-level parsing (producing structured JSON/Markdown for full pages) and element-level parsing (tables, formulas, text, code); an illustrative output shape appears after this list.
  • Dolphin-v2: a larger 3B-parameter model with expanded element detection (21 element types), attribute extraction, dedicated formula/code parsing, and stronger photographed-document handling.
  • Efficiency: designed for lightweight inference and parallel element decoding; offers deployment support including vLLM and TensorRT-LLM for accelerated inference.
  • Practical tooling: demo scripts for layout, page, and element parsing; instructions to fetch pretrained weights from Hugging Face; multi-page PDF parsing support.
  • Open-source license: MIT. Includes BibTeX citation for academic use.
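
To illustrate the page-level feature above, a parsed page might take a shape like the following. The field names and values are hypothetical, chosen only to show per-element records in reading order; they are not the project's actual schema.

```python
# Hypothetical page-level output: one record per element, in reading order.
# Field names are illustrative, not the project's real schema.
page_result = [
    {"category": "title",   "bbox": [64, 52, 980, 110],  "content": "Quarterly Report"},
    {"category": "text",    "bbox": [64, 130, 980, 420], "content": "Revenue grew by ..."},
    {"category": "table",   "bbox": [64, 440, 980, 760], "content": "<table>...</table>"},
    {"category": "formula", "bbox": [64, 780, 520, 840], "content": "E = mc^2"},
]
```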

Performance & releases (high level)
  • The project reports substantial improvements on page- and element-level metrics on OmniDocBench (v1.5) and related benchmarks, with Dolphin-v2 reportedly scoring markedly higher overall than earlier Dolphin releases.
  • Changelog highlights: model/code release and demo in May 2025, vLLM and TensorRT-LLM support added June 2025, and Dolphin-v2 released in December 2025.

Usage & integration
  • Clone the repo, install requirements, and download pretrained weights (Hugging Face model card provided).
  • Example scripts: demo_page.py, demo_layout.py, demo_element.py for different parsing granularities; CLI examples are provided in the repository.
  • Deployment: example integrations with vLLM and TensorRT-LLM for production inference speedups (a hedged end-to-end sketch follows this list).
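
As a concrete but hedged starting point, the sketch below downloads the pretrained weights with huggingface_hub and runs offline inference through vLLM's multimodal API. The repo id ByteDance/Dolphin, the prompt string, and the assumption that vLLM can load this checkpoint directly come from the README summary above and are not verified here.

```python
# Hedged sketch: fetch weights from Hugging Face, then run offline
# inference via vLLM. The repo id and prompt format are assumptions.
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams
from PIL import Image

# Download the pretrained checkpoint locally (repo id is an assumption).
model_dir = snapshot_download(repo_id="ByteDance/Dolphin")

# vLLM offline inference with an image input (multimodal API).
llm = LLM(model=model_dir, trust_remote_code=True)
page = Image.open("sample_page.png")
outputs = llm.generate(
    {
        "prompt": "Parse the document page.",  # placeholder prompt
        "multi_modal_data": {"image": page},
    },
    SamplingParams(temperature=0.0, max_tokens=4096),
)
print(outputs[0].outputs[0].text)  # structured Markdown/JSON per the README
```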

Target applications
  • Document OCR and understanding, automated information extraction from reports/forms/research papers, table and formula parsing, digital archive structuring, and any downstream NLP/knowledge extraction pipelines that require structured representations of document pages.

Notes
  • The repository bundles code, pretrained models (Hugging Face), demo data and README guides. Community contributions such as edge-case reports are invited via issues.

(Information summarized from the project's GitHub README and changelog maintained in the repository.)

Information

  • Website: github.com
  • Authors: ByteDance
  • Published date: 2025/05/13
