Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin is an open-source document image parsing project from ByteDance that uses heterogeneous anchor prompting and a document-type-aware two-stage architecture. It handles both digital-born and photographed documents, offering page-level and element-level parsing (text, tables, formulas, code). Dolphin-v2 (3B) improves accuracy and adds multi-page PDF support, deployment recipes (vLLM, TensorRT-LLM), and Hugging Face model hosting. The repository includes code, demos, pretrained models, and a BibTeX citation; license: MIT.

Introduction

Dolphin is an open-source project and accompanying research effort from ByteDance that tackles document image parsing across diverse document types. It centers on a document-type-aware two-stage architecture and a novel heterogeneous anchor prompting mechanism to robustly parse complex page content.

Core ideas
  • Two-stage, document-type-aware pipeline:

    1. Document type classification (digital vs photographed) plus layout analysis with reading-order prediction.
    2. Hybrid parsing: for photographed documents use holistic parsing; for digital documents use parallel element-wise parsing to exploit structural regularities.
  • Heterogeneous Anchor Prompting: uses anchor prompts tailored to each element category (text paragraphs, tables, formulas, figures, code blocks) to guide a single VLM-based parser toward structured outputs efficiently; a minimal sketch of the full flow follows this list.
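
To ground the two-stage flow, here is a minimal sketch in Python. Everything in it is illustrative: the method names (classify, layout, parse), the element attributes (crop, category), and the prompt strings are hypothetical stand-ins, not the repository's actual API.

```python
# Illustrative sketch of the document-type-aware two-stage pipeline.
# All names below are hypothetical stand-ins, not the repository's API.
from concurrent.futures import ThreadPoolExecutor

# Heterogeneous anchor prompts: one prompt per element category.
ANCHOR_PROMPTS = {
    "text":    "Read the text in this region.",
    "table":   "Parse this table into HTML.",
    "formula": "Transcribe this formula as LaTeX.",
    "code":    "Transcribe this code block verbatim.",
    "figure":  "Describe this figure.",
}

def parse_page(image, model):
    # Stage 1: classify the document type and predict layout + reading order.
    doc_type = model.classify(image)   # "digital" or "photographed"
    elements = model.layout(image)     # regions, already in reading order

    # Stage 2: hybrid parsing.
    if doc_type == "photographed":
        # Photographed pages: one holistic pass over the whole page.
        return model.parse(image, prompt="Parse the entire page.")

    # Digital pages: decode elements in parallel, each guided by the
    # anchor prompt matching its category.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(
            lambda el: model.parse(el.crop, prompt=ANCHOR_PROMPTS[el.category]),
            elements,
        ))
```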

Key features
  • Supports page-level parsing (producing structured JSON/Markdown for full pages) and element-level parsing (tables, formulas, text, code); an illustrative output shape appears after this list.
  • Dolphin-v2: a larger 3B-parameter model with expanded element detection (21 element types), attribute extraction, dedicated formula/code parsing, and stronger photographed-document handling.
  • Efficiency: designed for lightweight inference and parallel element decoding; offers deployment support including vLLM and TensorRT-LLM for accelerated inference.
  • Practical tooling: demo scripts for layout, page, and element parsing; instructions to fetch pretrained weights from Hugging Face; multi-page PDF parsing support.
  • Open-source license: MIT. Includes BibTeX citation for academic use.
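
To illustrate the page-level feature above, a parsed page might take a shape like the following. The field names and values are hypothetical, chosen only to show per-element records in reading order; they are not the project's actual schema.

```python
# Hypothetical page-level output: one record per element, in reading order.
# Field names are illustrative, not the project's real schema.
page_result = [
    {"category": "title",   "bbox": [64, 52, 980, 110],  "content": "Quarterly Report"},
    {"category": "text",    "bbox": [64, 130, 980, 420], "content": "Revenue grew by ..."},
    {"category": "table",   "bbox": [64, 440, 980, 760], "content": "<table>...</table>"},
    {"category": "formula", "bbox": [64, 780, 520, 840], "content": "E = mc^2"},
]
```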

Performance & releases (high level)
  • The project reports substantial improvements on page- and element-level metrics on OmniDocBench (v1.5) and related benchmarks, with Dolphin-v2 reportedly scoring markedly higher overall than earlier Dolphin releases.
  • Changelog highlights: model/code release and demo in May 2025, vLLM and TensorRT-LLM support added June 2025, and Dolphin-v2 released in December 2025.

Usage & integration
  • Clone the repo, install requirements, and download pretrained weights (Hugging Face model card provided).
  • Example scripts: demo_page.py, demo_layout.py, demo_element.py for different parsing granularities; CLI examples are provided in the repository.
  • Deployment: example integrations with vLLM and TensorRT-LLM for production inference speedups (a hedged end-to-end sketch follows this list).
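
As a concrete but hedged starting point, the sketch below downloads the pretrained weights with huggingface_hub and runs offline inference through vLLM's multimodal API. The repo id ByteDance/Dolphin, the prompt string, and the assumption that vLLM can load this checkpoint directly come from the README summary above and are not verified here.

```python
# Hedged sketch: fetch weights from Hugging Face, then run offline
# inference via vLLM. The repo id and prompt format are assumptions.
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams
from PIL import Image

# Download the pretrained checkpoint locally (repo id is an assumption).
model_dir = snapshot_download(repo_id="ByteDance/Dolphin")

# vLLM offline inference with an image input (multimodal API).
llm = LLM(model=model_dir, trust_remote_code=True)
page = Image.open("sample_page.png")
outputs = llm.generate(
    {
        "prompt": "Parse the document page.",  # placeholder prompt
        "multi_modal_data": {"image": page},
    },
    SamplingParams(temperature=0.0, max_tokens=4096),
)
print(outputs[0].outputs[0].text)  # structured Markdown/JSON per the README
```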

Target applications
  • Document OCR and understanding, automated information extraction from reports/forms/research papers, table and formula parsing, digital archive structuring, and any downstream NLP/knowledge extraction pipelines that require structured representations of document pages.

Notes
  • The repository bundles code, pretrained models (Hugging Face), demo data and README guides. Community contributions such as edge-case reports are invited via issues.

(Information summarized from the project's GitHub README and changelog maintained in the repository.)

Information

  • Website: github.com
  • Authors: ByteDance
  • Published date: 2025/05/13
