Parses document images into structured page- and element-level outputs (layout, text, tables, formulas, code) using a document-type-aware, two-stage vision-language model (VLM) with heterogeneous anchor prompting for efficient parallel parsing.
Most document parsers treat every page the same, but photographed pages and born-digital PDFs pose very different challenges for layout and element extraction. Dolphin's central insight is to detect the document type first and then apply a tailored parsing strategy: holistic parsing for photographed pages, and parallel element-wise decoding for digital pages. A single model can then handle both kinds of input reliably without inflating latency.
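The dispatch described above can be sketched as follows. This is a minimal illustration, not Dolphin's actual code: `parse_page`, `Element`, `ANCHOR_PROMPTS`, the prompt strings, and the four callables passed in are all hypothetical names standing in for the model calls, and the reading of "heterogeneous anchor prompting" as one prompt per element type is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Element:
    """A layout region found on a digital page (hypothetical structure)."""
    kind: str                          # e.g. "text", "table", "formula", "code"
    bbox: Tuple[int, int, int, int]    # (x0, y0, x1, y1)


# Hypothetical type-specific prompts; the real prompt wording is not given here.
ANCHOR_PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Transcribe the formula as LaTeX.",
    "code": "Transcribe the code block.",
}


def parse_page(
    image: Any,
    classify: Callable[[Any], str],
    parse_holistic: Callable[[Any], Any],
    detect_layout: Callable[[Any], List[Element]],
    decode_element: Callable[[Any, Element, str], Any],
    max_workers: int = 4,
) -> List[Any]:
    """Route a page by document type, then parse it with the matching strategy."""
    # Stage 1: decide which strategy applies to this page.
    if classify(image) == "photographed":
        # Photographed pages are parsed in a single holistic pass.
        return [parse_holistic(image)]

    # Digital pages: find elements, then decode each one independently.
    elements = detect_layout(image)

    def decode(el: Element) -> Any:
        # Unknown element types fall back to the plain-text prompt.
        prompt = ANCHOR_PROMPTS.get(el.kind, ANCHOR_PROMPTS["text"])
        return decode_element(image, el, prompt)

    # Stage 2: element-wise decoding in parallel; map() keeps input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode, elements))
```

Passing the model calls in as plain callables keeps the routing logic testable with stubs. A thread pool is used here only to show the parallel structure; a real deployment would more likely batch the element crops into one model call.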
Great fit if you need a single, research-grade system that parses diverse document types (scanned photos, screenshots, digital PDFs) and extracts structured elements, including formulas and code blocks, or if you want an extensible codebase for running benchmarks and integrating with Hugging Face model cards. Look elsewhere if you require an out-of-the-box hosted API (this is a research code repository that requires you to download and deploy the model yourself), or if you need extremely low latency on tiny edge devices and cannot use the smaller 0.3B variants.
Compared to single-strategy systems (pure OCR plus heuristics, or monolithic layout transformers), Dolphin explicitly separates document-type handling from element decoding, which narrows the failure modes that are specific to photographed versus digital documents. It sits between academic baselines and production-oriented toolchains, making it useful for teams that want strong parsing accuracy plus the ability to optimize inference for their own infrastructure.