AI Image · 2025

Dolphin

Converts document images—scans, photos, born-digital PDFs—into structured text in two stages: first map layout and reading order, then parse each element (text, tables, formulas, figures) in parallel, each guided by its own task prompt.

Introduction

Most document parsers force a choice: stitch together separate detectors, OCR engines, and table models that drift out of sync, or hand the whole page to a large general-purpose VLM and eat the latency. Dolphin takes a third path—analyze the layout first, then let the discovered elements prompt the parsing of their own content.

Key Capabilities

Analyze-then-parse, two stages. Stage 1 emits a sequence of layout elements in natural reading order; Stage 2 treats each element as an anchor paired with a task-specific prompt. Because the elements are parsed in parallel, throughput is no longer bottlenecked by decoding a whole page top to bottom.
Heterogeneous anchor prompting. Text, tables, formulas, figures, and code blocks each get a different prompt, so one model handles intertwined content instead of a pipeline of one model per element type that has to be kept in sync.
Built on scale. Trained on 30M+ samples spanning multi-granularity tasks, it works at both page level and element level, and handles multi-page PDFs.
Lightweight by design. It is far smaller than a general-purpose VLM; the throughput advantage comes from that compact size combined with parallel element parsing, not from scaling the model up.
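
The analyze-then-parse flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Dolphin's actual API: `analyze_layout`, `parse_element`, `parse_page`, the `Element` type, and the prompt strings in `PROMPTS` are all hypothetical stand-ins, and the model calls are replaced by stubs so the control flow can be run on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical per-type prompts; the real prompt strings are not given here.
PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Parse this table.",
    "formula": "Transcribe this formula as LaTeX.",
    "figure": "Describe this figure.",
    "code": "Transcribe this code block.",
}


@dataclass
class Element:
    kind: str    # element type found by layout analysis
    bbox: tuple  # (x0, y0, x1, y1) region on the page


def analyze_layout(page):
    """Stage 1 stand-in: return layout elements in reading order.

    A real system would run a model on the page image; here `page` is
    already a list of (kind, bbox) pairs, assumed to be in reading order.
    """
    return [Element(kind, bbox) for kind, bbox in page]


def parse_element(element):
    """Stage 2 stand-in: pair the element (the anchor) with its prompt.

    A real system would send the cropped region plus the prompt to the
    model; this stub only records which prompt would be used.
    """
    return {
        "kind": element.kind,
        "bbox": element.bbox,
        "prompt": PROMPTS[element.kind],
    }


def parse_page(page, max_workers=4):
    """Run Stage 1 once, then parse all elements concurrently."""
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the reading
        # order from Stage 1 is preserved even though work runs in parallel.
        return list(pool.map(parse_element, elements))


if __name__ == "__main__":
    demo_page = [
        ("text", (0, 0, 100, 20)),
        ("table", (0, 30, 100, 80)),
        ("formula", (0, 90, 100, 110)),
    ]
    for result in parse_page(demo_page):
        print(result["kind"], "->", result["prompt"])
```

The point of the design is that Stage 2 calls are independent of one another, so they can be dispatched together while the Stage 1 ordering is kept for reassembling the page.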

Who It's For

Great fit if you need to turn messy real-world documents—phone photos, scans, multi-column PDFs—into structured Markdown or JSON with tables and formulas preserved, and you care about throughput over a do-everything model. Look elsewhere if you want a general-purpose VLM for visual Q&A or reasoning over images: Dolphin is specialized for parsing document structure, not conversational understanding. The repo now also ships an enhanced v2 checkpoint aimed at photographed documents, so check which checkpoint matches your inputs before committing.

Information

Website: github.com
Authors: ByteDance
Published date: 2025/05/13

More Items

Computer Vision Papers · 2026

Read It Back: Pretrained MLLMs Are Zero-Shot Reward Models for Text-to-Image Generation

Runhui Huang, Qihui Zhang +4

Uses pretrained multimodal LLMs as zero-shot, training-free reward models for text-to-image RL by scoring how well the original text prompt can be recovered from a generated image via image-conditioned prompt log-likelihood; includes a Self-SpectraReward closed-loop variant.

paper multimodal vision RL evaluation+4

AI Model · 2023

FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Bowen Wen, Wei Yang +2 · NVIDIA, NVlabs +1

Estimates and tracks 6D poses of novel objects without per-object fine-tuning — supports both model-based (CAD) and model-free (few reference images) setups. Trained on large-scale synthetic data with a transformer-based architecture and contrastive learning; CVPR 2024 highlight with demos and pretrained weights.

pytorch vision robotics depth foundation-model+3

Computer Vision Papers · 2026

LightMem-Ego: Your AI Memory for Everyday Life

Yijun Chen, Boyi Xiao +11

Continuously records egocentric visual and audio streams into a lightweight streaming memory that organizes experiences into current, short-term, and long-term tiers and retrieves multimodal evidence to answer queries about past events. Built for on-device use (smartphones/AI glasses) with dynamic retrieval routing.

multimodal vision audio mobile code+1