Document OCR and layout parsing are often split across specialized pipelines or dominated by very large models. Surya takes a different tradeoff: it unifies layout, OCR (including inline math), reading order, and table recognition inside a single ~650M-parameter vision–language model, aiming for strong end-to-end accuracy while keeping model size and inference cost constrained.
What Sets It Apart
- Pareto-efficient size/quality tradeoff: Surya scores 83.3% on olmOCR-bench while staying well under 3B parameters, so you get near state-of-the-art document parsing without a very large model footprint. In production pipelines that means lower VRAM use and cheaper inference.
- Unified VLM for layout, OCR, and table recognition: one model emits layout JSON or full-page HTML (with <math> tags) depending on the prompt. This simplifies pipelines and reduces format-translation errors, so integration is easier and fewer failures cascade across separate components.
- Multilingual and reading-order aware: evaluated on an internal 91-language benchmark with broad pass rates, and it outputs reading order explicitly. This gives better cross-language robustness and more usable structure for downstream extraction tasks.
- Practical inference choices: runs with vLLM on NVIDIA GPUs or llama.cpp/llama-server on CPU and Apple Silicon, and provides a manager that auto-spawns or attaches to the backend. Deployment can therefore range from a local CPU to a single-GPU server.
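Because the full-page output is HTML with <math> tags, downstream code can pull formulas out with an ordinary HTML parser. The sketch below uses only the Python standard library. The sample markup, the `MathExtractor` class, and the `extract_math` helper are hypothetical illustrations, not Surya's API, and the assumption that formula text sits directly inside the tag may not match the model's exact output.

```python
from html.parser import HTMLParser


class MathExtractor(HTMLParser):
    """Collect the text found inside every <math>...</math> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while we are inside a <math> element
        self._buffer = []    # text fragments of the formula being read
        self.formulas = []   # completed formulas, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "math":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "math" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.formulas.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_math(page_html):
    """Return the formulas in page_html as a list of strings."""
    parser = MathExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.formulas


# Hypothetical page fragment; real model output may be structured differently.
SAMPLE_HTML = (
    "<p>Energy is <math>E = mc^2</math> and "
    "area is <math>A = \\pi r^2</math>.</p>"
)

if __name__ == "__main__":
    for formula in extract_math(SAMPLE_HTML):
        print(formula)
```

Keeping math in the same HTML stream as the surrounding text means one parser pass recovers both, which is the kind of format-translation step a multi-model pipeline would otherwise need to reconcile.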
Who It's For and Tradeoffs
Great fit if you need accurate, production-friendly document parsing that balances quality and cost: teams extracting structured text, tables, or semantics from scanned PDFs and multilingual documents who want a single-model stack. Look elsewhere if absolute top-of-the-line leaderboard scores (from very large models) are the sole priority, if your use case is natural-scene text (photos), or if your commercial licensing needs exceed the model’s modified OpenRAIL-M terms (weights are free for research/personal use and small startups; commercial licensing details are on the project site).
Where It Fits
Compared with larger document parsers, Surya occupies the lower-latency / lower-cost part of the accuracy curve: it’s a pragmatic choice when you need structured OCR + layout + table output with modest infrastructure, rather than pursuing the last percent of benchmark performance with multi-billion-parameter models.
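To put the size argument in rough numbers, the memory needed just to hold a model's weights is the parameter count times the bytes per parameter. The sketch below applies that arithmetic to the ~650M figure quoted above and to a 3B reference point. It is a lower bound only: activations, KV cache, and runtime overhead are ignored, and the precisions listed are assumptions rather than Surya's documented serving configuration.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Parameter counts: ~650M is taken from the text; 3B is a reference point.
MODEL_SIZES = {
    "~650M params": 650e6,
    "3B params": 3e9,
}

# Assumed storage precisions (bytes per parameter).
PRECISIONS = {
    "16-bit": 2,
    "8-bit": 1,
}

if __name__ == "__main__":
    for model_name, num_params in MODEL_SIZES.items():
        for precision_name, nbytes in PRECISIONS.items():
            gib = weight_memory_gib(num_params, nbytes)
            print(f"{model_name:>13} @ {precision_name}: {gib:.2f} GiB")
```

At 16-bit precision the weights alone come to about 1.2 GiB for ~650M parameters versus about 5.6 GiB for 3B, which is why the smaller model leaves far more headroom on a single modest GPU.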
