Benchmarks document-parsing systems on real-world enterprise PDFs and images. It evaluates tables, charts, content faithfulness, semantic formatting, and visual grounding with human-verified, rule-level tests. Ships with ~2,000 pages, ~169K test rules, and an open evaluation framework for end-to-end pipeline scoring.
Why this matters
Enterprise document parsing is a common but brittle dependency for downstream agent workflows: a single mislocated header, dropped sentence, or mislabeled chart value can silently corrupt analytics, auditing, or automated actions. ParseBench targets those precise failure modes by converting human-verified annotations into dense, rule-level tests so teams can see not just overall accuracy but exactly where and how a parser fails in production-like documents.
Great fit if you: maintain or build document-parsing pipelines for regulated or audit-sensitive domains (finance, insurance, reporting), integrate vision+OCR+NLP components, or need diagnostic signals to prioritize engineering work. ParseBench is designed to reveal edge-case failures that break agentic workflows.
Look elsewhere if you: only need coarse text-extraction quality for casual indexing or search (lighter OCR datasets may be faster), or your documents follow a narrow, homogeneous format that ParseBench does not cover (a bespoke synthetic test might then be more cost-effective).
ParseBench complements large-scale synthetic OCR/text datasets by focusing on production-like, heterogeneous enterprise documents and on evaluation design (rule-level spot checks and structural metrics) rather than on pretraining data scale. Use it for benchmark-driven improvement and release gating rather than for model pretraining.
Each dimension uses specialized metrics: tables use a GTRM (GriTS + TableRecordMatch) structural score; charts verify chart_data_point extraction with configurable tolerances; content faithfulness uses rule-based omission/hallucination checks at word/sentence/digit levels; semantic formatting tests preservation of styles (bold, superscript, strikeout, LaTeX); layout tests grounding via bounding-box, class, and reading-order assertions. Evaluation artifacts are provided as JSONL rule files for reproducible scoring.
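To make the rule-level idea concrete, here is a minimal sketch of scoring parser output against JSONL rules. It is not ParseBench's actual evaluation code or schema: the field names (dimension, type, label, expected, rel_tol, text), the rule types must_contain and must_not_contain, and the shape of parsed_output are all assumptions chosen for illustration. Only the ideas of a chart data-point check with a configurable tolerance and rule-based omission/hallucination checks come from the description above.

```python
import json
from collections import defaultdict

# Hypothetical rule records, one JSON object per line.
# Field names are illustrative assumptions, not ParseBench's real schema.
RULES_JSONL = """\
{"dimension": "charts", "type": "chart_data_point", "label": "Q1 revenue", "expected": 12.5, "rel_tol": 0.02}
{"dimension": "charts", "type": "chart_data_point", "label": "Q2 revenue", "expected": 14.0, "rel_tol": 0.02}
{"dimension": "content", "type": "must_contain", "text": "Net income rose 4%"}
{"dimension": "content", "type": "must_not_contain", "text": "Net income fell"}
"""


def check_rule(rule, parsed):
    """Return True if the parsed output satisfies a single rule."""
    kind = rule["type"]
    if kind == "chart_data_point":
        actual = parsed["chart_values"].get(rule["label"])
        if actual is None:
            return False  # the data point was dropped entirely
        # Pass when the value is within the rule's relative tolerance.
        return abs(actual - rule["expected"]) <= rule["rel_tol"] * abs(rule["expected"])
    if kind == "must_contain":  # omission check
        return rule["text"] in parsed["text"]
    if kind == "must_not_contain":  # hallucination check
        return rule["text"] not in parsed["text"]
    raise ValueError(f"unknown rule type: {kind}")


def score(rules_jsonl, parsed):
    """Compute the pass rate per dimension over all rules."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for line in rules_jsonl.splitlines():
        if not line.strip():
            continue
        rule = json.loads(line)
        total[rule["dimension"]] += 1
        passed[rule["dimension"]] += check_rule(rule, parsed)
    return {dim: passed[dim] / total[dim] for dim in total}


# Hypothetical parser output for one page.
parsed_output = {
    "text": "Net income rose 4% year over year.",
    "chart_values": {"Q1 revenue": 12.6, "Q2 revenue": 15.0},
}

print(score(RULES_JSONL, parsed_output))
# → {'charts': 0.5, 'content': 1.0}
```

Reporting a pass rate per dimension, rather than one aggregate number, is what lets a team see that the chart extractor misread one value while the text content stayed faithful.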