Finding a single numeric KPI in a 100+ page OCR'd annual report is a classic long-context information-retrieval and extraction challenge. LEDGER frames that real-world task as two complementary, measurable problems: page retrieval (which pages contain the answer) and precise numeric extraction from the full OCR'd report. Splitting the task this way enables reproducible benchmarking of models and retrieval pipelines.
What Sets It Apart
- Two-task design: per-query TREC-style page qrels let you evaluate retrieval metrics (Recall@k, MRR, nDCG) separately from the extraction task, enabling modular research on indexing, reranking, and reader models. This separation clarifies whether errors come from retrieval or from the model’s extraction logic.
- Real long contexts at scale: reports are OCR'd into page-aligned Markdown (median ~124 pages, ~126k tokens), with an `mmd_text` field and raw `.mmd` files provided for visual/format-aware methods. The eval config contains 10,000 queries across 494 reports (2017–2022), while the larger no_eval split (~104k queries over 4,505 reports, 2009–2024) supports training and development.
- Ground-truth value provenance and graded qrels: KPI values are reconciled from SEC EDGAR XBRL, Yahoo Finance, and Alpha Vantage using a deterministic waterfall; per-page relevance is graded 0/1/2 and is directly compatible with trec_eval/pytrec_eval. Relevance judgments were mined via unit-normalized matching and validated by an LLM judge (Qwen3.6-27B).
- Practical evaluation protocol: extraction success uses a numeric tolerance (default ±0.05%) and expects structured answers (value + unit scale + page number). Baseline recall/precision numbers (Qwen3.6-27B: ~91.4/93.5; others listed) give a starting point for comparison.
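To make the extraction criterion concrete, here is a minimal sketch of a tolerance check, assuming the ±0.05% tolerance is applied relative to the ground-truth value. The `SCALES` table, the scale names, and the function names are hypothetical illustrations, not the benchmark's actual scoring code.

```python
# Hypothetical mapping from a reported unit scale to a multiplier.
SCALES = {
    "units": 1,
    "thousands": 1_000,
    "millions": 1_000_000,
    "billions": 1_000_000_000,
}

def within_tolerance(predicted: float, truth: float, rel_tol: float = 0.0005) -> bool:
    """Return True if predicted is within rel_tol (0.05% by default) of truth."""
    if truth == 0:
        # A relative tolerance is undefined at zero; require an exact match.
        return predicted == 0
    return abs(predicted - truth) <= rel_tol * abs(truth)

def is_correct(value: float, unit_scale: str, truth: float) -> bool:
    """Scale the structured answer to absolute units, then compare."""
    return within_tolerance(value * SCALES[unit_scale], truth)

# An answer of "1234.5 millions" against a ground truth of 1,234,560,000:
# the difference is 60,000, well inside the 617,280 allowed by 0.05%.
print(is_correct(1234.5, "millions", 1_234_560_000))  # True
# A coarsely rounded "1.2 billions" misses by about 2.8% and fails.
print(is_correct(1.2, "billions", 1_234_560_000))     # False
```

A tolerance this tight means a correct number with the wrong unit scale still counts as a miss, which is why the structured answer carries the scale explicitly.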
Who It's For — Tradeoffs
Great fit if you need a reproducible benchmark for long-context retrieval and numeric extraction in finance (e.g., testing RAG pipelines, long-context LLMs, OCR-aware readers, or retrieval/rerank strategies). The dataset's page-aligned qrels and full OCR text make it useful for both IR and LLM evaluation.
Look elsewhere if your focus is non-financial text, short-context QA, or multilingual document corpora (this dataset is English and tailored to corporate annual reports). Also note that OCR noise and imperfect table rendering mean methods must be robust to OCR artifacts and layout; approaches that depend on perfectly parsed tables will transfer poorly.
Additional notes
- Format and tooling: data is provided in Parquet, with .mmd files and per-page images (eval). Libraries noted: datasets, dask, polars, mlcroissant. Data license: CC-BY-4.0; code: MIT.
- Typical workflows: index pages (split on `<--- Page Split --->`), run retrieval, evaluate against `qrels`, then pass retrieved pages or full `mmd_text` + `query_text` to reader LLMs for extraction. The dataset is suited for research on long-context LLM prompting, hybrid retrieval+LLM pipelines, and OCR-aware extraction.
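The retrieval half of that workflow can be sketched in plain Python. This is a toy illustration under stated assumptions: pages are numbered from 1, qrels are a `{page: grade}` dict per query, and `rank_pages` is a hypothetical stand-in for a real retriever. The function names are illustrative and not part of the dataset's tooling.

```python
PAGE_SPLIT = "<--- Page Split --->"

def split_pages(mmd_text: str) -> dict[int, str]:
    """Split a report into pages, numbering from 1 (an assumption)."""
    parts = mmd_text.split(PAGE_SPLIT)
    return {i: part.strip() for i, part in enumerate(parts, start=1)}

def rank_pages(query: str, pages: dict[int, str]) -> list[int]:
    """Toy lexical ranker: order pages by how many query terms they contain."""
    terms = set(query.lower().split())

    def score(text: str) -> int:
        return len(terms & set(text.lower().split()))

    # Ties are broken by page number so the ordering is deterministic.
    return sorted(pages, key=lambda p: (-score(pages[p]), p))

def recall_at_k(ranked: list[int], qrels: dict[int, int], k: int) -> float:
    """Fraction of relevant pages (grade > 0) found in the top k."""
    relevant = {page for page, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)

def reciprocal_rank(ranked: list[int], qrels: dict[int, int]) -> float:
    """1 / rank of the first relevant page, or 0.0 if none is retrieved."""
    for rank, page in enumerate(ranked, start=1):
        if qrels.get(page, 0) > 0:
            return 1.0 / rank
    return 0.0

# Made-up three-page report and graded judgments for one query.
mmd_text = (
    "Cover page" + PAGE_SPLIT
    + "Total revenue was 1,234 million" + PAGE_SPLIT
    + "Notes on risk"
)
pages = split_pages(mmd_text)
qrels = {2: 2, 3: 1}  # page 2 highly relevant, page 3 partially relevant
ranked = rank_pages("total revenue", pages)

print(ranked)                          # [2, 1, 3]
print(recall_at_k(ranked, qrels, 1))   # 0.5
print(recall_at_k(ranked, qrels, 3))   # 1.0
print(reciprocal_rank(ranked, qrels))  # 1.0
```

These metrics treat any grade above 0 as relevant. For graded measures such as nDCG, or for results comparable with published numbers, the provided qrels can be scored with trec_eval or pytrec_eval.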
