Regulatory filings are the canonical source for corporate disclosures, but EDGAR's scale and its mix of raw SGML, HTML, and PDF make it hard to use directly for model training or retrieval. This release centralizes the heavy lifting: a full crawl of major SEC filings, parsed plaintext, per-document metadata, and per-filing token counts, so researchers and engineers can build finance-focused LLMs and retrieval systems without redoing large-scale crawling and parsing.
What Sets It Apart
- Scale with clear accounting: 8,055,455 major filings totalling ~590 GB of dataset content and ~43.7 billion tokens (Comma v0.1 tokenizer). So what: you get training-scale finance text with per-filing token statistics for dataset budgeting and sampling.
- Parsed + raw retention: raw filing contents (SGML/HTML/PDF) are included alongside extracted plaintext produced via selectolax + modified doc2dict and secsgml. So what: you can rely on ready-to-use plaintext or re-run custom parsing if you need different table handling or tokenization.
- Filing-level metadata and coverage: accession numbers, filing dates, filer metadata (CIK, SIC, incorporation state), and document-level fields. So what: you can filter by filing type, ticker, date range, or industry to build focused fine-tuning sets or retrieval indices.
- Tokenization-aware release: token counts per filing using a BPE tokenizer (Comma v0.1). So what: simplifies cost estimates for pretraining/fine-tuning and batching decisions.
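As a rough illustration of how the per-filing token counts can support budgeting, the sketch below derives the average filing length from the headline totals above and draws a random sample that stays within a token budget. The record layout and the `token_count` field name are assumptions made for this example, not the dataset's documented schema, and `sample_to_budget` is a hypothetical helper.

```python
import random

# Headline totals quoted above (the token total is approximate).
TOTAL_FILINGS = 8_055_455
TOTAL_TOKENS = 43.7e9

# Average tokens per filing implied by those totals.
avg_tokens_per_filing = TOTAL_TOKENS / TOTAL_FILINGS


def sample_to_budget(records, token_budget, seed=0):
    """Randomly pick filings whose combined token count fits the budget.

    `records` is any iterable of dicts with a "token_count" key
    (a hypothetical field name used only in this sketch).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    kept, used = [], 0
    for rec in shuffled:
        n = rec["token_count"]
        if used + n > token_budget:
            continue  # skip filings that would overshoot the budget
        kept.append(rec)
        used += n
    return kept, used


if __name__ == "__main__":
    print(f"average tokens per filing: {avg_tokens_per_filing:,.0f}")
    demo = [{"accession": str(i), "token_count": 1_000 * (i + 1)} for i in range(10)]
    kept, used = sample_to_budget(demo, token_budget=20_000)
    print(len(kept), "filings kept,", used, "tokens")
```

The greedy skip keeps the sample under budget at the cost of slightly favoring shorter filings near the end of the pass; a stricter sampler would stop at the first overshoot instead.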
Who It's For + Tradeoffs
Great fit if you need a large, finance-specific corpus for training or retrieval, for example company disclosure LLMs, event extraction, earnings-related RAG systems, or compliance search. The dataset reduces engineering overhead (crawling and parsing) and provides the raw inputs if you need different parsing logic.
Look elsewhere if you need filings in real time (this is a snapshot with discrete updates) or if you need financial tables already normalized into structured accounting ledgers; those tasks still require custom normalization.
Also consider compute and storage costs: working with the full corpus requires substantial disk and RAM for indexing or batching, so sampling by filing type (e.g., 10-K, 10-Q) is recommended for narrower tasks.
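Narrowing the corpus by filing type and date can be sketched as a streaming filter over per-filing metadata records. The `form_type` and `filing_date` field names and the ISO date format are assumptions for illustration; check the released schema before relying on them. `filter_filings` is a hypothetical helper, not part of the release.

```python
from datetime import date


def filter_filings(records, form_types, start, end):
    """Yield records whose form type and filing date match.

    `records` is an iterable of dicts with "form_type" and
    "filing_date" (ISO "YYYY-MM-DD") keys; both names are
    hypothetical and used only in this sketch.
    """
    wanted = {f.upper() for f in form_types}
    for rec in records:
        if rec["form_type"].upper() not in wanted:
            continue
        filed = date.fromisoformat(rec["filing_date"])
        if start <= filed <= end:
            yield rec


if __name__ == "__main__":
    sample = [
        {"form_type": "10-K", "filing_date": "2021-03-01"},
        {"form_type": "8-K", "filing_date": "2021-04-15"},
        {"form_type": "10-Q", "filing_date": "2019-08-09"},
    ]
    annual_and_quarterly = filter_filings(
        sample, ["10-K", "10-Q"], date(2020, 1, 1), date(2022, 12, 31)
    )
    for rec in annual_and_quarterly:
        print(rec["form_type"], rec["filing_date"])
```

Because the function is a generator, it can run over a streamed metadata file without loading the full corpus into memory, which matters at this scale.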
Where It Fits
Compared with general web corpora, this dataset is narrowly focused on regulatory disclosure language and structured metadata, making it better suited to financial-domain LM training and retrieval than mixed-domain crawls. It complements general-purpose corpora when you need high-quality, metadata-tagged filings for finance-specific tasks.
