Wikimedia Structured Contents Dataset

Provides pre-parsed Parquet snapshots of English and French Wikipedia articles with structured fields (sections, infoboxes, tables, references, images) and credibility signals — optimized for large-scale analysis, retrieval-augmented generation, and model development.

Introduction

Why this matters

Pre-parsed Wikipedia at scale removes a major engineering bottleneck for ML teams: converting heterogeneous wikitext and HTML into consistent, queryable structures. This dataset applies one pinned Arrow/Parquet schema to roughly 10.5M article rows (English and French) and includes parsed references, tables, infoboxes, article sections, images, and credibility signals, so researchers and engineers can focus on modeling and evaluation instead of fragile scraping and parsing.

What Sets It Apart
  • Unified, pinned schema (Parquet/Arrow): every shard conforms byte-for-byte to a single schema, making it immediately usable with DuckDB, Polars, pandas, PyArrow, and streaming loaders. This reduces preprocessing variance when training or benchmarking models.
  • Rich structural parsing: references and citations are parsed and linked (including scoring/credibility signals), tables are extracted as first-class objects, and infoboxes/sections are normalized. That makes this dataset especially useful for RAG index construction, citation-aware generation, and table-based extraction tasks.
  • Production-oriented packaging: data is sharded into Parquet files with zstd compression and includes examples and code snippets for streaming loads and DuckDB queries, enabling large-batch analytics and single-query exploration without heavy ETL.
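
As a rough illustration of the DuckDB query pattern mentioned above, the sketch below builds a preview query over a glob of Parquet shards. The function names, the `SELECT *` projection, and the `<org>/<dataset>` path are placeholders assumed for illustration; the real repository id, shard layout, and column names should be taken from the dataset card.

```python
def build_shard_query(parquet_glob: str, limit: int = 5) -> str:
    """Return a DuckDB SQL string that previews rows from Parquet shards."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # read_parquet accepts glob patterns; single quotes are escaped by doubling.
    safe_glob = parquet_glob.replace("'", "''")
    return f"SELECT * FROM read_parquet('{safe_glob}') LIMIT {int(limit)}"


def preview_shards(parquet_glob: str, limit: int = 5):
    """Run the preview query (needs the duckdb package; hf:// needs network access)."""
    import duckdb  # imported lazily so build_shard_query works without DuckDB installed

    return duckdb.sql(build_shard_query(parquet_glob, limit)).fetchall()


# Hypothetical placeholder path: substitute the real repo id from the dataset card.
EXAMPLE_GLOB = "hf://datasets/<org>/<dataset>/**/*.parquet"

if __name__ == "__main__":
    print(build_shard_query(EXAMPLE_GLOB, limit=3))
```

Keeping the `LIMIT` small while exploring avoids pulling more of the ~44 GiB snapshot than a query needs; projecting only the required columns instead of `SELECT *` should reduce transfer further.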

Who it's great for, and tradeoffs

Great fit if you need a ready-made, structured Wikipedia corpus for building retrieval indices, training/fine-tuning LLMs, benchmarking extraction or table understanding models, or doing large-scale analysis across English and French Wikipedia. The dataset’s credibility signals and links to Wikidata QIDs help when you need provenance-aware retrieval.

Look elsewhere if you require raw wikitext or a wider set of language editions: this release is a beta snapshot focused on the English and French main namespaces, and some highly polymorphic fields are stored as JSON-encoded strings to preserve a unified Parquet schema. Very deep nesting may be flattened to meet columnar-tooling limits.

Practical notes
  • Snapshot metadata: ~10,468,881 total rows; ~44.42 GiB of Parquet; snapshot extraction timestamp 2026-05-13. To query shards directly over an hf:// path, use the streaming-reader or DuckDB patterns provided in the dataset card.
  • Attribution & license: dataset distributed under CC BY-SA 4.0; follow the Wikimedia Attribution Framework and surface source links and modification notices when reusing content in downstream systems.
  • Common preprocessing: infoboxes, tables, and references[].metadata may be JSON-encoded strings, so apply json.loads() to them during ingestion. Also consider filtering by version.identifier to avoid duplicates.
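
The preprocessing steps above can be sketched in plain Python. This is a minimal example, not the dataset's official loader: it assumes each row arrives as a dict, that `version` is a nested dict with an `identifier` key, and that `references` is a list of dicts. The helper names (`decode_row`, `dedupe_by_version`) are made up for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator

# Top-level fields that the notes say may arrive as JSON-encoded strings.
JSON_FIELDS = ("infoboxes", "tables")


def _maybe_load(value: Any) -> Any:
    """Decode value if it is a valid JSON string; otherwise return it unchanged."""
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value
    return value


def decode_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of row with JSON-encoded fields decoded."""
    out = dict(row)
    for field in JSON_FIELDS:
        if field in out:
            out[field] = _maybe_load(out[field])
    refs = out.get("references")
    if isinstance(refs, list):
        decoded = []
        for ref in refs:
            if isinstance(ref, dict) and "metadata" in ref:
                ref = {**ref, "metadata": _maybe_load(ref["metadata"])}
            decoded.append(ref)
        out["references"] = decoded
    return out


def dedupe_by_version(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield the first row seen for each version.identifier value."""
    seen = set()
    for row in rows:
        version = row.get("version")
        key = version.get("identifier") if isinstance(version, dict) else None
        if key is None:
            yield row  # nothing to compare on, so pass the row through
            continue
        if key not in seen:
            seen.add(key)
            yield row
```

A typical ingestion loop would chain the two, for example `(decode_row(r) for r in dedupe_by_version(rows))`. The in-memory `seen` set grows with the number of distinct identifiers, so for the full snapshot a SQL-side deduplication may be a better fit.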

Where it fits

This dataset sits between raw Wikimedia dumps (which require heavy parsing) and curated task datasets (which are smaller): it’s ideal when you want large, structured encyclopedic content ready for indexing, grounding, or model training without building a bespoke parser.

Information

  • Website: huggingface.co
  • Authors: Wikimedia Enterprise, Wikimedia Foundation
  • Published date: 2024/09/19

Categories