
Wikimedia / Wikipedia (HuggingFace dataset)

Provides cleaned, per-language snapshots of Wikipedia articles (id, url, title, text) packaged as Hugging Face dataset configs (Parquet). Covers 300+ language configs across dated dumps, which makes it useful for language modeling, multilingual NLP, retrieval, and RAG pipelines.

Introduction

Massive, timestamped Wikipedia snapshots are a default training and evaluation source for many NLP workflows, but raw dumps include wiki markup, references, and noisy metadata. This Hugging Face dataset supplies cleaned article text organized by language and dump date (one config per language/date pair), so you can load a consistent, plain-text corpus for modeling or retrieval without re-parsing the dumps yourself.

What Sets It Apart
  • Per-language, dated configs: Each language + dump date is a separate Hugging Face config (e.g., 20231101.en), so you can reproduce experiments against a fixed snapshot or compare across dates.
  • Cleaned and columnar format: Articles are stripped of wiki markup and unwanted sections and stored in Parquet for efficient I/O and distributed processing at scale.
  • Extremely broad coverage: Includes hundreds of Wikipedias (from very small to very large), letting you train or evaluate multilingual models and low-resource language experiments using the same interface.
  • Clear licensing pointers: Source content is governed by Wikimedia licenses (CC-BY-SA and GFDL), which you must consider for redistribution and model release.
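
As a minimal sketch of how the per-language, dated configs above might be addressed from code: the helper below builds a config name in the `20231101.en` form and passes it to `datasets.load_dataset`. The helper names (`config_name`, `load_snapshot`) are hypothetical, and the `wikimedia/wikipedia` repository id and the `train` split are assumptions about how the dataset is published on the Hub, so check the dataset card before relying on them.

```python
import re

# Assumed Hub repository id for this dataset; verify on the dataset card.
DATASET_ID = "wikimedia/wikipedia"


def config_name(dump_date: str, lang: str) -> str:
    """Build a config name such as '20231101.en' from a dump date and a language code."""
    if not re.fullmatch(r"\d{8}", dump_date):
        raise ValueError(f"dump_date must be YYYYMMDD, got {dump_date!r}")
    if not lang:
        raise ValueError("lang must be a non-empty language code")
    return f"{dump_date}.{lang}"


def load_snapshot(dump_date: str, lang: str, streaming: bool = False):
    """Load one language/date snapshot (hypothetical helper).

    The import is done lazily so config_name() works without `datasets` installed.
    The 'train' split is an assumption about this dataset's layout.
    """
    from datasets import load_dataset

    return load_dataset(
        DATASET_ID,
        config_name(dump_date, lang),
        split="train",
        streaming=streaming,
    )
```

Pinning the date and language in one place keeps an experiment tied to a fixed snapshot; comparing two dumps then only means calling `load_snapshot` with a different `dump_date`.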

Who It's For & Tradeoffs

Great fit if you need reproducible, per-date Wikipedia text for language modeling, fine-tuning, retrieval/embeddings, or multilingual evaluation. It reduces preprocessing overhead and integrates with the Hugging Face datasets ecosystem.

Look elsewhere if you require curated, human-annotated labels, sentence-level quality guarantees, or up-to-the-minute article changes. Dumps are snapshots and may contain outdated or biased content, and some language subsets are very large (GBs to TBs), requiring significant storage and bandwidth. Also review license obligations before releasing derived models or redistributing cleaned extracts.
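
One way to work around the storage and bandwidth cost of the larger subsets is to stream records and keep only a bounded sample. The sketch below assumes each record is a dict with the `id`, `url`, `title`, and `text` fields described above. `take_articles` and `sample_from_hub` are hypothetical helper names, and the `wikimedia/wikipedia` repository id and `train` split are assumptions to verify against the dataset card.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take_articles(
    articles: Iterable[Dict[str, Any]], n: int, min_chars: int = 0
) -> List[Dict[str, Any]]:
    """Return up to n records whose 'text' field has at least min_chars characters.

    Works on any iterable of dict records, so it stops reading a stream as soon
    as n matching records have been collected.
    """
    kept = (a for a in articles if len(a.get("text", "")) >= min_chars)
    return list(islice(kept, n))


def sample_from_hub(config: str, n: int, min_chars: int = 0) -> List[Dict[str, Any]]:
    """Stream one config from the Hub and keep a small sample (hypothetical helper).

    streaming=True iterates over records without downloading the full subset first.
    """
    from datasets import load_dataset  # lazy import; only needed for Hub access

    stream = load_dataset("wikimedia/wikipedia", config, split="train", streaming=True)
    return take_articles(stream, n, min_chars)
```

For example, `sample_from_hub("20231101.en", 100, min_chars=500)` would collect 100 longer articles for a quick inspection pass. A sample taken this way follows dump order and is not a random sample.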
