A 57-subject multiple-choice benchmark for measuring broad language understanding in LLMs. Provides per-subject configs and test/dev/auxiliary_train splits for few- and zero-shot evaluation; widely used for model comparison and academic reporting.
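A few-shot prompt for a benchmark like this is usually assembled from dev-split examples followed by the unanswered test question. The sketch below assumes each record has `question`, `choices` (four strings), and `answer` (an integer index); these field names are an assumption about the schema, and `format_example` and `build_prompt` are hypothetical helper names, not part of any library.

```python
from typing import Mapping, Optional, Sequence

LETTERS = "ABCD"  # assumes four answer options per question


def format_example(question: str, choices: Sequence[str],
                   answer: Optional[int] = None) -> str:
    """Render one question; leave the answer blank when `answer` is None."""
    lines = [question.strip()]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    suffix = f" {LETTERS[answer]}" if answer is not None else ""
    lines.append("Answer:" + suffix)
    return "\n".join(lines)


def build_prompt(dev_examples: Sequence[Mapping], test_example: Mapping,
                 k: int = 5) -> str:
    """Join up to k solved dev examples with the unsolved test question.

    k=0 yields a zero-shot prompt containing only the test question.
    """
    shots = [
        format_example(e["question"], e["choices"], e["answer"])
        for e in dev_examples[:k]
    ]
    shots.append(format_example(test_example["question"],
                                test_example["choices"]))
    return "\n\n".join(shots)
```

Setting `k=0` gives the zero-shot variant, so the same helper covers both evaluation modes mentioned in the entry.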
Provides cleaned, per-language snapshots of Wikipedia articles (id, url, title, text) packaged as Hugging Face dataset configs (Parquet). Covers 300+ language configs and dated dumps — useful for language modeling, multilingual NLP, retrieval, and RAG pipelines.
Benchmark dataset of ~8.5k grade-school math word problems with step-by-step solutions and calculator annotations for evaluating multi-step arithmetic reasoning in language models. Provided in two configs (main and socratic) and commonly used for chain-of-thought prompting, fine-tuning, and verifier training.
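Scoring against such solutions usually means pulling out the calculator annotations and the final answer. The sketch below assumes the common convention in which annotations are written as `<<expression=result>>` and the final answer follows a `####` marker on the last line; that format is an assumption here, and the helper names and sample string are illustrative only.

```python
import re
from typing import List, Optional, Tuple

# Assumed annotation format: <<expression=result>>
CALC_RE = re.compile(r"<<([^<>]*?)=([^<>]*?)>>")
# Assumed final-answer format: "#### 72" (commas allowed, e.g. "1,000")
FINAL_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def calculator_steps(solution: str) -> List[Tuple[str, str]]:
    """Return (expression, result) pairs in the order they appear."""
    return [(m.group(1), m.group(2)) for m in CALC_RE.finditer(solution)]


def final_answer(solution: str) -> Optional[str]:
    """Return the final answer as a string without thousands separators."""
    m = FINAL_RE.search(solution)
    return m.group(1).replace(",", "") if m else None


def strip_annotations(solution: str) -> str:
    """Remove calculator annotations, leaving readable reasoning text."""
    return CALC_RE.sub("", solution)


sample = (
    "Half of 48 is 48/2 = <<48/2=24>>24.\n"
    "The total is 48+24 = <<48+24=72>>72.\n"
    "#### 72"
)
print(calculator_steps(sample))
print(final_answer(sample))
```

Comparing `final_answer` outputs as normalized strings avoids floating-point surprises; converting to a numeric type is left to the caller.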
Canonical ILSVRC ImageNet-1k for 1,000-way image classification: roughly 1.28M training images plus 50k validation and 100k test images, packaged as optimized Parquet for easy loading with Hugging Face Datasets, Dask, and Polars. Verify licensing and distribution constraints before use.
Provides human preference comparison pairs and red-team conversation transcripts collected by Anthropic for training preference/reward models and studying harmful model behaviors; intended for RLHF and safety research, not for supervised fine-tuning of dialogue agents.
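Reward-model training typically needs each comparison split into a shared prompt and two candidate responses. The sketch below assumes each preference record carries `chosen` and `rejected` fields holding full transcripts with `\n\nHuman:` and `\n\nAssistant:` turn markers, and that the two transcripts differ only in the final assistant turn. Both points are assumptions about the data layout, and `split_pair` is a hypothetical helper name.

```python
from typing import Dict

ASSISTANT_MARKER = "\n\nAssistant:"  # assumed turn marker


def split_pair(chosen: str, rejected: str) -> Dict[str, str]:
    """Split two transcripts into a shared prompt and two final responses.

    Raises ValueError if either transcript has no assistant turn or if the
    text before the final assistant turn differs between the two.
    """
    c_idx = chosen.rfind(ASSISTANT_MARKER)
    r_idx = rejected.rfind(ASSISTANT_MARKER)
    if c_idx == -1 or r_idx == -1:
        raise ValueError("no assistant turn found")

    c_end = c_idx + len(ASSISTANT_MARKER)
    r_end = r_idx + len(ASSISTANT_MARKER)
    if chosen[:c_end] != rejected[:r_end]:
        raise ValueError("transcripts do not share a prompt")

    return {
        "prompt": chosen[:c_end],
        "chosen": chosen[c_end:].strip(),
        "rejected": rejected[r_end:].strip(),
    }
```

Records that raise `ValueError` are worth inspecting by hand rather than silently dropping, since a mismatch may indicate a different transcript layout (for example, the red-team transcripts) rather than bad data.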
Community-curated collection of ChatGPT-style prompts mirrored as a Hugging Face dataset; organized by task and model compatibility for quick reuse. Useful for prompt engineering, text-generation prototyping, and building conversational examples across multiple LLMs.
Contains short, small-vocabulary stories synthetically generated by GPT-3.5 and GPT-4 for training and evaluating compact language models. Includes multiple splits, a GPT-4-only V2 subset, and archive files with prompts and metadata for reproducible experiments.
Provides a multilingual, deduplicated corpus of public source code in Parquet for large-scale model training and evaluation. Includes license metadata, language splits, and streaming-friendly packaging for use with Hugging Face Datasets — suited to training code-focused foundation models but requires careful license/provenance review.
Provides 300k annotated multilingual text examples for identifying and masking personally identifiable information (PII) across multiple domains and languages (EN, FR, DE, IT, ES, NL). Intended for training and evaluating token-level PII detection and masking models; includes a DOI for citation.
Provides a cleaned, deduplicated English web corpus optimized for LLM pretraining: over 15T tokens drawn from CommonCrawl, with per-dump snapshots and smaller sampled configs (10B/100B/350B tokens). Built with the datatrove processing pipeline and MinHash deduplication, and released under the ODC-By v1.0 license; suited to large-scale model training and ablation studies but not specialized for code.
Provides ~1.3 trillion tokens of web pages filtered for educational quality using a classifier trained on LLM-generated annotations; includes per-crawl configs, smaller random samples (10B/100B/350B tokens), and the classifier code and model for reproducible filtering.
Provides leaderboard-ready test splits for the Open ASR Leaderboard: replaces unsafe custom loading scripts with Parquet files, sorts samples by audio length, and packages eight ESB test sets (LibriSpeech, Common Voice, GigaSpeech, SPGISpeech, etc.) for reproducible ASR benchmarking.