AIAny - AI Dataset

AI Dataset · 2022

Measuring Massive Multitask Language Understanding (MMLU)

cais (Hugging Face dataset curator), Dan Hendrycks et al. (original MMLU paper)

A 57-subject multiple-choice benchmark for measuring broad language understanding in LLMs; provides per-subject configs and test/dev/auxiliary_train splits for few-/zero-shot evaluation, widely used for model comparison and academic reporting.

huggingface NLP LLM pandas transformers+1
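As a rough illustration of the few-shot evaluation the card describes, the sketch below builds a prompt from dev-split examples followed by one unanswered test question. It assumes each record has `question`, `choices` (four strings), and `answer` (an integer index) fields. The helper names and the sample records are hypothetical, and the header wording follows a commonly used convention rather than anything this listing specifies.

```python
LETTERS = "ABCD"


def format_question(record, include_answer=False):
    """Render one record as a lettered multiple-choice block."""
    lines = [record["question"].strip()]
    for letter, choice in zip(LETTERS, record["choices"]):
        lines.append(f"{letter}. {choice}")
    # Few-shot examples show the gold letter; the test question leaves it blank.
    answer = f" {LETTERS[record['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_prompt(subject, dev_examples, test_record):
    """Concatenate a header, answered dev examples, and the open test question."""
    header = (
        "The following are multiple choice questions (with answers) about "
        f"{subject.replace('_', ' ')}."
    )
    shots = [format_question(ex, include_answer=True) for ex in dev_examples]
    return "\n\n".join([header, *shots, format_question(test_record)])


# Hypothetical records, shaped like the assumed schema above.
dev = [{"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}]
test = {"question": "What is 3 * 3?", "choices": ["6", "7", "8", "9"], "answer": 3}

prompt = build_prompt("elementary_mathematics", dev, test)
print(prompt)
```

Passing an empty `dev_examples` list yields a zero-shot prompt with the same layout.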

AI Dataset · 2022

GitHub Code Dataset

codeparrot

Provides 115M public GitHub source files (≈873GB of code, ~1TB uncompressed) with per-file metadata (repo, path, language, license). Supports streaming, language/license filtering and full download for training and evaluating code LLMs and code generation models.

github code huggingface llm ai-coding+3

AI Dataset · 2022

Wikimedia / Wikipedia (HuggingFace dataset)

Wikimedia

Provides cleaned, per-language snapshots of Wikipedia articles (id, url, title, text) packaged as Hugging Face dataset configs (Parquet). Covers 300+ language configs and dated dumps — useful for language modeling, multilingual NLP, retrieval, and RAG pipelines.

huggingface multilingual nlp LLM transformers+2

AI Dataset · 2022

Grade School Math 8K (GSM8K)

OpenAI

Benchmark dataset of ~8.5k grade-school math word problems with step-by-step solutions and calculator annotations for evaluating multi-step arithmetic reasoning in language models. Provided in two configs (main and socratic) and commonly used for chain-of-thought prompting, fine-tuning, and verifier training.

math nlp openai huggingface paper+1
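The step-by-step solutions and calculator annotations mentioned in the card can be parsed with a couple of regular expressions. This is a minimal sketch that assumes the published solution format: calculator annotations written as `<<expression=result>>` and a final line of the form `#### answer`. The function names are hypothetical, and the sample string is written in that format for illustration.

```python
import re

# <<48/2=24>> style calculator annotation: expression, then result.
CALC_RE = re.compile(r"<<([^<>=]+)=([^<>]+)>>")
# Final answer marker: "#### 72", allowing a sign, commas, and decimals.
FINAL_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(solution):
    """Return the final numeric answer as a string without commas, or None."""
    match = FINAL_RE.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def calculator_steps(solution):
    """Return (expression, result) pairs from the calculator annotations."""
    return [(expr.strip(), res.strip()) for expr, res in CALC_RE.findall(solution)]


def strip_annotations(solution):
    """Remove the annotations, leaving plain step-by-step text."""
    return CALC_RE.sub("", solution)


sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

print(extract_final_answer(sample))
print(calculator_steps(sample))
```

Comparing `extract_final_answer` on a model's output against the reference solution gives a simple exact-match accuracy check.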

AI Dataset · 2022

ILSVRC/imagenet-1k

ILSVRC

Canonical ILSVRC ImageNet-1k for 1,000-way image classification. Provides roughly 1.28M labeled training images plus 50k validation and 100k test images, packaged as optimized Parquet for easy loading with Hugging Face Datasets, Dask, and Polars. Verify licensing and distribution constraints before use.

vision image huggingface ai-image

AI Dataset · 2022

HH-RLHF (Anthropic)

Anthropic

Provides human preference comparison pairs and red-team conversation transcripts collected by Anthropic for training preference/reward models and studying harmful model behaviors; intended for RLHF and safety research, not for supervised fine-tuning of dialogue agents.

anthropic huggingface RL NLP LLM+2

AI Dataset · 2022

prompts.chat (Awesome ChatGPT Prompts)

fka

Community-curated collection of ChatGPT-style prompts mirrored as a Hugging Face dataset; organized by task and model compatibility for quick reuse. Useful for prompt engineering, text-generation prototyping, and building conversational examples across multiple LLMs.

prompt-engineering chatbot chatgpt llm ai+3

AI Dataset · 2023

IlyaGusev/habr

IlyaGusev

Contains tech-blog posts scraped from Habr (primarily Russian, some English) in Parquet format with ~100K–1M records. Suited for multilingual text-generation and language-model fine-tuning; license is not specified, so verify before redistribution.

huggingface nlp multilingual pandas polars+1

AI Dataset · 2023

TinyStories

roneneldan

Contains short, small-vocabulary stories synthetically generated by GPT-3.5 and GPT-4 for training and evaluating compact language models. Includes multiple splits, a GPT-4-only V2 subset, and archive files with prompts and metadata for reproducible experiments.

huggingface NLP LLM prompt-engineering transformers+1

AI Dataset · 2024

bigcode/the-stack-v2

bigcode

Provides a multilingual, deduplicated corpus of public source code in Parquet for large-scale model training and evaluation. Includes license metadata, language splits, and streaming-friendly packaging for use with Hugging Face Datasets — suited to training code-focused foundation models but requires careful license/provenance review.

huggingface code foundation-model llm ai-train+3

AI Dataset · 2024

ai4privacy/pii-masking-300k

ai4privacy

Provides 300k annotated multilingual text examples for identifying and masking personally identifiable information (PII) across multiple domains and languages (EN, FR, DE, IT, ES, NL). Intended for training and evaluating token-level PII detection and masking models; includes a DOI for citation.

huggingface nlp privacy translation multilingual+1
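To make the masking task concrete, here is a small sketch that replaces annotated character spans with label placeholders. The span dictionaries (`start`, `end`, `label`), the label names, and the sample sentence are all hypothetical stand-ins; the dataset's actual field names and label set should be checked on its card.

```python
def mask_pii(text, spans):
    """Replace each annotated character span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["start"] < cursor:
            # Skip spans that overlap one already masked.
            continue
        pieces.append(text[cursor:span["start"]])
        pieces.append(f"[{span['label']}]")
        cursor = span["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_for(text, value, label):
    """Build a span for the first occurrence of value (illustrative helper)."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "label": label}


text = "Contact Jane Doe at jane@example.com."
spans = [
    span_for(text, "jane@example.com", "EMAIL"),
    span_for(text, "Jane Doe", "NAME"),
]

masked = mask_pii(text, spans)
print(masked)
```

Sorting by start offset lets the spans arrive in any order, and the overlap check keeps a nested annotation from corrupting the output.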

AI Dataset · 2024

FineWeb

HuggingFaceFW

Provides a cleaned, deduplicated English web corpus optimized for LLM pretraining—over 15T tokens aggregated from CommonCrawl with per-dump snapshots and smaller sampled configs (10B/100B/350B). Includes the datatrove processing pipeline, MinHash deduplication, and an ODC-By v1.0 license; suited for large-scale model training and ablation studies but not specialized for code.

huggingface llm nlp foundation-model ai-train+1
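The MinHash deduplication mentioned in the card can be illustrated with a toy sketch: each document is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the documents' word n-gram sets. This is not the datatrove implementation. The shingle size, number of hash functions, and function names below are assumptions chosen for illustration, and a production pipeline would add banding to avoid comparing every pair.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_hashes=64, n=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for g in grams
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank"
doc_c = "completely different sentence about training large language models on web data"

sim_same = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
sim_diff = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
print(sim_same, sim_diff)
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates, with one copy kept.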

