AI Dataset 2022

Grade School Math 8K (GSM8K)

Benchmark dataset of ~8.5k grade-school math word problems with step-by-step solutions and calculator annotations for evaluating multi-step arithmetic reasoning in language models. Provided in two configs (main and socratic) and commonly used for chain-of-thought prompting, fine-tuning, and verifier training.

Introduction

LLM failures on elementary arithmetic usually arise not because models cannot compute a single operation, but because they fail to reliably chain several correct arithmetic steps while reasoning in natural language. GSM8K provides a compact, high-quality set of grade-school word problems with human-written multi-step solutions and embedded calculation annotations to isolate and measure that capability.

What Sets It Apart

Human-crafted, linguistically diverse problems (≈8.5K instances) designed to require 2–8 reasoning steps, with explicit calculator-style annotations that make intermediate arithmetic visible. This makes it suitable for evaluating both final-answer accuracy and the quality of intermediate reasoning traces.
Two configurations: “main” (question + final multi-step solution) and “socratic” (same solutions broken into guided sub-questions), enabling experiments that compare direct chain-of-thought prompting vs. stepwise Socratic approaches.
Small enough to iterate quickly (train: 7473 / test: 1319) yet widely adopted, so performance numbers are comparable across papers and leaderboards.
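
The calculator annotations and final answers described above can be read programmatically. The following is a minimal sketch, assuming solutions use the `<<expression=result>>` annotation form and end with a `#### answer` line; the sample string and the helper names (`safe_eval`, `parse_solution`, `check_steps`) are illustrative and not part of any official tooling.

```python
import ast
import operator
import re

# Calculator annotations are assumed to look like <<expression=result>>.
ANNOTATION = re.compile(r"<<([^<>=]+)=([^<>]+)>>")
# The final answer is assumed to follow a "####" marker.
FINAL = re.compile(r"####\s*(-?[\d,]*\.?\d+)")

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr):
    """Evaluate +, -, *, / arithmetic without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval"))


def parse_solution(solution):
    """Return ([(expression, annotated_result), ...], final_answer)."""
    steps = [
        (expr.strip(), float(result.replace(",", "")))
        for expr, result in ANNOTATION.findall(solution)
    ]
    match = FINAL.search(solution)
    final = float(match.group(1).replace(",", "")) if match else None
    return steps, final


def check_steps(steps, tol=1e-6):
    """Recompute each annotated expression and compare with its stated result."""
    return [abs(safe_eval(expr) - result) < tol for expr, result in steps]


# Illustrative solution text written in the assumed annotation format.
sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

steps, final = parse_solution(sample)
print(steps)               # [('48/2', 24.0), ('48+24', 72.0)]
print(final)               # 72.0
print(check_steps(steps))  # [True, True]
```

Recomputing each annotation separately from the final answer makes it possible to tell whether an error comes from a single arithmetic slip or from the overall chain of reasoning.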

Who It’s For — Tradeoffs

Great fit if you need a focused benchmark to measure or improve multi-step arithmetic reasoning in English LLMs (prompting strategies, chain-of-thought, verifier training, fine-tuning). Look elsewhere if you need large-scale or domain-specific math (college-level math, symbolic manipulation), multilingual coverage, or tasks that test non-arithmetic reasoning. The dataset is licensed under MIT, making it easy to reuse, but its narrow scope means strong performance on GSM8K does not guarantee broad mathematical competence.
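
Measuring a model on this kind of benchmark usually comes down to final-answer accuracy. A rough sketch follows, assuming reference solutions end with `#### answer` and that the last number in a model's output is taken as its prediction; this is a common convention rather than an official scoring script, and the function names and example strings are hypothetical.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number mentioned in text as a float, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reference_answer(solution):
    """Read the gold answer after the '####' marker (assumed format)."""
    return float(solution.split("####")[-1].strip().replace(",", ""))


def accuracy(predictions, references, tol=1e-6):
    """Fraction of predictions whose last number matches the gold answer."""
    correct = 0
    for pred, ref in zip(predictions, references):
        value = last_number(pred)
        if value is not None and abs(value - reference_answer(ref)) < tol:
            correct += 1
    return correct / len(references)


# Hypothetical model outputs and reference solutions.
predictions = [
    "She sold 24 clips in May, so the answer is 72.",
    "He earns 10 dollars.",
]
references = [
    "48/2 = <<48/2=24>>24\n48+24 = <<48+24=72>>72\n#### 72",
    "9*2 = <<9*2=18>>18\n#### 18",
]

print(accuracy(predictions, references))  # 0.5
```

Taking the last number is a lenient heuristic: it can credit a correct figure reached through faulty reasoning, which is one reason a high score here says little about broader mathematical competence.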

Information

Website: huggingface.co
Authors: OpenAI
Published date: 2022/04/12

More Items

AI Dataset 2026

SynthComp

t-tech

Evaluates retrievers and search agents on synthetic multi-hop questions that require assembling a complete set of supporting evidence. Provides English and Russian variants (395 questions each), a fixed dense index embedded with Qwen3-Embedding-8B, and BrowseComp-Plus evaluation integrations.

Tags: qwen, evaluation, retrieval, web-search, benchmark (+6 more)

AI Dataset 2026

VideoChat3-Academic2M

MCG-NJU

Provides re-annotated academic video instruction data for captioning, video QA, and fine-grained motion understanding; rewrites short answers and concise captions into evidence-grounded, instruction-following responses and supplies JSONL annotation files (original videos not included).

Tags: video, ai-video, multimodal, vision, huggingface (+1 more)

AI Dataset 2026

TRuST

t-tech

Provides 324 Russian short-answer web-search tasks with gold supporting documents to evaluate fixed-index retrievers and search agents. Tasks span eight topical categories and five retrieval challenge types (multihop, structured evidence, temporal, entity disambiguation, comparative) and use a Qwen3-Embedding-8B index for evaluation.

Tags: qwen, evaluation, embeddings, nlp, llm (+4 more)
