
GLM-5.1-1000000x

Provides 1,003,589 full chain-of-thought reasoning traces and final answers generated by GLM-5.1, split into main/Math/PHD-Science/Multilingual-STEM subsets. Useful for instruction-tuning, supervised fine-tuning, and reasoning experiments; released under Apache-2.0.

Introduction

Why this matters

Supervised datasets that contain full chain-of-thought (CoT) traces are still rare, even at the scale of a few hundred thousand records. This dataset supplies 1,003,589 reasoning traces distilled from GLM-5.1. Each record pairs a user prompt with an assistant response that includes an explicit thinking trace plus a final answer, which makes it a practical resource when you want model behavior shaped by stepwise reasoning rather than by final-label supervision alone.

What Sets It Apart
  • Full CoT traces, not just labels: each assistant message contains an explicit <think>...</think> trace followed by the final answer, so you can train models to produce intermediate reasoning steps or strip the traces and train on final answers only.
  • Scale and domain slices: ~1.0M records (~5.36B estimated tokens) with dedicated subsets (main, Math, PHD-Science, Multilingual-STEM) that let you target general reasoning, math, graduate-level science, or multilingual STEM tasks separately.
  • Teacher provenance and reproducibility: traces were generated by GLM-5.1 in response to prompts originating in the KIMI-K2.5-1000000x source, which makes experimental lineage easier to trace when comparing distilled teachers or reproducing training runs.
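
The <think>...</think> layout described above lends itself to a simple split into trace and final answer. The following is a minimal sketch, assuming each assistant message holds at most one <think>...</think> pair that precedes the answer; the function names split_trace and strip_trace are hypothetical and not part of the dataset or any library.

```python
import re
from typing import Tuple

# Assumption: one <think>...</think> pair per assistant message,
# placed before the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_trace(message: str) -> Tuple[str, str]:
    """Return (thinking_trace, final_answer) for one assistant message."""
    match = THINK_RE.search(message)
    if match is None:
        # No trace found: treat the whole message as the answer.
        return "", message.strip()
    trace = match.group(1).strip()
    answer = message[match.end():].strip()
    return trace, answer


def strip_trace(message: str) -> str:
    """Keep only the final answer, for answer-only supervision."""
    return split_trace(message)[1]
```

For example, split_trace("<think>2 + 2 is 4</think>\nThe answer is 4.") separates the reasoning from the answer, while strip_trace keeps only the answer. Records that do not match the assumed layout pass through unchanged as answers, so it is worth counting how many fall into that branch before training.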

Who It's For (and trade-offs)

Great fit if you want to: fine-tune an LLM to produce explicit reasoning steps, evaluate reasoning robustness across STEM domains, or distill teacher behavior into a smaller model. The dataset’s size and long average trace lengths make it valuable for experiments that require many varied CoT examples.

Look elsewhere if you need only short answer labels, strictly human-annotated CoT (this is model-generated), or cleaned, peer-reviewed problem sets: synthetic teacher traces can include mistakes, shortcuts, or bias patterns inherited from GLM-5.1. Also consider compute cost. Training on the full ~5.36B estimated tokens, with long per-record token counts, requires substantial GPU/TPU resources or careful curriculum design and data selection.
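
To make the compute trade-off concrete, the figures quoted on this page imply an average of roughly 5,300 estimated tokens per record. The sketch below computes that average and shows one simple way to pick a random subset under a token budget. It is an illustration under stated assumptions, not a recommended recipe: select_within_budget is a hypothetical helper, and the per-record token counts are assumed to come from whatever tokenizer you use.

```python
import random
from typing import List, Tuple

# Figures quoted on this page; the token count is an estimate.
TOTAL_RECORDS = 1_003_589
ESTIMATED_TOKENS = 5_360_000_000


def mean_tokens_per_record(total_tokens: int, total_records: int) -> float:
    """Average estimated tokens per record."""
    return total_tokens / total_records


def select_within_budget(
    token_counts: List[int], budget: int, seed: int = 0
) -> Tuple[List[int], int]:
    """Visit records in seeded random order and keep each one that still fits.

    Returns (sorted indices of kept records, total tokens used).
    Records larger than the remaining budget are skipped, not truncated.
    """
    order = list(range(len(token_counts)))
    random.Random(seed).shuffle(order)
    chosen, used = [], 0
    for i in order:
        if used + token_counts[i] <= budget:
            chosen.append(i)
            used += token_counts[i]
    return sorted(chosen), used


if __name__ == "__main__":
    avg = mean_tokens_per_record(ESTIMATED_TOKENS, TOTAL_RECORDS)
    print(f"average tokens per record: {avg:.0f}")
```

Because records that no longer fit are skipped, the selection leans toward shorter traces as the budget fills up. If long traces matter for your experiment, consider stratifying by length instead.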

Where It Fits

Use this dataset alongside smaller human-CoT resources when you want breadth and scale, and combine it with validation sets of human-verified reasoning to detect teacher errors. For instruction-tuning pipelines, its Apache-2.0 license simplifies reuse, but always validate on held-out, high-quality benchmarks before deployment.
