Most public fine-tuning corpora are either small human-annotated chain-of-thought (CoT) sets or huge unstructured web corpora. This dataset fills the niche between them: one million distilled reasoning traces and instruction–response examples produced for reasoning-heavy tasks, with targeted coverage of coding and STEM domains.
What Sets It Apart
- High-volume reasoning traces: 1,000,000 entries and about 5 billion tokens (roughly 5,000 tokens per entry on average), aimed at instruction tuning and supervised fine-tuning (SFT) rather than raw pretraining. That scale makes it practical for SFT or distillation experiments without full-scale pretraining costs.
- Domain-balanced subsets: roughly 50% coding, 20% science, and 15% math, plus dedicated files for PHD-Science, General-Math (200k), and MultilingualSTEM (100k). This mix targets models that need stronger programmatic and STEM reasoning.
- Machine-distilled origin and tooling: the data was generated with a modified Datagen pipeline (credited to TeichAI) over roughly 80 hours. The entries are distilled model outputs rather than human-written gold labels, which makes the collection easier to scale and reproduce.
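The headline figures above imply a few derived numbers that help when budgeting a fine-tuning run. The sketch below works them out using only the totals and percentages quoted in the list. The function and constant names are illustrative, not part of the dataset or its tooling. Because the source figures are approximate ("about", "roughly"), the results are estimates.

```python
# Figures quoted in the list above; everything else is derived from them.
TOTAL_ENTRIES = 1_000_000
TOTAL_TOKENS = 5_000_000_000  # "about 5 billion", so results are estimates

# Approximate domain shares as stated; they do not sum to 100%.
DOMAIN_SHARES = {"coding": 0.50, "science": 0.20, "math": 0.15}


def avg_tokens_per_entry(total_tokens: int, total_entries: int) -> float:
    """Mean tokens per entry implied by the two totals."""
    return total_tokens / total_entries


def approx_domain_counts(shares: dict, total_entries: int) -> dict:
    """Convert fractional shares into approximate entry counts."""
    return {name: round(share * total_entries) for name, share in shares.items()}


def unattributed_share(shares: dict) -> float:
    """Fraction of entries not covered by the stated shares."""
    return round(1.0 - sum(shares.values()), 6)


if __name__ == "__main__":
    print(avg_tokens_per_entry(TOTAL_TOKENS, TOTAL_ENTRIES))
    print(approx_domain_counts(DOMAIN_SHARES, TOTAL_ENTRIES))
    print(unattributed_share(DOMAIN_SHARES))
```

The stated shares leave about 15% of entries unattributed. The text does not say how that remainder is distributed, so the sketch reports it rather than guessing.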
Who It's For and Trade-offs
Great fit if you want a large, ready-to-use supervised dataset for instruction tuning, SFT, or distillation experiments focused on coding and STEM reasoning. It is also useful as an additional supervised signal when adapting an LLM to produce chain-of-thought explanations or to improve problem-solving in math, science, and code.
Look elsewhere if you need exclusively human-validated ground truth for evaluation, unbiased benchmarks, or datasets curated for safety-critical deployment: model-generated traces can propagate the source model's mistakes, biases, and hallucinations.
Where It Fits
Use this dataset as a targeted supervised signal layered on top of standard instruction corpora or as a teacher-output pool for distillation. It is not a replacement for high-quality human annotations for evaluation, but it is a pragmatic option when scaling reasoning traces for fine-tuning experiments.
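One way to layer the reasoning traces on top of a standard instruction corpus is to keep the base corpus intact and sample enough reasoning examples to reach a target fraction of the final mix. The following is a minimal sketch under stated assumptions: examples are plain dictionaries, the `prompt` and `response` field names are hypothetical, and `mix_corpora` is an illustrative helper rather than anything shipped with the dataset. The right mixing ratio is an empirical question that this sketch does not answer.

```python
import random
from typing import Dict, List

Example = Dict[str, str]  # hypothetical schema, e.g. {"prompt": ..., "response": ...}


def mix_corpora(
    base: List[Example],
    reasoning: List[Example],
    reasoning_fraction: float,
    seed: int = 0,
) -> List[Example]:
    """Keep every base example and add sampled reasoning examples so that
    they make up about `reasoning_fraction` of the shuffled result."""
    if not 0.0 <= reasoning_fraction < 1.0:
        raise ValueError("reasoning_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_r / (n_b + n_r) = f for n_r, then cap at what is available.
    n_reasoning = round(reasoning_fraction * len(base) / (1.0 - reasoning_fraction))
    n_reasoning = min(n_reasoning, len(reasoning))
    sampled = rng.sample(reasoning, n_reasoning)
    # Tag each copy with its origin so the mix can be audited later.
    mixed = [dict(ex, source="base") for ex in base]
    mixed += [dict(ex, source="reasoning") for ex in sampled]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy data only; real corpora would be loaded from disk.
    base = [{"prompt": f"base {i}", "response": "..."} for i in range(8)]
    traces = [{"prompt": f"trace {i}", "response": "..."} for i in range(10)]
    for ex in mix_corpora(base, traces, reasoning_fraction=0.2, seed=42):
        print(ex["source"], ex["prompt"])
```

Tagging each example with a `source` field keeps the model-generated portion identifiable, which helps if traces later need to be filtered or down-weighted.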
