KIMI-K2.5-1000000x

Provides 1,000,000 model-generated chain-of-thought traces and instruction–response pairs for fine-tuning and distilled supervision. Focused splits (coding, PHD-Science, General-Math, MultilingualSTEM), ~5B tokens, Apache-2.0 license.

Introduction

Most public fine-tuning corpora are either small human-annotated CoT sets or huge unstructured web corpora. This dataset fills a middle niche: one million distilled reasoning traces and instruction–response examples produced for high-reasoning scenarios, with targeted coverage across coding and STEM domains.

What Sets It Apart
  • High-volume reasoning traces: 1,000,000 entries and about 5 billion tokens (roughly 5,000 tokens per entry on average), aimed at instruction tuning and SFT rather than raw pretraining. At this scale it is practical for supervised fine-tuning or distillation experiments without full-scale pretraining costs.
  • Domain-focused subsets: roughly 50% coding, 20% science, 15% math (the remaining share is not itemized), plus dedicated files for PHD-Science, General-Math (200k), and MultilingualSTEM (100k). That distribution suits models that need stronger programming and STEM reasoning.
  • Machine-distilled origin and tooling: collected with a modified Datagen pipeline (credited to TeichAI) over ~80 hours. The entries are distilled model outputs rather than human-written gold answers, which makes the set easier to scale and regenerate than a human-annotated corpus.
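
The headline figures above imply some simple arithmetic that helps when budgeting a fine-tuning run. The sketch below is only a back-of-the-envelope calculation from the approximate numbers quoted in this description; the function names are illustrative, and the real per-file counts may differ.

```python
# Figures quoted in the description above; all are approximate.
TOTAL_EXAMPLES = 1_000_000
TOTAL_TOKENS = 5_000_000_000  # "about 5 billion tokens"

# Stated approximate shares. They sum to 85%; the description does not
# itemize the rest, so it is tracked here as "unspecified".
SHARES = {"coding": 0.50, "science": 0.20, "math": 0.15}


def estimate_counts(total, shares):
    """Return approximate example counts per domain, plus the unitemized rest."""
    counts = {name: round(total * share) for name, share in shares.items()}
    counts["unspecified"] = total - sum(counts.values())
    return counts


def mean_tokens_per_example(total_tokens, total_examples):
    """Average tokens per example implied by the headline totals."""
    return total_tokens / total_examples


if __name__ == "__main__":
    for domain, n in estimate_counts(TOTAL_EXAMPLES, SHARES).items():
        print(f"{domain:>12}: ~{n:,} examples")
    mean = mean_tokens_per_example(TOTAL_TOKENS, TOTAL_EXAMPLES)
    print(f"mean length: ~{mean:,.0f} tokens per example")
```

An average of several thousand tokens per example suggests long reasoning traces, so context-length limits and truncation settings are worth checking before training.
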

Who It's For and Trade-offs

Great fit if you want a large, ready-to-use supervised dataset for instruction tuning, SFT, or distillation experiments focused on coding and STEM reasoning. It also works as extra supervised signal when adapting an LLM to produce chain-of-thought explanations or to solve math, science, and code problems more reliably.

Look elsewhere if you need exclusively human-validated ground truth for evaluation, unbiased benchmarks, or datasets curated for safety-critical deployment. Model-generated traces can propagate the source model's mistakes, biases, and hallucinations.

Where It Fits

Use this dataset as a targeted supervised signal layered on top of standard instruction corpora or as a teacher-output pool for distillation. It is not a replacement for high-quality human annotations for evaluation, but it is a pragmatic option when scaling reasoning traces for fine-tuning experiments.
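
One way to layer reasoning traces on top of a standard instruction corpus is to sample the two pools at a fixed ratio. The following is a minimal sketch under stated assumptions: `mix_pools`, its parameters, and the toy records are hypothetical and not part of the dataset or its tooling, and the right ratio for a given model is an empirical question.

```python
import random


def mix_pools(base, reasoning, reasoning_fraction, size, seed=0):
    """Sample a training mix of `size` items with a target share of reasoning traces.

    Hypothetical helper. Sampling is without replacement, so each pool
    must hold at least as many items as are drawn from it.
    """
    if not 0.0 <= reasoning_fraction <= 1.0:
        raise ValueError("reasoning_fraction must be in [0, 1]")
    n_reasoning = round(size * reasoning_fraction)
    n_base = size - n_reasoning
    if n_reasoning > len(reasoning) or n_base > len(base):
        raise ValueError("a pool is too small for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(base, n_base) + rng.sample(reasoning, n_reasoning)
    rng.shuffle(mixed)  # avoid training on one pool and then the other
    return mixed


if __name__ == "__main__":
    # Toy stand-ins for real records; the dataset's actual schema is not assumed.
    base_pool = [{"source": "base", "id": i} for i in range(100)]
    trace_pool = [{"source": "reasoning", "id": i} for i in range(100)]
    batch = mix_pools(base_pool, trace_pool, reasoning_fraction=0.3, size=10)
    print(sum(1 for r in batch if r["source"] == "reasoning"), "of", len(batch))
```

Keeping the ratio as an explicit parameter makes it easy to compare runs with different amounts of model-generated data against a human-annotated evaluation set.
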
