Most large‑model instruction-tuning efforts struggle with noisy or inconsistently formatted chain-of-thought data; that noise complicates parsing, deduplication, and SFT pipelines. This cleaned derivative tackles that friction by converting mixed answer layouts into a unified, training-ready schema while keeping the original subset structure tailored to general reasoning, graduate STEM, multilingual STEM, and math-heavy proofs.
What Sets It Apart
- SFT-ready normalization: mixed reasoning wrappers (explicit
<think>tags and ad-hoc dash wrappers) are normalized into a single output field and a conversations array, so downstream pipelines can ingest prompts and reasoning traces without ad-hoc parsing rules. This reduces preprocessing engineering for instruction tuning. - Scale with provenance: derived from Kassadin88’s one-million GLM-5.1 traces but filtered and reformatted — 766,535 records were processed and 746,321 kept after cleaning (20,214 removed for truncation, duplication, refusals, or parse errors). The dataset records per-subset statistics (token medians, file sizes) useful for planning storage and training budgets.
- Subset specialization: preserves four subsets (main, PHD-Science, Multilingual‑STEM, Math), so you can target general reasoning, graduate-level sciences, multilingual STEM examples, or math/proof-style long-form outputs independently.
Who It's For and Trade-offs
Great fit if you need a large, preprocessed corpus of model-generated reasoning traces for SFT, distillation, or evaluation of chain-of-thought capabilities — particularly when you want minimal custom parsing. It’s also useful for multilingual STEM experiments and for fine-tuning models on long-form mathematical reasoning (note the Math subset has very large output-token medians). Look elsewhere if you require human-verified ground truth reasoning (this is a model-derived teacher trace collection and may inherit hallucinations, confabulations, or style artifacts from GLM-5.1), or if you need very small, perfectly curated datasets for safety-critical deployment without additional vetting. Also plan for storage and tokenization costs: some subsets have extremely long outputs (Math median output tokens and high P95 values) which affect batching and compute.
Where It Fits
Use this dataset as a mid-to-large scale source for: instruction tuning / SFT pipelines, teacher-model distillation workflows, benchmarking LLM chain‑of‑thought behavior across domains, and targeted experiments on multilingual STEM or math reasoning. Because it keeps per-example meta (input/output token counts, teacher model), it’s straightforward to filter by length or domain before training.
