Why This Matters
High-quality chain-of-thought traces let researchers inspect, evaluate, and fine-tune how LLMs reason, not just what they answer. This dataset exposes Claude Opus 4.7's full working (Restatement → Approach → Step-by-step derivation → Verification) plus a concise, polished lesson-style answer for each hard prompt, which makes it useful for both quantitative evaluation and qualitative analysis of reasoning behavior.
What Sets It Apart
- Model-level full workings: each sample preserves the model's raw `<think>` block (a detailed multi-step derivation) rather than a short rationale or edited summary, enabling fine-grained analysis of chain-of-thought structure and error modes. So what: you can study internal chains, token-level strategies, and where verification fails.
- Hard-problem focus and breadth: 2,405 traces drawn from theorem-style math, MMLU-level science and formal subjects, GPQA, and competition math sources. So what: it targets cases where reasoning, not retrieval, determines success, which is useful for probing capability boundaries and adversarial evaluation.
- Generation + judge pipeline: samples were generated with Claude Opus 4.7 at high effort and filtered by an LLM-as-judge, keeping only accepted traces. So what: the dataset includes synthetic but quality-gated model outputs, suitable for synthetic-data experiments and model-of-model evaluation.
- Research-friendly format: Parquet splits (train/valid/test, plus a full config) with per-sample metadata (tokens, timings, source dataset). So what: experiments are easier to reproduce at scale.
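As a rough illustration of working with the preserved `<think>` blocks, the sketch below separates a trace's working from its final answer and reports which of the four stages it mentions. It assumes each trace is a single string with the working wrapped in `<think>...</think>` and the answer following it, and that stage names appear verbatim in the working. Neither is confirmed by the dataset description. The helper names `split_trace` and `stages_present` and the sample string are made up for the example.

```python
import re

# Assumption: the working is wrapped in <think>...</think> and the
# lesson-style answer follows the closing tag in the same string.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

# Stage names taken from the dataset description; whether they appear
# verbatim inside a trace is an assumption.
STAGES = ("Restatement", "Approach", "Step-by-step derivation", "Verification")


def split_trace(text: str) -> tuple[str, str]:
    """Return (working, answer); working is empty if no <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def stages_present(working: str) -> list[str]:
    """List the stage names that occur in the working, case-insensitively."""
    lowered = working.lower()
    return [stage for stage in STAGES if stage.lower() in lowered]


# Hypothetical trace, invented for illustration only.
sample = (
    "<think>Restatement: find x with 2x = 6.\n"
    "Approach: divide both sides by 2.\n"
    "Verification: 2 * 3 = 6.</think>\n"
    "The solution is x = 3."
)
working, answer = split_trace(sample)
print(stages_present(working))
print(answer)
```

A substring check like this is deliberately crude. It is only meant to flag traces that skip a stage, such as a missing verification step, for closer manual review.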
Who It's For and Trade-offs
Great fit if you want to: analyze LLM step-by-step reasoning, train or evaluate models on chain-of-thought supervision, or build diagnostic benchmarks for math/science reasoning. Look elsewhere if you need human-verified ground truth (this dataset is synthetic and judged by an LLM), or if your use case requires unrestricted commercial use: the dataset is released for non-commercial research and is governed by Anthropic's usage policy.
Where It Fits
This dataset sits between human-authored CoT corpora (higher annotation cost, human-checked correctness) and purely synthetic rationale corpora (large but lower fidelity). It is most useful when you need sizable, model-native working traces to study reasoning patterns, perform imitation learning, or test verification/calibration methods on hard problems.
