Long chains of thought frequently exceed model context windows, forcing a choice between ever-larger windows and brittle heuristics. This dataset reframes the problem: it teaches models to compress completed reasoning blocks into compact "mementos" and to continue reasoning from those summaries alone, trading raw token retention for structured, reusable state.
What Sets It Apart
- Practical scale and format: 228,557 traces across math (54%), science (27%), and code (19%), with a training-ready "default" split and a "full" split that exposes sentences, block boundaries, and block summaries for analysis.
- Memento-first pipeline: sentence splitting → boundary scoring → block segmentation → summary generation → judge-guided refinement (up to two rounds). The released data reports ~6× trace-level compression (from ~10,900 block tokens to ~1,850 memento tokens per trace) and median block-level compression of ~4–6× depending on domain.
- Research-friendly annotations: each example includes block indices and iteratively refined summaries (in the "full" split), enabling re-segmentation, evaluation of summary quality, and experiments with block masking or context eviction during inference.
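The block-level annotations make it possible to simulate context eviction offline. Here is a minimal sketch, assuming each example has already been unpacked into a list of block texts and a parallel list of block summaries. The names `blocks`, `summaries`, `build_context`, and `compression_ratio` are hypothetical and are not the dataset's actual schema. Whitespace splitting stands in for a real tokenizer.

```python
from typing import List


def build_context(blocks: List[str], summaries: List[str], current: int) -> str:
    """Return the context visible while reasoning in block `current`.

    Completed blocks (index < current) are replaced by their summaries;
    the current block is kept verbatim. Later blocks are not included.
    """
    if len(blocks) != len(summaries):
        raise ValueError("blocks and summaries must be parallel lists")
    if not 0 <= current < len(blocks):
        raise IndexError("current is out of range")
    return "\n".join(summaries[:current] + [blocks[current]])


def compression_ratio(blocks: List[str], summaries: List[str]) -> float:
    """Ratio of raw block tokens to summary tokens (whitespace tokens)."""
    block_tokens = sum(len(b.split()) for b in blocks)
    summary_tokens = sum(len(s.split()) for s in summaries)
    if summary_tokens == 0:
        raise ValueError("summaries contain no tokens")
    return block_tokens / summary_tokens


# Toy data, invented for illustration only (not taken from the dataset).
blocks = [
    "Let x be the unknown. Then 2x + 3 = 11, so 2x = 8 and x = 4.",
    "Check: 2 * 4 + 3 = 11, which matches the given equation.",
]
summaries = ["Solved 2x + 3 = 11: x = 4.", "Verified x = 4."]

# While working on the second block, the first is visible only as its summary.
context = build_context(blocks, summaries, current=1)
ratio = compression_ratio(blocks, summaries)
```

The same ratio applied to the reported per-trace averages gives 10,900 / 1,850 ≈ 5.9, in line with the stated ~6× trace-level compression. A real experiment would swap in the model's tokenizer and the dataset's actual field names.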
Who it's for + tradeoffs
Great fit if you want to fine-tune or evaluate models that must reason over very long multi-step traces (SFT for memento-style generation, experiments in context compression, or studying summary quality and iterative refinement). The dataset is already formatted for datasets.load_dataset and common Python toolchains.
Look elsewhere if you need raw, uninterpreted chain-of-thought tokens (this dataset intentionally compresses and evicts block content), if your domain is outside math/science/code, or if you require human-authored gold-standard proofs rather than LLM-judged iterative summaries.
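For the `datasets.load_dataset` path mentioned above, a loading helper might look like the sketch below. It assumes that "default" and "full" are exposed as configuration names, which this description does not confirm; they could instead be split names. The repository ID is deliberately left as a placeholder because it is not given here, and `load_memento_split` is a hypothetical helper name.

```python
KNOWN_SUBSETS = ("default", "full")  # names taken from the description above


def load_memento_split(repo_id: str, subset: str = "default"):
    """Load one subset of the dataset from the Hugging Face Hub.

    `repo_id` must be supplied by the caller. Treating `subset` as a
    configuration name is an assumption, not a documented fact.
    """
    if subset not in KNOWN_SUBSETS:
        raise ValueError(f"subset must be one of {KNOWN_SUBSETS}, got {subset!r}")
    # Imported lazily so the helper can be defined without the package installed.
    from datasets import load_dataset  # pip install datasets

    # The second positional argument of load_dataset is the configuration name.
    return load_dataset(repo_id, subset)


# Usage (requires network access and the real repository ID):
# ds = load_memento_split("<org>/<dataset-name>", "full")
# print(ds)
```

If the two variants turn out to be splits rather than configurations, the call would instead pass `split=subset` as a keyword argument.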
