
Nemotron-SFT-Math-v3

Large-scale mathematical reasoning dataset of model-generated solution trajectories, produced with and without Python Tool-Integrated Reasoning (TIR) and kept only when the final answer matches a reference solution. The train split holds about 3.64M JSONL samples (~144 GB), licensed per source under CC-BY or CC-BY-SA. Intended for training and evaluating tool-augmented mathematical reasoning in LLMs.

Introduction

Why this matters

Nemotron-SFT-Math-v3 provides paired reasoning trajectories (chain-of-thought and Python tool-integrated runs) at scale, letting researchers and engineers directly compare how tool use changes solution structure and correctness. Rather than supplying only problems and answers, it supplies multiple verified reasoning traces per problem, which makes it a practical resource for training, probing, and benchmarking tool-augmented LLM behaviors.

What sets it apart
  • Paired reasoning regimes: each problem includes trajectories generated both with and without Python Tool-Integrated Reasoning (TIR), enabling controlled comparisons of tool use vs. pure language reasoning and supporting contrastive training objectives.
  • Verification-first pipeline: solutions were accepted only when the generated final answer matched the verified reference from Nemotron-Math-v2, reducing noisy or incorrect trajectories and increasing dataset reliability for supervised fine-tuning and evaluation.
  • Scale and provenance: the train split contains 3,638,783 JSONL samples (about 3.64M, ~144 GB). Problems are drawn from curated AoPS-style competition problems and Math StackExchange/MathOverflow posts, and each sample carries its own license (CC-BY for AoPS-derived samples; CC-BY-SA for StackExchange-derived samples).
  • Reproducible generation: all data-generation components use NVIDIA's NeMo-Skills pipeline and DeepSeek model variants (DeepSeek-V3.2-Speciale for CoT, DeepSeek-V3.2 for TIR), which makes the generation recipe auditable and reusable.
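
The verification-first idea can be illustrated with a minimal sketch. It assumes, hypothetically, that each trajectory states its final answer inside a LaTeX \boxed{...} and that a whitespace-insensitive string comparison is good enough; the actual answer checking used in the NeMo-Skills pipeline is not described here and is likely more thorough. All function names are illustrative.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Crude normalization (an assumption): drop whitespace and outer dollar signs."""
    return re.sub(r"\s+", "", answer).strip("$")


def is_verified(trajectory: str, reference: str) -> bool:
    """Accept a trajectory only if its extracted final answer equals the reference."""
    predicted = extract_boxed(trajectory)
    return predicted is not None and normalize(predicted) == normalize(reference)
```

A string match like this rejects equivalent forms such as 0.5 versus \frac{1}{2}, so a real pipeline would need symbolic or numeric comparison on top of it.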

Who it's for, and tradeoffs

Great fit if you want to: train or fine-tune LLMs to produce verifiable multi-step mathematical solutions; study differences between tool-augmented and language-only reasoning; build verification or answer-consensus pipelines. The dataset is explicitly packaged for large-scale supervised training and analysis.
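
As a rough sketch of how the two regimes might be compared, the function below groups records by problem and reports the mean solution length for each regime. The keys problem, uses_tool, and solution are hypothetical placeholders, not the dataset's documented schema, and should be swapped for the real field names.

```python
from collections import defaultdict
from statistics import mean


def mean_length_by_regime(records):
    """Map each problem to mean solution length per regime.

    Only problems that have at least one language-only ('cot') and one
    tool-integrated ('tir') trajectory are returned.
    Field names are hypothetical placeholders.
    """
    lengths = defaultdict(lambda: {"cot": [], "tir": []})
    for rec in records:
        regime = "tir" if rec["uses_tool"] else "cot"
        lengths[rec["problem"]][regime].append(len(rec["solution"]))
    return {
        problem: {regime: mean(vals) for regime, vals in by_regime.items()}
        for problem, by_regime in lengths.items()
        if by_regime["cot"] and by_regime["tir"]
    }
```

Character length is only a stand-in metric; token counts or the number of tool calls would slot into the same grouping.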

Look elsewhere if you need: human-written step-by-step proofs as primary data (the solutions here are model-generated), or a dataset under a single uniform license. Nemotron-SFT-Math-v3 mixes CC-BY and CC-BY-SA samples and documents the license per sample. Also be mindful that model-generated trajectories can reflect the biases and failure modes of the generator models despite answer verification; incorporate held-out human evaluation where safety or high-assurance correctness matters.
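
Where the license mix matters, samples can be filtered while streaming the JSONL instead of loading ~144 GB into memory. This is a minimal sketch that assumes a hypothetical per-record key named license holding strings such as "CC-BY" or "CC-BY-SA"; the real key name and value spelling should be checked against the dataset card.

```python
import json
from typing import Iterable, Iterator


def iter_by_license(lines: Iterable[str], allowed: set[str]) -> Iterator[dict]:
    """Yield parsed JSONL records whose 'license' value is in `allowed`.

    The 'license' key is an assumed field name. Records without it are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("license") in allowed:
            yield record


# Usage with an in-memory sample; a file object opened on a shard works the same way.
sample = [
    '{"id": 1, "license": "CC-BY"}',
    '{"id": 2, "license": "CC-BY-SA"}',
]
cc_by_only = list(iter_by_license(sample, {"CC-BY"}))
```

Because the function is a generator over lines, memory use stays bounded by the size of a single record.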

Additional notes

The dataset was created on 2026-01-23 and last modified on 2026-04-27 (formatting fixes). It is available on Hugging Face and marked ready for commercial use under the documented per-source licenses. Related artifacts include the Nemotron-Math-v2 problem set, the OpenMathReasoning extraction, and the NeMo-Skills generation tooling.

Information

  • Website: huggingface.co
  • Authors: NVIDIA Corporation
  • Published date: 2026/01/23

Categories