Nemotron-Math-v2
收藏魔搭社区2026-01-09 更新2025-12-27 收录
下载链接:
https://modelscope.cn/datasets/nv-community/Nemotron-Math-v2
下载链接
链接失效反馈官方服务:
资源简介:
# Nemotron-Math-v2
This repository contains the dataset accompanying the paper [Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision](https://arxiv.org/abs/2512.15489).
Code: [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills)
Documentation: [NeMo-Skills Nemotron-Math-v2 Documentation](https://nvidia-nemo.github.io/Skills/releases/nemotron-math-v2/)
## Dataset Description
Nemotron-Math-v2 is a large-scale mathematical reasoning dataset containing approximately 347K high-quality mathematical problems and 7M model-generated reasoning trajectories. The dataset integrates human-authored problem sets with systematically generated solution traces produced under multiple reasoning modes and tool-use configurations.
Each problem is solved multiple times by the [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) model under six settings (high/medium/low reasoning × with/without Python TIR). Answers are verified using an LLM-as-a-judge pipeline, and trivial or unreliable problems are removed through pass-rate filtering. Only solutions whose final answers match the verified reference are included, resulting in a challenging, clean, and high-quality dataset suitable for training and evaluating mathematical reasoning systems.
All components of the pipeline, including problem extraction and data generation, are implemented using [NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills). For detailed information, please refer to the **[official documentation](https://nvidia-nemo.github.io/Skills/releases/nemotron-math-v2/)**.
<br><br>
This dataset is ready for commercial use.
## Dataset Owner(s):
NVIDIA Corporation
## Dataset Creation Date:
Created on: Dec 3, 2025
Last Modified on: Dec 18, 2025
## License/Terms of Use:
The Math GPT-OSS AOPS dataset is governed by the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
The Math GPT-OSS StackOverflow and MathGenSelect datasets are governed by the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by/4.0/).
## Intended Usage:
This dataset is intended for:
* Training LLMs to perform structured mathematical reasoning
* Studying tool-augmented reasoning vs. pure language reasoning
* Building long-context or multi-trajectory reasoning systems
* Evaluating LLM reasoning robustness and solution diversity
* Research on reasoning modes, error patterns, and verification pipelines
### Dataset Composition and Generation
#### Problem Extraction
This dataset is constructed from AoPS and StackExchange-Math forums, but we do not use raw posts directly. Because forum threads contain discussion, commentary, and sometimes multiple or incomplete questions, we first use an LLM to perform problem extraction, isolating explicit mathematical problem statements from the original threads. Each extracted problem is then passed through a series of LLM-based classifiers to determine whether it is a proof-style question, a multiple-choice question, a binary yes/no question, or an invalid or context-dependent prompt; all such items are removed. For questions originally posed in proof format, we apply a proof-to-answer transformation that attempts to rewrite them into answer-based tasks while preserving conceptual difficulty, whereas for non-proof questions we attempt to extract the final answer from the discussion rather than the full solution. We further perform benchmark decontamination by removing problems that overlap with public math datasets. Although our pipeline includes a proof-conversion step, we ultimately discard all converted proof questions, as our goal is to retain only problems that admit clearly verifiable final answers. The final dataset therefore consists solely of nontrivial, high-quality mathematical problems.
##### AoPS Subset
The AoPS subset is derived from the [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) dataset, originally sourced from the Art of Problem Solving (AoPS) community.
**Characteristics:**
- Competition-style problems across algebra, geometry, number theory, and combinatorics
- Proof-style questions removed to ensure answer verifiability
- Difficulty filtering removes problems too easily solved by the model
- Final subset size: **~85K problems** with validated reference answers
##### StackExchange-Math Subset
The StackExchange-Math Subset consists of problems collected from [Math StackExchange](https://math.stackexchange.com/) and [MathOverflow](https://mathoverflow.net/), covering a wide range from undergraduate-level to research-oriented topics.
**Characteristics:**
- Proof-style questions filtered via an LLM classifier
- Decontaminated to avoid overlap with public benchmarks
- Difficulty filtering removes trivial items
- Final subset size: **~262K problems**
---
#### Reasoning Trace Generation
A unified pipeline is used to generate solution traces for all problems.
##### Reasoning Configurations
Each problem is solved under **six configurations**:
- Reasoning depth: high, medium, low
- Tool usage: with Python TIR, without Python TIR
##### Sampling
- **8 solutions per configuration** using different random seeds
- Temperature = 1.0, top-p = 1.0
##### Answer Verification
Reference answers are established through the following procedure:
- If a problem includes an extracted answer from the forum (AoPS, or StackExchange), the answer is retained only if at least one of the 16 high-reasoning model-generated solutions (8 with Python TIR, 8 without) produces a final answer judged consistent with it.
- If no extracted answer is available, or if all model-generated solutions disagree with the extracted answer, the reference answer is replaced with the majority vote among the 16 high-reasoning model outputs.
##### Filtering
- Problems with a pass rate above 0.8 under low-reasoning settings are removed
- Incorrect solutions are discarded via automated LLM-judge evaluation
##### Final Output
The final dataset contains approximately **7.0M filtered reasoning trajectories** (from the original ~7.5M trajectories), reflecting diverse reasoning strategies, tool interactions, and long-form solution patterns.
#### Dataset fields
OpenMathReasoning dataset contains the following fields:
- **problem**: Problem statement derived from [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning), and [Math StackExchange](https://math.stackexchange.com/) and [MathOverflow](https://mathoverflow.net/).
- **messages**: user and assistant turns in standardized messages format for LLM training.
- **expected_answer**: Extracted answer if "problem_type" is "has_answer_extracted". Otherwise this is the majority-voting answer across all generated solutions for this problem.
"changed_answer_to_majority": true, or false, this label is set to `true` only if an extracted forum answer existed and was replaced by the majority-vote answer from the high-reasoning model solutions (i.e., when all model-generated solutions disagreed with the extracted answer). Otherwise, it is `false` (including cases with no forum answer).
- **metadata**: pass rates on different reasoning regimes and tool usage (list)
- **data_source**: AoPS or StackExchange-Math
- **tool**: empty for rows without available tools, python tool definition for rows with tool available.
- **url**: ‘the hyperlink of the question
- **user_url**: ‘the hyperlink of the user
- **user_name**: user name of the questions
## Dataset Characterization
**Data Collection Method**
Hybrid: Automated, Synthetic
## Dataset Format
Modality: Text
Format: JSONL
Structure: Text + Metadata
## Reference(s):
Link to [paper](https://arxiv.org/abs/2512.15489).
BibTeX for citation:
```bibtex
@article{du2025nemotronmath,
title = {Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision},
author = {Du, Wei and Toshniwal, Shubham and Kisacanin, Branislav and Mahdavi, Sadegh and Moshkov, Ivan and Armstrong, George and Ge, Stephen and Minasyan, Edgar and Chen, Feng and Gitman, Igor},
journal = {arXiv preprint arXiv:2512.15489},
year = {2025}
}
```
## Dataset Quantification
| Subset | Samples |
|--------|---------|
| low | 1,718,159 |
| medium | 2,502,305 |
| high | 2,865,375 |
| Total | 7,085,839 |
Total Disk Size: ~143GB
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)
# Nemotron-Math-v2
本仓库配套收录了论文《Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision》(https://arxiv.org/abs/2512.15489)所使用的数据集。
代码仓库:[NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills)
文档:[NeMo-Skills Nemotron-Math-v2 官方文档](https://nvidia-nemo.github.io/Skills/releases/nemotron-math-v2/)
## 数据集描述
Nemotron-Math-v2 是一个大规模数学推理数据集,包含约34.7万道高质量数学题目与700万条模型生成的推理轨迹。该数据集整合了人工编写的题目集,以及在多种推理模式与工具使用配置下系统生成的解题轨迹。
每道题目均会由[gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)模型在六种设置下(高/中/低推理深度 × 启用/禁用Python TIR)多次求解。我们采用**大语言模型作为评判者(LLM-as-a-judge)**的流程验证答案,并通过通过率过滤移除了过于简单或不可靠的题目。仅保留最终答案与验证后的参考答案一致的解题轨迹,最终得到一个具备挑战性、清洗完备且高质量的数据集,可用于训练与评估数学推理系统。
整个数据处理流水线的所有组件,包括题目提取与数据生成,均基于[NeMo-Skills](https://github.com/NVIDIA-NeMo/Skills)实现。如需详细信息,请参阅**[官方文档](https://nvidia-nemo.github.io/Skills/releases/nemotron-math-v2/)**。
本数据集可商用。
## 数据集所有者
英伟达公司(NVIDIA Corporation)
## 数据集创建日期
创建时间:2025年12月3日
最后修改时间:2025年12月18日
## 使用许可条款
Math GPT-OSS AOPS 数据集受[知识共享署名4.0国际许可协议(CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)管辖。
Math GPT-OSS StackOverflow 与 MathGenSelect 数据集受[知识共享署名-相同方式共享4.0国际许可协议(CC BY-SA 4.0)](https://creativecommons.org/licenses/by/4.0/)管辖。
## 预期用途
本数据集适用于:
* 训练大语言模型(LLM/Large Language Model)以执行结构化数学推理
* 对比研究工具增强型推理与纯语言推理
* 构建长上下文或多轨迹推理系统
* 评估大语言模型推理鲁棒性与解法多样性
* 针对推理模式、错误模式与验证流水线开展研究
### 数据集构成与生成
#### 题目提取
本数据集源自AoPS与StackExchange-Math论坛,但未直接使用原始帖子。由于论坛线程包含讨论、评论,有时还存在多个或不完整的问题,我们首先使用大语言模型(LLM)执行题目提取,从原始线程中分离出明确的数学问题陈述。随后,每道提取出的题目会经过一系列基于大语言模型的分类器,以判断其属于证明类题目、选择题、二元正误题,或是无效/依赖上下文的提示词;所有此类题目均被移除。对于原本为证明格式的问题,我们会执行「证明转答题」转换,尝试将其重写为可答题任务,同时保留概念难度;而非证明类题目则尝试从讨论中提取最终答案,而非完整解法。我们还通过基准数据集去重流程,移除了与公开数学数据集重复的题目。尽管我们的流水线包含证明转换步骤,但最终舍弃了所有转换后的证明类题目,因为我们的目标仅保留那些可明确验证最终答案的题目。因此,最终数据集仅包含非平凡、高质量的数学题目。
##### AoPS子集
AoPS子集源自[OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning)数据集,其原始来源为数学爱好者社区(Art of Problem Solving, AoPS)。
**特点:**
- 覆盖代数、几何、数论与组合数学的竞赛类题目
- 移除证明类题目以确保答案可验证性
- 通过难度过滤移除了模型可轻易求解的题目
- 最终子集规模:**约8.5万道题目**,均带有验证后的参考答案
##### StackExchange-Math子集
StackExchange-Math子集包含从[数学StackExchange](https://math.stackexchange.com/)与[MathOverflow](https://mathoverflow.net/)收集的题目,覆盖从本科水平到研究导向的广泛主题。
**特点:**
- 通过大语言模型分类器过滤移除证明类题目
- 执行去重流程以避免与公开基准数据集重复
- 通过难度过滤移除过于简单的题目
- 最终子集规模:**约26.2万道题目**
---
#### 推理轨迹生成
我们采用统一流水线为所有题目生成解题轨迹。
##### 推理配置
每道题目会在**六种配置**下求解:
- 推理深度:高、中、低
- 工具使用:启用Python TIR、禁用Python TIR
##### 采样
- **每个配置生成8个解**,使用不同随机种子
- 温度=1.0,top-p=1.0
##### 答案验证
参考答案通过以下流程确定:
- 若题目包含从论坛(AoPS或StackExchange)提取的答案,则仅当16个高推理深度模型生成解中的至少一个(8个启用Python TIR,8个禁用)生成的最终答案被判定与其一致时,该答案才会被保留。
- 若无提取的可用答案,或所有模型生成解均与提取答案不一致,则参考答案替换为16个高推理深度模型输出中的多数投票结果。
##### 过滤
- 移除在低推理深度设置下通过率高于0.8的题目
- 通过自动化大语言模型作为评判者的评估丢弃错误解法
##### 最终输出
最终数据集包含约**700万条经过过滤的推理轨迹**(原始轨迹约750万条),涵盖多样化的推理策略、工具交互与长格式解题模式。
#### 数据集字段
本数据集包含以下字段:
- **problem**:源自[OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning)、[数学StackExchange](https://math.stackexchange.com/)与[MathOverflow](https://mathoverflow.net/)的题目陈述。
- **messages**:用于大语言模型训练的标准化对话格式的用户与助手轮次数据。
- **expected_answer**:若`problem_type`为`has_answer_extracted`,则为提取得到的答案;否则为该题目所有生成解法中的多数投票答案。
`changed_answer_to_majority`:布尔值,仅当存在提取的论坛答案且所有模型生成解均与其不一致时,该标签才会被设为`true`(即使用模型生成解的多数投票答案替换了提取的论坛答案);否则为`false`(包括无论坛答案的情况)。
- **metadata**:不同推理模式与工具使用场景下的通过率(列表形式)
- **data_source**:数据集来源,为AoPS或StackExchange-Math
- **tool**:无可用工具时为空,有可用工具时为Python工具定义
- **url**:题目超链接
- **user_url**:发布题目的用户超链接
- **user_name**:题目的发布用户名
## 数据集特征
**数据收集方式**:混合式:自动化、合成式
## 数据集格式
模态:文本
格式:JSONL
结构:文本 + 元数据
## 参考文献
论文链接:[https://arxiv.org/abs/2512.15489](https://arxiv.org/abs/2512.15489)。
引用BibTeX格式:
bibtex
@article{du2025nemotronmath,
title = {Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision},
author = {Du, Wei and Toshniwal, Shubham and Kisacanin, Branislav and Mahdavi, Sadegh and Moshkov, Ivan and Armstrong, George and Ge, Stephen and Minasyan, Edgar and Chen, Feng and Gitman, Igor},
journal = {arXiv preprint arXiv:2512.15489},
year = {2025}
}
## 数据集量化统计
| 子集类型 | 样本数量 |
|--------|---------|
| 低推理深度 | 1,718,159 |
| 中推理深度 | 2,502,305 |
| 高推理深度 | 2,865,375 |
| 总计 | 7,085,839 |
总磁盘占用:约143GB
## 伦理考量
英伟达(NVIDIA)认为可信人工智能是一项共同责任,我们已制定相关政策与实践规范,以支持各类AI应用的开发。开发者在遵循本服务条款的前提下下载或使用本数据集时,应与其内部模型团队协作,确保该模型符合相关行业与使用场景的要求,并应对可能出现的产品误用问题。
请通过[此链接](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)提交质量、风险、安全漏洞或英伟达AI相关问题反馈。
提供机构:
maas
创建时间:
2025-12-16



