Nemotron-Math-Proofs-v1
收藏魔搭社区2026-01-02 更新2025-12-27 收录
下载链接:
https://modelscope.cn/datasets/nv-community/Nemotron-Math-Proofs-v1
下载链接
链接失效反馈官方服务:
资源简介:
## Dataset Description:
Nemotron-Math-Proofs-v1 is a large-scale mathematical reasoning dataset containing ~580k natural language proof problems, ~550k formalizations into theorem statements in Lean 4, and ~900k model-generated reasoning trajectories culminating in Lean 4 proofs. The dataset integrates human-authored problems with systematically generated formalizations and solution traces.
Each natural language problem is formalized by [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) into Lean 4 theorem statements with placeholders for the proof. Reasoning traces and proofs are attempted by [Goedel-Prover-V2-32B](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B). Proofs are verified by the Lean 4 compiler, with errors passed back and self-correction attempted.
This dataset is ready for commercial use.
## Dataset Owner(s):
NVIDIA Corporation
## Dataset Creation Date:
Created on: Dec 3, 2025
Last Modified on: Dec 15, 2025
## License/Terms of Use:
This dataset is governed by the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
## Intended Usage:
This dataset is intended for:
* Training LLMs to perform structured mathematical reasoning, in particular generation of Lean 4 formal proofs.
* Building long-context or multi-trajectory reasoning systems
* Research on reasoning modes, error patterns, and verification pipelines
## Dataset Characterization
### Dataset Composition and Generation
#### Natural language problem sourcing
We collected problems from the AoPS community, Math StackExchange, and MathOverflow, filtering for proof based problems and those that can be formulated as theorems. See [our paper](https://arxiv.org/abs/2512.15489) for more details on the problem sources and pipeline to extract and prepare natural language problems. We semantically deduplicated all problems and decontaminated them with popular benchmarks. This removed ~40% of the original dataset and resulted in ~580k natural language theorems.
#### Autoformalization
The natural language theorems were put into an autoformalization pipeline powered by [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) to obtain formalized statements in Lean. The autoformalization pipeline consists of the following stages:
* Initial Formalization: The LLM is prompted to translate a natural language theorem statement into Lean 4 formal code. The generated code is extracted and cleaned (removing comments, organizing imports, handling sorry placeholders).
* Lean 4 Compilation Check: The extracted code is sent to a Lean 4 sandbox for compilation. This verifies syntactic correctness and type-checking.
* Backtranslation & Semantic Verification: If compilation succeeds, the formal code is backtranslated to natural language by a second LLM call. A judge prompt then compares this backtranslation to the original problem statement to verify the formalization preserves the intended meaning.
* Iterative Refinement Loop: When any step fails, the pipeline constructs an error-specific refinement prompt:
* Parsing errors: prompt to fix code extraction issues
* Compilation errors: prompt including the Lean error message
* Semantic inconsistency: prompt including the judge's reasoning
* The LLM then attempts to fix the code, and the process repeats (up to 8 iterations).
* Adaptive Reasoning: Automatically reduces the reasoning effort level (high → medium → low) when generation fails due to context length limits, allowing the pipeline to recover from token budget exhaustion.
* Faithfulness Filtering: An additional round of final Faithfulness Filtering is applied, where an LLM-as-judge is prompted to assess the faithfulness of the formalization to the original natural language statement. A statement judged innacurate in any of three passes was removed.
This resulted in ~550k formalized theorem statements.
#### Automated theorem proving
Given the theorem statements, we ran large scale inference using [Goedel-Prover-V2-32B](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) as a prover, in order to obtain reasoning traces and compilable blocks of Lean code representing formally checked solutions. Along with reasoning through the theorems, planning Lean writing, and generating compilable Lean code, the language model is also able to respond to compiler feedback and self-correct erroneous attempts. The prover pipeline consists of the following stages:
* Initial Proof Generation: The LLM receives a formal statement (header + informal prefix + formal statement) and generates a proof attempt. The code is extracted and merged with the original statement.
* Lean 4 Verification: The proof is compiled in the Lean 4 sandbox. Success requires: completed compilation, no `sorry` placeholders, and no failure messages.
* Error-Enriched Feedback: On failure, errors are parsed and formatted with positional markers (`<error></error>` tags) to indicate exactly where in the code the error occurred. This structured feedback is passed back to the model.
* Iterative Refinement Loop: The conversation continues with error feedback until success or the maximum number of turns is reached.
* Pass@N Sampling: The entire refinement loop can be restarted from scratch up to N times, providing multiple independent proving attempts.
During proof generation, we employ a multi-turn error correction/refinement strategy with wrong turn deletion:
* Chain-of-thought removal: After each generation, the model's chain-of-thought reasoning is stripped, keeping only the formal proof code within ```lean4``` blocks
* Wrong turn deletion: Rather than accumulating conversation history, failed attempts are discarded. Each refinement turn sees only the base prompt plus the most recent proof attempt, preventing context pollution from earlier failures
* Structured error feedback: Compiler errors are annotated with `<error></error>` tags marking their location in the source code, helping the model localize issues
* Pass@N with refinement: Each problem is attempted with `n_pass=4` where each attempt allows up to 8 refinement turns
This approach balances context efficiency (by not accumulating failed reasoning traces) with iterative improvement (by providing structured error feedback for self-correction).
#### Limitations
Below we flag some limitations remaining in the dataset that we aim to improve in future iterations:
* Difficulty balance: beyond filtering for proof problems there was no explicit attempt to normalize or balance the problems present in the dataset. There are implicit selection biases resulting from various stages (autoformalization success, proof attempt success, etc.).
* Token count/length normalization: final verification was only dependent on Lean 4 compiler containing no errors. Some included solutions contain numerous warnings pertaining to extraneous generated lines such as unused hypotheses.
* Placeholder/vacuous definition: we observed instances of the autoformalizer using placeholder trivial definitions instead of mathlib, e.g. def MyContinuous... Prop := True. This can be detected with a final check that the proof only depends on the three foundational axioms.
---
### Training Results
We compare a Qwen3-8B SFT against Goedel-Prover-V2-8B, a model specifically trained for Lean theorem proving:
| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|-------|------------------------------|-------------------------------|
| Goedel-Prover-V2-8B | 84.6% | 86.7% |
| Qwen3-8B SFT-ed on Nemotron-MathProofs-v1 | 85.3% | 90.2% |
This performance is comparable to models specifically trained for Lean theorem proving, such as DeepSeek-Prover-V2-7B, Kimina-Prover-8B, and Goedel-Prover-V2-8B.
This dataset was used as part of the SFT data for [Nemotron-Nano-3](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3). The final model achieves the following results on miniF2F:
| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|-------|------------------------------|-------------------------------|
| Nemotron-Nano-3 | 79.92% | 86.89% |
| gpt-oss-20b | 43.03% | 59.42% |
| Qwen3-30B-A3B-Thinking | 16.80% | 32.37% |
---
### Dataset Fields
| Field | Type | Description |
|-------|------|-------------|
| `uuid` | `str` | Unique identifier |
| `problem` | `str` | Natural language problem statement (always present) |
| `source` | `str` | Problem source: `"aops"` or `"mathstack"` (always present) |
| `formal_statement` | `str \| null` | Lean 4 theorem code (null for ~4% of entries) |
| `lean_header` | `str \| null` | Lean import statements and setup |
| `messages` | `list` | Verified proof attempts as chat conversations; each entry produces correct Lean 4 code (empty list if none) |
| `url` | `str \| null` | Original post URL (mathstack only) |
| `user_name` | `str \| null` | Original poster's username (mathstack only) |
| `user_url` | `str \| null` | Original poster's profile URL (mathstack only) |
| `used_in` | `list` | Reference for which Nemotron models included this sample during training |
| `tools` | `list` | Tool definition |
---
### Reproduction
This dataset was created using [NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills). For detailed instructions, prompts, and commands to reproduce evaluation numbers for Nemotron-Nano-3, see the [Nemotron-MathProofs-v1 documentation](https://nvidia-nemo.github.io/Skills/releases/nemotronmathproofs/).
**Data Collection Method**
Hybrid: Automated, Synthetic
**Data Collection Method**
Hybrid: Automated, Synthetic
## Dataset Format
Modality: Text
Format: JSONL
Structure: Text + Metadata
## Dataset Quantification
| Subset | Samples |
|--------|---------|
| train | 1,376,666 |
Total Disk Size: ~28 GB
## Reference(s):
- Agarwal, Sandhini, et al. *gpt-oss-120b & gpt-oss-20b model card.* arXiv preprint arXiv:2508.10925 (2025).
- Du, Wei, et al. *Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision.* arXiv preprint arXiv:2512.15489 (2025)
- Lin, Yong, et al. *Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.* arXiv preprint arXiv:2508.03613 (2025).
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)
## 数据集描述
Nemotron-Math-Proofs-v1 是一个大规模数学推理数据集,包含约58万个自然语言证明问题、约55万个Lean 4定理陈述形式化表示,以及约90万个以Lean 4证明收尾的模型生成推理轨迹。本数据集整合了人工编写的问题、系统生成的形式化表示与求解轨迹。
每个自然语言问题均由[gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)形式化为带有证明占位符的Lean 4定理陈述。推理轨迹与证明尝试由[Goedel-Prover-V2-32B](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B)完成。证明由Lean 4编译器验证,错误会被回传并尝试自校正。
本数据集可用于商业用途。
## 数据集所有者:
英伟达(NVIDIA)公司
## 数据集创建日期:
创建于:2025年12月3日
最后修改于:2025年12月15日
## 许可/使用条款:
本数据集受[知识共享署名-相同方式共享4.0国际许可协议(CC BY-SA 4.0,Creative Commons Attribution-ShareAlike 4.0 International License)](https://creativecommons.org/licenses/by-sa/4.0/)管辖。
## 预期用途
本数据集旨在用于:
* 训练大语言模型(Large Language Model,LLM)进行结构化数学推理,尤其是生成Lean 4形式化证明。
* 构建长上下文或多轨迹推理系统
* 开展推理模式、错误模式以及验证流水线相关研究
## 数据集特征
### 数据集组成与生成
#### 自然语言问题来源
我们从AoPS社区、数学 Stack Exchange 以及 MathOverflow 收集问题,筛选出以证明为目标且可表述为定理的问题。有关问题来源以及提取和预处理自然语言问题的流程的更多细节,请参阅[我们的论文](https://arxiv.org/abs/2512.15489)。我们对所有问题进行了语义去重,并使用主流基准数据集进行了数据净化,该步骤移除了约40%的原始数据集,最终得到约58万个自然语言定理。
#### 自动形式化
我们将自然语言定理送入由gpt-oss-120b驱动的自动形式化流水线,以获得Lean 4中的形式化陈述。自动形式化流水线包含以下阶段:
* 初始形式化:提示大语言模型将自然语言定理陈述翻译为Lean 4形式化代码。提取并清理生成的代码(移除注释、整理导入语句、处理`sorry`占位符)。
* Lean 4编译检查:将提取的代码发送至Lean 4沙箱进行编译,验证语法正确性与类型检查。
* 回译与语义验证:若编译成功,通过第二次大语言模型调用将形式化代码回译为自然语言。随后通过评判提示词比较该回译结果与原始问题陈述,验证形式化表示保留了原有的语义意图。
* 迭代优化循环:若任一阶段失败,流水线将构建针对特定错误的优化提示:
* 解析错误:提示修复代码提取问题
* 编译错误:附带Lean编译器的错误信息
* 语义不一致:附带评判模型的推理过程
* 随后大语言模型尝试修复代码,该过程重复最多8次。
* 自适应推理:当因上下文长度限制导致生成失败时,自动降低推理力度等级(高→中→低),使流水线能够从Token预算耗尽的情况中恢复。
* 忠实性过滤:执行额外的最终忠实性过滤轮次,由“大语言模型作为评判者”提示评估形式化表示与原始自然语言陈述的忠实程度。若三次评估中有任意一次判定形式化表示不准确,则移除该样本。
该流程最终得到约55万个形式化定理陈述。
#### 自动化定理证明
针对得到的定理陈述,我们使用Goedel-Prover-V2-32B作为证明器开展大规模推理,以获得推理轨迹以及可编译的Lean代码块,代表经过形式化验证的求解方案。除了对定理进行推理、规划Lean代码编写以及生成可编译的Lean代码外,该语言模型还能够响应编译器反馈并自校正错误的尝试。证明器流水线包含以下阶段:
* 初始证明生成:大语言模型接收形式化陈述(头部+非正式前缀+形式化陈述)并生成证明尝试。提取代码并与原始陈述合并。
* Lean 4验证:在Lean 4沙箱中编译证明。编译成功需要满足:完成编译、无`sorry`占位符且无失败信息。
* 带错误标注的反馈:若编译失败,解析错误并使用位置标记(`<error></error>`标签)格式化,以准确指示代码中错误发生的位置。将该结构化反馈回传给模型。
* 迭代优化循环:将错误反馈加入对话,直至编译成功或达到最大轮次限制。
* Pass@N采样:可将整个优化循环从头重启最多N次,提供多次独立的证明尝试。
在证明生成阶段,我们采用多轮错误校正/优化策略,并移除错误轨迹:
* 思维链移除:每次生成后,移除模型的思维链推理内容,仅保留lean4代码块内的形式化证明代码。
* 错误轨迹删除:不累积对话历史,丢弃失败的尝试。每次优化轮次仅向模型提供基础提示与最新的证明尝试,避免早期失败的推理轨迹污染上下文。
* 结构化错误反馈:编译器错误通过`<error></error>`标签标注其在源代码中的位置,帮助模型定位问题。
* 带优化的Pass@N采样:每个问题设置`n_pass=4`,每次尝试允许最多8次优化轮次。
该方法平衡了上下文效率(不累积失败的推理轨迹)与迭代改进(提供结构化错误反馈以实现自校正)之间的关系。
#### 局限性
以下列出了本数据集尚存的局限性,我们计划在后续迭代中加以改进:
* 难度均衡:除了筛选以证明为目标的问题外,未对数据集内的问题进行显式的归一化或均衡处理。自动形式化成功率、证明尝试成功率等各个阶段均会引入隐含的选择偏差。
* Token计数/长度归一化:最终验证仅依赖Lean 4编译器无错误。部分包含的解决方案存在大量无关生成行的警告,例如未使用的假设。
* 占位符/空定义:我们观察到自动形式化器使用占位符平凡定义而非mathlib库的情况,例如`def MyContinuous... Prop := True`。可通过最终检查仅依赖三个基础公理的证明来检测此类问题。
---
### 训练结果
我们将Qwen3-8B监督微调(SFT)模型与专门针对Lean定理证明训练的Goedel-Prover-V2-8B进行了比较:
| 模型 | pass@32(无自校正) | pass@32(带自校正) |
|-------|------------------------------|-------------------------------|
| Goedel-Prover-V2-8B | 84.6% | 86.7% |
| 在Nemotron-MathProofs-v1上进行SFT的Qwen3-8B | 85.3% | 90.2% |
该性能与专门针对Lean定理证明训练的模型(如DeepSeek-Prover-V2-7B、Kimina-Prover-8B以及Goedel-Prover-V2-8B)相当。
本数据集被用作[Nemotron-Nano-3](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3)监督微调数据的一部分。最终模型在miniF2F基准上的表现如下:
| 模型 | pass@32(无自校正) | pass@32(带自校正) |
|-------|------------------------------|-------------------------------|
| Nemotron-Nano-3 | 79.92% | 86.89% |
| gpt-oss-20b | 43.03% | 59.42% |
| Qwen3-30B-A3B-Thinking | 16.80% | 32.37% |
---
### 数据集字段
| 字段 | 类型 | 描述 |
|-------|------|-------------|
| `uuid` | `str` | 唯一标识符 |
| `problem` | `str` | 自然语言问题陈述(始终存在) |
| `source` | `str` | 问题来源:`"aops"` 或 `"mathstack"`(始终存在) |
| `formal_statement` | `str | null` | Lean 4定理代码(约4%的条目为空) |
| `lean_header` | `str | null` | Lean导入语句与设置项 |
| `messages` | `list` | 经过验证的证明尝试,以对话形式呈现;每个条目生成正确的Lean 4代码(若无则为空列表) |
| `url` | `str | null` | 原始帖子URL(仅数学 Stack Exchange 数据集包含) |
| `user_name` | `str | null` | 原始发帖者用户名(仅数学 Stack Exchange 数据集包含) |
| `user_url` | `str | null` | 原始发帖者个人资料URL(仅数学 Stack Exchange 数据集包含) |
| `used_in` | `list` | 训练期间使用该样本的Nemotron模型参考 |
| `tools` | `list` | 工具定义 |
---
### 复现方法
本数据集使用[NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills)创建。有关详细的操作步骤、提示词以及复现Nemotron-Nano-3评估结果的命令,请参阅[Nemotron-MathProofs-v1文档](https://nvidia-nemo.github.io/Skills/releases/nemotronmathproofs/)。
**数据收集方法**
混合模式:自动化、合成式
## 数据集格式
模态:文本
格式:JSONL
结构:文本 + 元数据
## 数据集量化
| 子集 | 样本量 |
|--------|---------|
| 训练集 | 1,376,666 |
总磁盘占用:约28 GB
## 参考文献:
- Agarwal, Sandhini, 等人. *gpt-oss-120b 与 gpt-oss-20b 模型卡片*. arXiv预印本 arXiv:2508.10925 (2025).
- Du, Wei, 等人. *Nemotron-Math: 基于多模式监督的数学推理高效长上下文蒸馏*. arXiv预印本 arXiv:2512.15489 (2025)
- Lin, Yong, 等人. *Goedel-prover-v2: 利用脚手架数据合成与自校正扩展形式化定理证明*. arXiv预印本 arXiv:2508.03613 (2025).
## 伦理考量:
英伟达认为可信AI是一项共同责任,我们已制定相关政策与实践,以支持各类AI应用的开发。开发者在按照我们的服务条款下载或使用本模型时,应与其内部模型团队合作,确保本模型符合相关行业与用例的要求,并解决未预见的产品误用问题。
请通过[此链接](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)提交质量、风险、安全漏洞或NVIDIA AI相关问题。
提供机构:
maas
创建时间:
2025-12-16



