OpenMathReasoning

Name: OpenMathReasoning
Creator: maas
Published: 2026-05-16 20:37:23
License: No description available

ModelScope community. Updated 2026-05-16; indexed 2025-04-26.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/OpenMathReasoning

Resource description:

# OpenMathReasoning

OpenMathReasoning is a large-scale math reasoning dataset for training large language models (LLMs). This dataset contains

* 306K unique mathematical problems sourced from [AoPS forums](https://artofproblemsolving.com/community) with:
  * 3.2M long chain-of-thought (CoT) solutions
  * 1.7M long tool-integrated reasoning (TIR) solutions
  * 566K samples that select the most promising solution out of many candidates (GenSelect)
* Additional 193K problems sourced from AoPS forums (problems only, no solutions)

We used [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) to preprocess problems, and [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) and [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) to generate solutions.

This dataset was a foundation of our winning submission to the [AIMO-2 Kaggle competition](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard).

See our [paper](https://arxiv.org/abs/2504.16891) to learn more details!

**_NOTE:_** We initially reported 540K unique problems in our dataset, but that figure was the question count at the start of the pipeline. The CoT and TIR solutions in the released dataset correspond to 306K problems. Since our OpenMath-Nemotron models were trained on this reduced subset, all published results remain reproducible with the current release; only our initial problem count was overstated.

Two factors explain this discrepancy in question numbers.

- Our filtering process removed many questions due to format restrictions (yes/no questions, multiple choice questions, etc.), benchmark decontamination, and the inability of existing LLMs to generate valid solutions for certain problems.
- A pipeline bug caused us to lose 137K proof-based questions. When we recovered and included this additional data in training, the SFT performance regressed. We are currently testing different approaches for incorporating these recovered questions and will release their solutions only if we identify clear performance improvements.

**_NOTE:_** An early version of this data was released separately in [Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset).

## Dataset fields

The OpenMathReasoning dataset contains the following fields:

- **problem**: Problem statement extracted from [AoPS forums](https://artofproblemsolving.com/community) and refined with [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)
- **generated_solution**: Synthetically generated solution using either [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) or [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B)
- **generation_model**: DeepSeek-R1 or QwQ-32B
- **problem_type**: One of "has_answer_extracted", "no_answer_extracted", or "converted_proof", depending on whether we were able to extract the answer or whether this is a proof question converted to an answer question.
- **expected_answer**: Extracted answer if "problem_type" is "has_answer_extracted". Otherwise this is the majority-voting answer across all generated solutions for this problem.
- **problem_source**: States the corresponding AoPS forum (e.g. "aops_c6_high_school_olympiads") or "MATH_training_set", as we also include a small set of generations from [MATH](https://github.com/hendrycks/math).
- **inference_mode**: "cot", "tir" or "genselect"
- **pass_rate_72b_tir**: Pass rate out of 32 generations for [Qwen2.5-Math-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) run in TIR mode. This attribute is only available when "problem_type" is "has_answer_extracted" and is set to "n/a" for other cases.
- **used_in_kaggle**: Whether the instance was used to train the winning model for the [AIMO-2 Kaggle competition](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard). We used 2.2M CoT and 15K TIR solutions to train the [OpenMath-Nemotron-14B-Kaggle](https://huggingface.co/nvidia/OpenMath-Nemotron-14B-Kaggle) model. Note that for training the OpenMath-Nemotron models, we used all the CoT, TIR, and GenSelect data, except for the TIR subset used in Kaggle.

## OpenMath-Nemotron models

To demonstrate the quality of this dataset, we release a series of OpenMath-Nemotron models trained on this data.

* [OpenMath-Nemotron-1.5B](https://huggingface.co/nvidia/OpenMath-Nemotron-1.5B)
* [OpenMath-Nemotron-7B](https://huggingface.co/nvidia/OpenMath-Nemotron-7B)
* [OpenMath-Nemotron-14B](https://huggingface.co/nvidia/OpenMath-Nemotron-14B)
* [OpenMath-Nemotron-14B-Kaggle](https://huggingface.co/nvidia/OpenMath-Nemotron-14B-Kaggle) (this is the model used in [AIMO-2 Kaggle competition](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard))
* [OpenMath-Nemotron-32B](https://huggingface.co/nvidia/OpenMath-Nemotron-32B)

![Evaluation Results](./results.png)

The models achieve state-of-the-art results on popular mathematical benchmarks. We present metrics as pass@1 (maj@64), where pass@1 is the average accuracy across 64 generations and maj@64 is the result of majority voting over those generations. Please see our [paper](https://arxiv.org/abs/2504.16891) for more details on the evaluation setup.
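
The two metrics described above can be illustrated with a minimal sketch. This is not the evaluation code from the paper: the function names are hypothetical, answers are compared by exact string match (an assumption; a real setup would likely need math-aware answer comparison), ties in the vote fall to whichever answer appeared first, and the toy data uses 4 generations per problem instead of 64.

```python
from collections import Counter


def pass_at_1(answers, expected):
    """Fraction of sampled answers that equal the expected answer."""
    return sum(a == expected for a in answers) / len(answers)


def maj_at_k(answers, expected):
    """1.0 if the most frequent sampled answer equals the expected one, else 0.0."""
    # Counter.most_common breaks ties by first occurrence (an assumption of this sketch).
    top_answer, _ = Counter(answers).most_common(1)[0]
    return 1.0 if top_answer == expected else 0.0


def benchmark_scores(problems):
    """Average both metrics over (answers, expected) pairs, as percentages."""
    n = len(problems)
    p1 = 100.0 * sum(pass_at_1(a, e) for a, e in problems) / n
    mj = 100.0 * sum(maj_at_k(a, e) for a, e in problems) / n
    return p1, mj


# Hypothetical toy data: two problems, four generations each.
toy = [
    (["42", "42", "7", "42"], "42"),  # 3 of 4 correct, majority vote correct
    (["1", "2", "2", "3"], "1"),      # 1 of 4 correct, majority vote wrong
]
print(benchmark_scores(toy))
```

In this framing a table cell such as "61.6 (80.0)" reads as the pass@1 percentage followed by the maj@64 percentage in parentheses.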

| Model | AIME24 | AIME25 | HMMT-24-25 | HLE-Math |
|-------|--------|--------|------------|----------|
| DeepSeek-R1-Distill-Qwen-1.5B | 26.8 (60.0) | 21.4 (36.7) | 14.2 (26.5) | 2.9 (5.0) |
| [OpenMath-Nemotron-1.5B](https://huggingface.co/nvidia/OpenMath-Nemotron-1.5B) CoT | 61.6 (80.0) | 49.5 (66.7) | 39.9 (53.6) | 5.4 (5.4) |
| [OpenMath-Nemotron-1.5B](https://huggingface.co/nvidia/OpenMath-Nemotron-1.5B) TIR | 52.0 (83.3) | 39.7 (70.0) | 37.2 (60.7) | 2.5 (6.2) |
| + Self GenSelect | 83.3 | 70.0 | 62.2 | 7.9 |
| + 32B GenSelect | 83.3 | 70.0 | 62.8 | 8.3 |
| DeepSeek-R1-Distill-Qwen-7B | 54.4 (80.0) | 38.6 (53.3) | 30.6 (42.9) | 3.3 (5.2) |
| [OpenMath-Nemotron-7B](https://huggingface.co/nvidia/OpenMath-Nemotron-7B) CoT | 74.8 (80.0) | 61.2 (76.7) | 49.7 (57.7) | 6.6 (6.6) |
| [OpenMath-Nemotron-7B](https://huggingface.co/nvidia/OpenMath-Nemotron-7B) TIR | 72.9 (83.3) | 57.5 (76.7) | 54.6 (66.3) | 7.8 (10.8) |
| + Self GenSelect | 86.7 | 76.7 | 68.4 | 11.5 |
| + 32B GenSelect | 86.7 | 76.7 | 69.9 | 11.9 |
| DeepSeek-R1-Distill-Qwen-14B | 65.8 (80.0) | 48.4 (60.0) | 40.1 (52.0) | 4.2 (4.8) |
| [OpenMath-Nemotron-14B-MIX (kaggle)](https://huggingface.co/nvidia/OpenMath-Nemotron-14B-Kaggle) | 73.7 (86.7) | 57.9 (73.3) | 50.5 (64.8) | 5.7 (6.5) |
| [OpenMath-Nemotron-14B](https://huggingface.co/nvidia/OpenMath-Nemotron-14B) CoT | 76.3 (83.3) | 63.0 (76.7) | 52.1 (60.7) | 7.5 (7.6) |
| [OpenMath-Nemotron-14B](https://huggingface.co/nvidia/OpenMath-Nemotron-14B) TIR | 76.3 (86.7) | 61.3 (76.7) | 58.6 (70.9) | 9.5 (11.5) |
| + Self GenSelect | 86.7 | 76.7 | 72.4 | 14.1 |
| + 32B GenSelect | 90.0 | 76.7 | 71.9 | 13.7 |
| QwQ-32B | 78.1 (86.7) | 66.5 (76.7) | 55.9 (63.3) | 9.0 (9.5) |
| DeepSeek-R1-Distill-Qwen-32B | 66.9 (83.3) | 51.8 (73.3) | 39.9 (51.0) | 4.8 (6.0) |
| [OpenMath-Nemotron-32B](https://huggingface.co/nvidia/OpenMath-Nemotron-32B) CoT | 76.5 (86.7) | 62.5 (73.3) | 53.0 (59.2) | 8.3 (8.3) |
| [OpenMath-Nemotron-32B](https://huggingface.co/nvidia/OpenMath-Nemotron-32B) TIR | 78.4 (93.3) | 64.2 (76.7) | 59.7 (70.9) | 9.2 (12.5) |
| + Self GenSelect | 93.3 | 80.0 | 73.5 | 15.7 |
| DeepSeek-R1 | 79.1 (86.7) | 64.3 (73.3) | 53.0 (59.2) | 10.5 (11.4) |

## Reproducing our results

The pipeline we used to produce the data and models is fully open-sourced!

- [Code](https://github.com/NVIDIA/NeMo-Skills)
- [Models](https://huggingface.co/collections/nvidia/openmathreasoning-68072c0154a5099573d2e730)
- [Dataset](https://huggingface.co/datasets/nvidia/OpenMathReasoning)
- [Paper](https://arxiv.org/abs/2504.16891)

We provide [all instructions](https://nvidia.github.io/NeMo-Skills/openmathreasoning1/) to fully reproduce our results, including data generation.

## Citation

If you find our work useful, please consider citing us!

```bibtex
@article{moshkov2025aimo2,
  title   = {AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset},
  author  = {Ivan Moshkov and Darragh Hanley and Ivan Sorokin and Shubham Toshniwal and Christof Henkel and Benedikt Schifferer and Wei Du and Igor Gitman},
  year    = {2025},
  journal = {arXiv preprint arXiv:2504.16891}
}
```

## Dataset Owner(s):

NVIDIA Corporation

## Release Date:

04/23/2025

## Data Version

1.0 (04/23/2025)

## License/Terms of Use:

cc-by-4.0

## Intended Usage:

This dataset is intended to be used by the community to continue to improve models. The data may be freely used to train and evaluate.

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
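
As a usage note for the fields listed under "Dataset fields", the sketch below selects CoT records for problems with an extracted answer and a low TIR pass rate. It is a minimal illustration under stated assumptions: the helper names, the 0.5 threshold, and the sample records are hypothetical, and `pass_rate_72b_tir` is assumed to hold either a number (possibly as a string) or "n/a".

```python
def parse_pass_rate(value):
    """Return the pass rate as a float, or None for "n/a" and other unparsable values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def select_records(records, inference_mode="cot", max_pass_rate=0.5):
    """Keep records in the given mode with an extracted answer and a pass rate <= max_pass_rate."""
    selected = []
    for record in records:
        if record["inference_mode"] != inference_mode:
            continue
        if record["problem_type"] != "has_answer_extracted":
            continue
        rate = parse_pass_rate(record["pass_rate_72b_tir"])
        if rate is not None and rate <= max_pass_rate:
            selected.append(record)
    return selected


# Hypothetical records that use only the field names from the dataset card.
sample_records = [
    {"problem": "P1", "inference_mode": "cot",
     "problem_type": "has_answer_extracted", "pass_rate_72b_tir": "0.25"},
    {"problem": "P2", "inference_mode": "tir",
     "problem_type": "has_answer_extracted", "pass_rate_72b_tir": "0.25"},
    {"problem": "P3", "inference_mode": "cot",
     "problem_type": "no_answer_extracted", "pass_rate_72b_tir": "n/a"},
    {"problem": "P4", "inference_mode": "cot",
     "problem_type": "has_answer_extracted", "pass_rate_72b_tir": "0.875"},
]

hard_cot = select_records(sample_records)
print([r["problem"] for r in hard_cot])
```

Parsing the pass rate defensively keeps the filter from failing on the "n/a" entries that the card describes for problems without an extracted answer.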

Provider: maas

Created: 2025-04-25
