Nemotron-Math-HumanReasoning

Name: Nemotron-Math-HumanReasoning
Creator: maas
Published: 2025-12-04 16:41:49
License: no description provided

ModelScope community; updated 2025-12-04; indexed 2025-07-19

Download link:

https://modelscope.cn/datasets/nv-community/Nemotron-Math-HumanReasoning


Resource summary:

Dataset accompanying the paper: [The Challenge of Teaching Reasoning to LLMs Without RL or Distillation](https://arxiv.org/abs/2507.09850). All experiments were conducted using [NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills).

## Dataset Description:

Nemotron-Math-HumanReasoning is a compact dataset featuring human-written solutions to math problems, designed to emulate the extended reasoning style of models like DeepSeek-R1. These solutions are authored by students with deep experience in Olympiad-level mathematics. We provide multiple versions of the solutions and report on how training with this data compares to training on model-generated solutions from QwQ-32B-Preview. We hope this release supports further research into the key elements that contribute to effective "reasoning-style" solutions. The dataset is available for non-commercial use.

We evaluate the performance of models fine-tuned on two types of training data: human-written data and reasoning data generated by QwQ-32B-Preview. Specifically, we fine-tune Qwen2.5-32B on each dataset (50 examples per dataset), including different stages of human-curated data (human_stage1 to human_stage4) and synthetic data from QwQ-32B-Preview. The resulting models are evaluated on several popular mathematical benchmarks, and the results are reported in the table below.

| Data                | AIME24        | AIME25        | HMMT-24-25    |
|---------------------|---------------|---------------|---------------|
| human_stage1        | 9.27 (33.33)  | 6.25 (20.00)  | 4.82 (11.73)  |
| human_stage2        | 6.93 (20.00)  | 5.83 (20.00)  | 4.62 (11.22)  |
| human_stage3        | 6.41 (20.00)  | 4.69 (20.00)  | 3.37 (8.16)   |
| human_stage4        | 6.41 (26.67)  | 5.26 (23.33)  | 4.05 (9.69)   |
| **qwq_32b_preview** | 28.02 (56.67) | 21.61 (40.00) | 16.80 (30.10) |

We present metrics as pass@1 (maj@64), where pass@1 is the average accuracy across 64 generations and maj@64 is the result of majority voting.
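The two metrics in the table can be sketched in a few lines of Python. This is only an illustration of the definitions given above, not the evaluation code used for the paper: the function names are hypothetical, answers are compared by exact string equality (an assumption; a real math evaluation would likely normalize or symbolically compare answers), and ties in the majority vote are broken by whichever answer was sampled first.

```python
from collections import Counter


def pass_at_1(generations, expected):
    """Fraction of sampled answers that equal the expected answer."""
    return sum(answer == expected for answer in generations) / len(generations)


def maj_at_k(generations, expected):
    """1.0 if the most frequent sampled answer equals the expected answer."""
    # Counter.most_common keeps first-seen order among equally frequent answers.
    majority_answer, _count = Counter(generations).most_common(1)[0]
    return float(majority_answer == expected)


def benchmark_scores(samples):
    """Average both metrics over problems and return them as percentages.

    samples: list of (generations, expected) pairs, one pair per problem.
    """
    n = len(samples)
    p1 = 100 * sum(pass_at_1(g, e) for g, e in samples) / n
    maj = 100 * sum(maj_at_k(g, e) for g, e in samples) / n
    return p1, maj
```

With 64 generations per problem, `benchmark_scores` would return the pair reported in each table cell as `pass@1 (maj@64)`, under the assumptions stated above.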
Please see our [paper](https://arxiv.org/abs/2507.09850) for more details on the evaluation setup.

## Dataset Owner(s):

NVIDIA Corporation

## Dataset Creation Date:

06/01/2025

## License/Terms of Use:

[cc-by-nc-4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en)

## Dataset fields:

Nemotron-Math-HumanReasoning dataset contains the following fields:

- **problem**: Math problem statement.
- **solution**: Solution to this problem. Either written by people (human) or generated with QwQ-32B-Preview (automated).
- **solution_type**: "human_stageX" (X is a number from 1 to 4) or "qwq_32b_preview". See our [paper](https://arxiv.org/abs/2507.09850) for more details.
- **expected_answer**: The ground-truth answer to the problem.

## Dataset Size

Nemotron-Math-HumanReasoning consists of 50 math problems from the [OpenMathReasoning dataset](https://huggingface.co/datasets/nvidia/OpenMathReasoning) and 200 solutions written by humans, as well as 50 extra solutions generated by QwQ-32B-Preview. The dataset size is 2.4 MB.

## Dataset Format

Text-only.

## Reference(s):

Link to [paper](https://arxiv.org/abs/2507.09850)

## Intended Usage:

Training language models.

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
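Since every record carries a `solution_type`, a natural first step before fine-tuning is to split the records into one training set per type. The sketch below assumes the records have already been loaded as a list of Python dictionaries with the four fields listed under "Dataset fields"; the loading step is left out on purpose, and `split_by_solution_type` and `demo_records` are hypothetical names with made-up placeholder content, not data from the release.

```python
from collections import defaultdict

# Field names and solution types as listed in the dataset card.
FIELDS = ("problem", "solution", "solution_type", "expected_answer")
SOLUTION_TYPES = tuple(f"human_stage{i}" for i in range(1, 5)) + ("qwq_32b_preview",)


def split_by_solution_type(records):
    """Group records into one list per solution_type, checking the schema."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        missing = [field for field in FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
        solution_type = record["solution_type"]
        if solution_type not in SOLUTION_TYPES:
            raise ValueError(f"record {index} has unknown solution_type {solution_type!r}")
        groups[solution_type].append(record)
    return dict(groups)


# Placeholder records for illustration only (not taken from the dataset).
demo_records = [
    {"problem": "What is 2 + 2?", "solution": "Adding gives 4.",
     "solution_type": "human_stage1", "expected_answer": "4"},
    {"problem": "What is 2 + 2?", "solution": "Let me think... 4.",
     "solution_type": "qwq_32b_preview", "expected_answer": "4"},
]
groups = split_by_solution_type(demo_records)
```

On the full dataset, each group would presumably hold 50 records, since the card lists 200 human-written solutions across four stages plus 50 generated ones; that per-type count is inferred from the card rather than stated in it.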


Provider:

maas

Created:

2025-07-16
