
Nemotron-MIND

Source: ModelScope community. Updated 2025-12-04. Indexed 2025-04-26.
Download link:
https://modelscope.cn/datasets/nv-community/Nemotron-MIND
Resource overview:
# Nemotron-MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

**Authors: Syeda Nahida Akter, Shrimai Prabhumoye, John Kamalu, Sanjeev Satheesh, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro**

[[Paper]](https://arxiv.org/abs/2410.12881) [[Blog]](https://research.nvidia.com/labs/adlr/Nemotron-MIND/)

## Dataset Description

<div align="center">
<img class="img-full" src="MIND_overview.png" width=900>
<p style="max-width: 900px; text-align: justify"> <b>Figure 1: Math Informed syNthetic Dialogue.</b> We (a) manually design prompts for seven conversational styles, (b) provide each prompt along with raw context as input to an LLM to obtain diverse synthetic conversations, (c) apply heuristic filtering to refine the generated data, and (d) observe the downstream task accuracy after continued pretraining of a 7B LLM. </p>
</div>

The Nemotron-MIND Dataset is a compilation of pretraining data that supports improvements in the math reasoning capabilities of the Nemotron5 series of models. This dataset release represents a significant advancement in openness and transparency in model development. Nemotron-MIND contains over 138 billion tokens of structured mathematical dialogues generated by [Nemotron4-340B-Instruct](https://huggingface.co/nvidia/Nemotron-4-340B-Instruct).

The data synthesis process comprises the following phases:

- **Compose Diverse Prompts:** We design seven prompt templates to guide a pretrained LLM in converting a single math text into a structured conversation. They represent different social conversational settings: (1) Debate, (2) Problem-Solving, (3) Layman-Knowall, (4) Teacher-Student, (5) Two-Professors, (6) Two-Students, and (7) Interview-Interviewee.
- **Raw Data:** We use OpenWebMath (OWM) as our base corpus: 14.7B tokens of rich, raw math content.
- **Generate Conversations at Scale:** For each document, we apply a prompt to generate a conversation. We use the [Nemotron4-340B-Instruct](https://huggingface.co/nvidia/Nemotron-4-340B-Instruct) model to generate the conversations.
- **Filter Noisy Outputs:** LLM-based scoring proved too lenient. Instead, we apply heuristic rules to remove low-quality generations and retain only coherent, detailed discussions.

Finally, we continue pretraining a 7B model on a mix of filtered conversations and raw pretraining data.

## Main Results

<div align="center">
<img class="img-full" src="MIND_results.png" width=900>
<p style="max-width: 900px; text-align: justify"> <b>Figure 2: Results of 7B LLM pretrained on Diverse Conversational Styles.</b> Continued pretraining with different conversation styles improves all reasoning tasks. </p>
</div>

**Key Takeaways:**

- Every MIND conversation style beat both the raw and rephrased baselines on reasoning tasks.
- Gains on **GSM8K** ranged from **+4.78% to +12.82%**, a large improvement in math problem solving. **MATH (+0.54% to +1.28%)** and **MMLU-STEM (+0.79% to +4.28%)** also saw consistent gains. Even **general reasoning** benchmarks improved by up to **+2%** on average across 10 tasks.
- In the 4B-token comparison, the best results came from the Longest Conversation variant, suggesting that richer, more elaborate dialogue drives stronger reasoning ability.

This dataset primarily supports pretraining LLMs from scratch and demonstrates improvement in the math capabilities of pretrained models. The MIND framework uses NeMo-Skills to synthetically generate math data from the OpenWebMath corpus, which is then used to pretrain state-of-the-art (SOTA) models.

This dataset is ready for commercial/non-commercial use.

## Dataset Owner(s):
NVIDIA Corporation

## Dataset Creation Date:
September 20, 2024

## License/Terms of Use:
Governing Terms: This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) available at https://creativecommons.org/licenses/by/4.0/legalcode.

This dataset contains data created using OpenWebMath ([https://huggingface.co/datasets/open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math)), which is made available under the ODC Attribution License ([https://opendatacommons.org/licenses/by/1-0/](https://opendatacommons.org/licenses/by/1-0/)).

**Data Developer:** NVIDIA

## Intended Usage:
The Nemotron-MIND Dataset is intended to be used by the community to pretrain LLMs with SOTA math reasoning capabilities. The data may be used for training and evaluation. <br>

## Data Version:
v1

## Dataset Characterization
- Data Collection Method: Synthetic <br>
- Labeling Method: Automated <br>

## Dataset Format
Text

## Dataset Quantification
- Record Count: 231.6M
- Feature Count: 7. The data covers seven conversational styles: (1) TWO STUDENTS, (2) TEACHER STUDENT, (3) TWO PROFESSORS, (4) DEBATE, (5) PROBLEM SOLVING, (6) LAYMAN KNOWALL, and (7) INTERVIEW.
- Total Data Storage: 827GB

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Citation
```
@inproceedings{
akter2025mind,
title={{MIND}: Math Informed syNthetic Dialogues for Pretraining {LLM}s},
author={Syeda Nahida Akter and Shrimai Prabhumoye and John Kamalu and Sanjeev Satheesh and Eric Nyberg and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=TuOTSAiHDn}
}
```
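
## Illustrative Pipeline Sketch

The synthesis phases listed under Dataset Description (compose a style prompt, generate a conversation per document, filter with heuristics) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the one-line style instructions, the `generate` callable, the choice to apply every style to every document, and the filter thresholds (`min_turns`, `min_words`) are all assumptions made for the example. The card does not specify the actual prompts or heuristic rules.

```python
# Hypothetical one-line instructions per style. The real MIND prompt
# templates are not reproduced in this card.
STYLE_INSTRUCTIONS = {
    "TWO STUDENTS": "Convert the text into a discussion between two students.",
    "TEACHER STUDENT": "Convert the text into a lesson between a teacher and a student.",
    "TWO PROFESSORS": "Convert the text into an exchange between two professors.",
    "DEBATE": "Convert the text into a debate between two participants.",
    "PROBLEM SOLVING": "Convert the text into a joint problem-solving session.",
    "LAYMAN KNOWALL": "Convert the text into a dialogue between a layman and an expert.",
    "INTERVIEW": "Convert the text into an interview about the material.",
}


def build_prompt(style: str, raw_text: str) -> str:
    """Combine one style instruction with one raw math document."""
    instruction = STYLE_INSTRUCTIONS[style]
    return f"{instruction}\n\nText:\n{raw_text}\n\nConversation:"


def count_turns(conversation: str) -> int:
    """Count lines shaped like 'Speaker: utterance' with a non-empty utterance."""
    return sum(
        1
        for line in conversation.splitlines()
        if ":" in line and line.split(":", 1)[1].strip()
    )


def keep_conversation(conversation: str, min_turns: int = 4, min_words: int = 50) -> bool:
    """Assumed heuristic filter: drop generations that are too short or not dialogue-like."""
    return count_turns(conversation) >= min_turns and len(conversation.split()) >= min_words


def synthesize(documents, generate, styles=tuple(STYLE_INSTRUCTIONS)):
    """Yield (style, conversation) pairs that pass the filter.

    `generate` is any callable mapping a prompt string to model output,
    for example a thin wrapper around an LLM inference endpoint.
    """
    for doc in documents:
        for style in styles:
            conversation = generate(build_prompt(style, doc))
            if keep_conversation(conversation):
                yield style, conversation
```

In practice `generate` would wrap a call to the generation model, and the surviving conversations would be mixed with raw pretraining data before continued pretraining. Keeping the filter as a plain function makes it easy to swap in different rules without touching the generation loop.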

Provider:
maas
Created:
2025-04-25