five

Nemotron-Post-Training-Dataset-v1

收藏
魔搭社区2026-05-23 更新2025-08-02 收录
下载链接:
https://modelscope.cn/datasets/nv-community/Nemotron-Post-Training-Dataset-v1
下载链接
链接失效反馈
官方服务:
资源简介:
# Nemotron-Post-Training-Dataset-v1 Release This dataset is a compilation of SFT data that supports improvements of math, code, stem, general reasoning, and tool calling capabilities of the original Llama instruct model [Llama-3.3-Nemotron-Super-49B-v1.5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5). Llama-3.3-Nemotron-Super-49B-v1.5 is an LLM which is a derivative of [Meta Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) (AKA the *reference model*). Llama-3.3-Nemotron-Super-49B-v1.5 offers a great tradeoff between model accuracy and efficiency. Efficiency (throughput) directly translates to savings. Using a novel Neural Architecture Search (NAS) approach, we greatly reduce the model’s memory footprint and enable larger workloads. This NAS approach enables the selection of a desired point in the accuracy-efficiency tradeoff. The model supports a context length of 128K. This dataset release represents a significant move forward in openness and transparency in model development and improvement. By releasing the complete training set, in addition to the training technique, tools and final model weights, NVIDIA supports both the re-creation and the improvement of our approach. This dataset is ready for commercial/non-commercial use. ## Data Distribution | Category | Value | |---------------|---------------:| | chat | 746,622 | | code | 1,896,395 | | math | 2,044,407 | | stem | 20,662,167 | | tool_calling | 310,051 | | **Total** | **25,659,642** | ## Filtering the data Users can download subsets of the data based on the metadata schema described above. Example script for downloading code and math as follows: ``` from datasets import load_dataset ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1", split=["code", "math"]) ``` ## Prompts Prompts have been sourced from either public and open corpus or synthetically generated. All responses have been synthetically generated from public and open models. The prompts were extracted, and then filtered for quality and complexity, or generated to meet quality and complexity requirements. This included filtration such as removing inconsistent prompts, prompts with answers that are easy to guess, and removing prompts with incorrect syntax. ## Responses Responses were synthetically generated by a variety of models, with some prompts containing responses for both reasoning on and off modes, to train the model to distinguish between two modes. Models that were used in the creation of this dataset: | Model | Number of Samples | |------------------|------------------:| | DeepSeek-R1-0528 | 24,602,969 | | Qwen3-235B-A22B | 1,056,673 | | **Total** | **25,659,642** | ## Recommended Training Formats The data in this dataset is provided in a raw format (e.g., a math problem, a coding challenge). For optimal performance during supervised fine-tuning, we recommend wrapping the input field with an instruction template. Below are examples of templates used in our training. ### For the chat split: The chat split is designed for conversational tuning. The input field represents the user's turn and can typically be used directly. The model's system prompt during training could be: ```text You are a helpful and friendly AI assistant. ``` <mark style="background-color: rgba(255, 255, 0, 0.3); padding:0.1em 0.2em; border-radius:3px;">It is important to note that certain prompts in the chat split are sourced externally. For these entries, the 'input' field is empty in 'messages', and users must download the required data from the original source [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)<mark> ### For the code split: To instruct the model to generate well-explained code, use a format that requests both an explanation and the code block itself: ```text Write a solution for the following programming challenge. Provide a brief explanation of your approach, followed by the complete code. {problem} ``` <mark style="background-color: rgba(255, 255, 0, 0.3); padding:0.1em 0.2em; border-radius:3px;">It is important to note that certain prompts in the code split are sourced externally. For these entries, the 'input' field is empty, and users must download the required data from the original source websites. Additional information can be found in the [OpenCodeReasoning-2 README](https://huggingface.co/datasets/nvidia/OpenCodeReasoning-2#how-to-use-it)<mark> ### For the math split: To guide the model to provide a step-by-step solution and a clearly marked final answer, use a format like this: ```text Solve the following math problem. Explain your reasoning and put the final answer in \\boxed{}. {problem} ``` ### For the stem split: For general reasoning, science, and humanities questions, a straightforward instruction is effective: ```text Read the following problem carefully and provide a detailed, step-by-step answer. {problem} ``` ### For the tool calling split: The tool-calling split covers single-turn, multi-turn, and multi-step tool-calling scenarios. The "tools" in metadata and "tool_calls" in assistant messages should be formatted according to the model's tool-calling template for training. ## Dataset Owner(s): NVIDIA Corporation ## Data Creation Date: 07/15/2025 ## License/Terms of Use This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) available at https://creativecommons.org/licenses/by/4.0/legalcode. ## Intended use The Nemotron Post-Training Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate. ## Dataset Characterization **Data Collection Method** Synthetic **Labeling Method** Synthetic ## Dataset Format Text Data ## Use Case: Developers training AI Agent systems, chatbots, RAG systems, and other AI-powered applications. ## Release Date: 7/31/2025 ## Data Version 1 (7/31/2025) ## Ethical Considerations: NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). ## Data Opt-Out: NVIDIA has undertaken legal review to ensure there is no confidential, PII or copyright materials. If, when reviewing or using this dataset, you identify issues with the data itself, such as those listed above, please contact ln-dataset@nvidia.com. ## Citation ``` If you found this dataset useful, please cite the model and dataset as per below: @misc{bercovich2025llamanemotronefficientreasoningmodels, title={Llama-Nemotron: Efficient Reasoning Models}, author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and Michael Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk}, year={2025}, eprint={2505.00949}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2505.00949}, } @software{NemotronPostTrainingDatasetV1, author = {Nathawani, Dhruv and Gitman, Igor and Majumdar, Somshubra and Bakhturina, Evelina and Sunil Mahabaleshwarkar, Ameya and and Zhang, Jian and Polak Scowcroft, Jane}, title = {{Nemotron-Post-Training-Dataset-v1}}, version = {1.0}, publisher = {{NVIDIA}}, year = {2025}, month = July, url = {https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1} } ```

# Nemotron 后训练数据集 v1 发布版 本数据集为监督微调(SFT,Supervised Fine-Tuning)数据合集,用于优化原版Llama指令模型[Llama-3.3-Nemotron-Super-49B-v1.5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)的数学、代码、理工科(STEM,Science, Technology, Engineering, Mathematics)、通用推理以及工具调用(tool calling)能力。 Llama-3.3-Nemotron-Super-49B-v1.5是一款大语言模型(LLM,Large Language Model),为[Meta Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)(亦称「参考模型」)的衍生模型。 该模型在模型精度与效率之间实现了出色的平衡。效率(吞吐量)直接对应算力成本节约。通过新颖的神经架构搜索(NAS,Neural Architecture Search)方法,我们大幅降低了模型的内存占用,支持更大规模的推理任务。该NAS方法可在精度-效率权衡空间中选取期望的平衡点。模型支持128K上下文长度。 本次数据集发布是模型开发与优化领域在开放性与透明度上的重要进步。除训练技术、工具与最终模型权重外,我们还公开了完整训练集,NVIDIA此举旨在支持社区复现并进一步优化本研究方案。本数据集可用于商业与非商业用途。 ## 数据分布 | 类别 | 样本量 | |---------------|---------------:| | chat | 746,622 | | code | 1,896,395 | | math | 2,044,407 | | stem | 20,662,167 | | tool_calling | 310,051 | | **总计** | **25,659,642** | ## 数据筛选 用户可基于上述元数据架构下载数据集子集。以下为下载代码与数学类数据的示例脚本: from datasets import load_dataset ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1", split=["code", "math"]) ## 提示词 提示词来源涵盖公开开放语料与人工合成生成两类。所有回复均由公开开放模型合成生成。 我们对提示词进行提取,并基于质量与复杂度要求进行筛选,或直接生成符合质量与复杂度标准的提示词。具体过滤规则包括移除格式不一致的提示词、答案易被猜测的提示词,以及存在语法错误的提示词。 ## 回复内容 回复由多款模型合成生成,部分提示词包含开启与关闭推理模式的两类回复,用于训练模型区分这两种模式。本数据集构建过程中使用的模型如下: | 模型名称 | 样本量 | |------------------|------------------:| | DeepSeek-R1-0528 | 24,602,969 | | Qwen3-235B-A22B | 1,056,673 | | **总计** | **25,659,642** | ## 推荐训练格式 本数据集数据以原始格式提供(例如数学题、编程挑战)。为在监督微调中获得最优性能,我们建议使用指令模板封装输入字段。以下为我们训练时使用的模板示例: ### 聊天类拆分 本拆分用于对话微调。输入字段代表用户发言,通常可直接使用。训练时的模型系统提示词可为: text You are a helpful and friendly AI assistant. 需注意,聊天类拆分中的部分提示词源自外部数据源。对于此类条目,`messages` 字段中的 `input` 为空,用户需从原始来源 [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) 下载所需数据。 ### 代码类拆分 若要指导模型生成带有详细解释的代码,可使用同时要求解题思路说明与代码块的格式: text Write a solution for the following programming challenge. Provide a brief explanation of your approach, followed by the complete code. {problem} 需注意,代码类拆分中的部分提示词源自外部数据源。对于此类条目,`input` 字段为空,用户需从原始来源网站下载所需数据。更多信息可参考 [OpenCodeReasoning-2 数据集说明文档](https://huggingface.co/datasets/nvidia/OpenCodeReasoning-2#how-to-use-it)。 ### 数学类拆分 若要引导模型提供分步解题过程与明确标注的最终答案,可使用如下格式: text Solve the following math problem. Explain your reasoning and put the final answer in \boxed{}. {problem} ### 理工科(STEM)类拆分 针对通用推理、科学与人文类问题,使用直白的指令即可取得良好效果: text Read the following problem carefully and provide a detailed, step-by-step answer. {problem} ### 工具调用类拆分 本拆分涵盖单轮、多轮及多步骤工具调用场景。元数据中的`tools`字段与助手消息中的`tool_calls`字段需按照模型的工具调用模板进行格式化,以用于训练。 ## 数据集所有者 NVIDIA公司 ## 数据创建日期 2025年7月15日 ## 许可与使用条款 本数据集采用知识共享署名4.0国际许可协议(CC BY 4.0)进行授权,许可协议详情见 https://creativecommons.org/licenses/by/4.0/legalcode。 ## 预期用途 Nemotron后训练数据集旨在供社区用于持续优化开源模型。该数据可自由用于模型训练与评估。 ## 数据集特征 **数据收集方式**:合成生成 **标注方式**:合成生成 ## 数据集格式 文本数据 ## 适用场景 开发人员用于训练AI智能体(AI Agent)系统、聊天机器人、检索增强生成(RAG,Retrieval-Augmented Generation)系统及其他人工智能应用。 ## 发布日期 2025年7月31日 ## 数据版本 1.0(2025年7月31日) ## 伦理考量 NVIDIA认为可信人工智能是一项共同责任,我们已制定相关政策与实践规范,以支持各类AI应用的开发。开发者在按照本服务条款下载或使用本数据集时,应与内部模型团队协作,确保本模型符合相关行业与应用场景的要求,并防范产品被不当使用。 请通过[此链接](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)报告安全漏洞或NVIDIA人工智能相关问题。 ## 数据移除申请 NVIDIA已完成法律审查,确保本数据集不包含机密信息、个人可识别信息(PII,Personally Identifiable Information)或受版权保护的材料。若您在审阅或使用本数据集时发现上述或其他相关问题,请联系 ln-dataset@nvidia.com。 ## 引用方式 若您认为本数据集对研究有所帮助,请按照以下格式引用该模型与数据集: @misc{bercovich2025llamanemotronefficientreasoningmodels, title={Llama-Nemotron: Efficient Reasoning Models}, author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and Michael Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk}, year={2025}, eprint={2505.00949}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2505.00949}, } @software{NemotronPostTrainingDatasetV1, author = {Nathawani, Dhruv and Gitman, Igor and Majumdar, Somshubra and Bakhturina, Evelina and Sunil Mahabaleshwarkar, Ameya and and Zhang, Jian and Polak Scowcroft, Jane}, title = {{Nemotron-Post-Training-Dataset-v1}}, version = {1.0}, publisher = {{NVIDIA}}, year = {2025}, month = July, url = {https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1} }
提供机构:
maas
创建时间:
2025-07-31
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作