
Nemotron-Post-Training-Dataset-v2

ModelScope community · Updated 2026-05-21 · Indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/nv-community/Nemotron-Post-Training-Dataset-v2
Resource description:
# Nemotron-Post-Training-Dataset-v2 Release

## Data Overview

This dataset extends NVIDIA’s post-training dataset releases with SFT and RL data in five target languages: Spanish, French, German, Italian, and Japanese. The data supports improvements to the math, code, general reasoning, and instruction-following capabilities of [NVIDIA-Nemotron-Nano-9B-v2-Base](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base), in support of the release of [NVIDIA-Nemotron-Nano-9B-v2-Reasoning](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2).

NVIDIA-Nemotron-Nano-9B is a family of large language models (LLMs) consisting of the [NVIDIA-Nemotron-Nano-9B-v2-Base](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base) and [NVIDIA-Nemotron-Nano-9B-v2-Reasoning](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) models. They are successors to [Nemotron-H-8B-Base-8K](https://huggingface.co/nvidia/Nemotron-H-8B-Base-8K) and [Nemotron-H-8B-Reasoning-128K](https://huggingface.co/nvidia/Nemotron-H-8B-Reasoning-128K) and were created with commercial use in mind.

The NVIDIA-Nemotron-Nano-9B-v2-Reasoning model is aligned for human chat preferences and tasks. The reasoning model supports a context length of 128K tokens.

For this latest model, NVIDIA also released a pre-training dataset: [Nemotron-Pre-Training-Dataset](https://huggingface.co/collections/nvidia/nemotron-pre-training-dataset-689d9de36f84279d83786b35)

This dataset release represents a significant step forward in openness and transparency in model development and improvement. By releasing the training set, in addition to the training technique, tools, and final model weights, NVIDIA supports both the re-creation and the improvement of our approach.

## Data distribution

| Category | Number of Samples |
|----------------|-------------|
| math | 239467 |
| code | 175000 |
| stem | 355000 |
| chat | 627720 |
| multilingual_ja | 975202 |
| multilingual_de | 1015314 |
| multilingual_it | 1016503 |
| multilingual_es | 935704 |
| multilingual_fr | 1001504 |

## Filtering the data

Users can download individual subsets of the data, using the category names from the table above as split names. The following example script downloads the code and math subsets:

```
from datasets import load_dataset

# Passing a list of split names returns one Dataset per split, in the same order.
ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2", "SFT", split=["code", "math"])
```

## Prompts

Prompts were either sourced from public and open corpora or synthetically generated. All responses were synthetically generated by public and open models.

The prompts were either extracted and then filtered for quality and complexity, or generated to meet quality and complexity requirements. Filtering included removing inconsistent prompts, prompts with answers that are easy to guess, and prompts with incorrect syntax.

## Responses

Responses were synthetically generated by a variety of models. Some prompts have responses for both the reasoning-on and reasoning-off modes, to train the model to distinguish between the two. The reasoning traces are presented only in English, not in the target language, because most of the pre-training corpus is in English.

The table below gives the aggregated counts for the models used to create this dataset. Note that multiple models may have been used to produce a single record, so the mapping is not always 1:1.

| Model | Number of Samples |
| :--- | :--- |
| [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) | 5,713,694 |
| [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | 3,928,913 |
| [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | 627,720 |
| [Qwen2.5-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-AWQ) | 1,015,314 |
| [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) | 627,720 |

## License/Terms of Use

The dataset contains information about license type on a per-sample basis. The dataset is predominantly CC-BY-4.0, with a small subset of prompts from Wildchat under an ODC-BY license and a small subset of prompts from StackOverflow under a CC-BY-SA license.

This dataset contains synthetic data created using [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-AWQ), [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) and [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507). If this dataset is used to create, train, fine-tune, or otherwise improve an AI model that is distributed or made available, that AI model may be subject to redistribution and use requirements in the [Qwen License Agreement](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507/blob/main/LICENSE) and the [DeepSeek License Agreement](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/LICENSE).

**Data Developer:** NVIDIA

### Use Case: <br>
Developers training foundation LLMs. <br>

### Release Date: <br>
8/20/2025 <br>

## Data Version

2.0 (8/20/2025)

## Intended use

The Nemotron Post-Training Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate.

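As a quick sanity check, the counts in the data distribution table and in the model table above can be summed directly. The sketch below is illustrative arithmetic only: the dictionary and variable names are made up for this example, and the numbers are copied from the two tables rather than read from the dataset itself.

```python
# Counts copied from the "Data distribution" table (hypothetical variable name).
category_counts = {
    "math": 239467,
    "code": 175000,
    "stem": 355000,
    "chat": 627720,
    "multilingual_ja": 975202,
    "multilingual_de": 1015314,
    "multilingual_it": 1016503,
    "multilingual_es": 935704,
    "multilingual_fr": 1001504,
}

# Counts copied from the model table (hypothetical variable name).
model_counts = {
    "DeepSeek-R1-0528": 5713694,
    "Qwen2.5-14B-Instruct": 3928913,
    "Qwen3-30B-A3B": 627720,
    "Qwen2.5-32B-Instruct-AWQ": 1015314,
    "Qwen3-235B-A22B": 627720,
}

# Total number of records across all categories.
total_records = sum(category_counts.values())

# Records in the five multilingual categories only.
multilingual_records = sum(
    count
    for name, count in category_counts.items()
    if name.startswith("multilingual_")
)

# Sum of the per-model counts; a record produced with several models
# is counted once per model here.
total_model_attributions = sum(model_counts.values())

print(f"total records:            {total_records:,}")
print(f"multilingual records:     {multilingual_records:,}")
print(f"per-model counts, summed: {total_model_attributions:,}")
```

The category counts sum to 6,341,414 records, of which 4,944,227 are multilingual, while the per-model counts sum to 11,913,361. The second total being larger than the first is consistent with the note that a single record may involve more than one model.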
## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Data Opt-Out:

NVIDIA has undertaken legal review to ensure the dataset contains no confidential information, PII, or copyrighted material. If, when reviewing or using this dataset, you identify issues with the data itself, such as those listed above, please contact nemotron-data@nvidia.com.

## Citation

If you found this dataset useful, please cite the dataset and the model below:

```
@software{NemotronPostTrainingDatasetV2,
  author = {Nathawani, Dhruv and Ding, Shuoyang and Lavrukhin, Vitaly and Gitman, Igor and Majumdar, Somshubra and Bakhturina, Evelina and Ginsburg, Boris and Polak Scowcroft, Jane},
  title = {{Nemotron-Post-Training-Dataset-v2}},
  version = {2.0},
  publisher = {{NVIDIA}},
  year = {2025},
  month = aug,
  url = {https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2}
}
```

```
@misc{nvidia2025nvidianemotronnano2,
  title={NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model},
  author={NVIDIA},
  year={2025},
  eprint={2508.14444},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.14444}
}
```

Provider:
maas
Created:
2025-08-21
Collected summary
Dataset introduction
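Background and challenges
Background overview
Nemotron-Post-Training-Dataset-v2 is a multilingual dataset released by NVIDIA, intended to improve the math, code, general reasoning, and instruction-following capabilities of large language models, with specific support for five target languages. The dataset contains several types of data, sourced mainly from public corpora or generated synthetically, and its use is subject to specific license agreements.
The above content was collected and summarized by 遇见数据集.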