Puzzle-KD-Nemotron-Post-Training-Dataset-v2

ModelScope community. Updated 2026-01-06; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/nv-community/Puzzle-KD-Nemotron-Post-Training-Dataset-v2
Resource description:
# Puzzle-KD-Nemotron-Post-Training-Dataset-v2 Release

## Dataset Overview

The Puzzle-KD-Nemotron-Post-Training-Dataset-v2 dataset is a curated and filtered subset of NVIDIA’s [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2). The original dataset was released by NVIDIA in August 2025 as part of the NVIDIA Nemotron Nano 9B model family and was designed to improve post-training alignment for reasoning, math, code, STEM, and chat capabilities across multiple languages. It contains synthetic SFT and RL data generated by a set of open models, supporting the training of open reasoning-enabled language models.

This derived version was created specifically for the **Puzzle algorithm**, a Neural Architecture Search (NAS) framework designed to **search and prune models** to achieve smaller and faster neural networks. Within the Puzzle workflow, this dataset is used during several critical stages, including pruning, scoring, knowledge distillation, and validation. As a standardized, reproducible, English-only dataset, it supports systematic model optimization without reasoning-trace interference and remains compatible with lightweight, high-performance model objectives.

## Dataset Owner:
NVIDIA Corporation

## Dataset Creation Date:
10/15/2025

## License/Terms of Use

The dataset records the license type on a per-sample basis. It is predominantly CC-BY-4.0, with a small subset of prompts from Wildchat under an ODC-BY license and a small subset of prompts from StackOverflow under a CC-BY-SA license.
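
Because licensing is recorded per sample, it can help to tally the license labels before reusing the data. The following is a minimal sketch on toy records, not code from the dataset authors: the field name `license`, the label strings, and the helper `license_counts` are assumptions made for illustration and should be checked against the actual dataset schema.

```python
from collections import Counter

# Toy records standing in for dataset rows. The "license" field name and the
# label strings are assumptions for illustration, not the confirmed schema.
samples = [
    {"license": "cc-by-4.0"},
    {"license": "cc-by-4.0"},
    {"license": "odc-by"},
    {"license": "cc-by-sa"},
]


def license_counts(rows, field="license"):
    """Count how many rows carry each license label (hypothetical helper)."""
    return Counter(row[field] for row in rows)


counts = license_counts(samples)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

The same helper accepts any iterable of dict-like rows, so it could be pointed at a loaded split once the real field name is confirmed.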
This dataset contains synthetic data created using [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) and [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507). If this dataset is used to create, train, fine-tune, or otherwise improve an AI model that is distributed or made available, that AI model may be subject to redistribution and use requirements in the [Qwen License Agreement](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507/blob/main/LICENSE) and the [DeepSeek License Agreement](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/LICENSE).

## Dataset Creation Process

To create **Puzzle-KD-Nemotron-Post-Training-Dataset-v2**, the following modifications were applied to the original Nemotron-Post-Training-Dataset-v2:

- All relevant subsets were concatenated: `code`, `math`, `stem`, and `chat`.
- All multilingual samples (Spanish, French, German, Italian, Japanese, etc.) were filtered out, retaining only entries in English.
- Samples where the metadata field `reasoning` was set to `"on"` were removed, preserving only those with reasoning traces turned off.
- The resulting dataset was split deterministically into:
  - 95% training set
  - 5% validation set

These steps produce a streamlined dataset optimized for fine-tuning, evaluation, and multi-stage optimization processes within the Puzzle NAS pipeline.
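
The deterministic 95/5 split can be illustrated independently of the `datasets` library. The sketch below is not the authors' code: `split_indices` is a hypothetical helper that shuffles row indices with a seeded NumPy `RandomState` and cuts off the first 5% as validation. It also checks that the published sample counts (808775 train, 42568 validation) agree with a 5% validation share, under the assumption that the validation size is rounded up to a whole sample.

```python
import math

import numpy as np

# Published sample counts from the dataset card.
TRAIN_SAMPLES = 808_775
VALIDATION_SAMPLES = 42_568
total = TRAIN_SAMPLES + VALIDATION_SAMPLES

# Assumption: the validation size is 5% of the total, rounded up.
expected_validation = math.ceil(0.05 * total)


def split_indices(n, test_fraction, seed):
    """Return (train_idx, val_idx) from a seeded permutation of range(n).

    Hypothetical helper for illustration only.
    """
    rng = np.random.RandomState(seed)
    perm = rng.permutation(n)
    n_val = math.ceil(test_fraction * n)
    return perm[n_val:], perm[:n_val]


# The same seed yields the same split on every run.
train_a, val_a = split_indices(1000, 0.05, seed=408)
train_b, val_b = split_indices(1000, 0.05, seed=408)

print(total, expected_validation)
print(np.array_equal(val_a, val_b))
```

Fixing the seed is what makes the split reproducible: anyone rerunning the procedure on the same rows in the same order obtains the same train and validation membership.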
Code used to produce this dataset:

```python
import datasets
import numpy as np

# Assumption: the original snippet leaves dataset_path undefined. Here it is
# set to the Hub ID of the source dataset; a local path works as well.
dataset_path = "nvidia/Nemotron-Post-Training-Dataset-v2"

# Load the four relevant subsets and concatenate them into one dataset.
ds = datasets.load_dataset(dataset_path, split=["code", "math", "stem", "chat"])
ds = datasets.concatenate_datasets(ds)

# Filter out samples with reasoning = on
ds = ds.filter(lambda x: x["reasoning"] == "off")

# Hardcoded seed to create a deterministic train-val split
seed = 408
generator = np.random.RandomState(seed=seed)
ds = ds.train_test_split(test_size=0.05, shuffle=True, generator=generator)

# Rename the "test" split to "validation".
ds = datasets.DatasetDict({
    "train": ds["train"],
    "validation": ds["test"],
})
```

## Intended Use

The Puzzle-KD-Nemotron-Post-Training-Dataset-v2 dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate LLMs, and is also suitable for use within Neural Architecture Search or pruning algorithms such as Puzzle.

## Data Access and Loading Example

You can load the dataset directly using the Hugging Face `datasets` library as follows:

```python
from datasets import load_dataset

dataset = load_dataset("nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2")

# To access the deterministic split:
train_dataset = dataset["train"]
val_dataset = dataset["validation"]
```

## Dataset Quantification

| Subset     | Samples |
|------------|---------|
| train      | 808775  |
| validation | 42568   |

The total number of samples in the dataset is 0.85M.

Storage size: 2.74 GB

### Release Date:
11/13/2025

## Data Version
1.0 (10/16/2025)

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Data Opt-Out:

NVIDIA has undertaken legal review to ensure there is no confidential, PII or copyright materials. If, when reviewing or using this dataset, you identify issues with the data itself, such as those listed above, please contact nemotron-data@nvidia.com.

## Citation

If you found this dataset useful, please cite:

```
@software{PuzzleKDNemotronPostTraining-Datasetv2,
  author = {NVIDIA},
  title = {Puzzle-KD-Nemotron-Post-Training-Dataset-v2},
  year = {2025},
  month = nov,
  url = {https://huggingface.co/datasets/nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2},
}
```

Provider:
maas
Created:
2025-11-20