nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2

Hugging Face. Updated 2025-11-16; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2
Resource description:
---
license: cc-by-4.0
language:
- en
size_categories:
- 100K<n<1M
---

# Puzzle-KD-Nemotron-Post-Training-Dataset-v2 Release

## Dataset Overview

The Puzzle-KD-Nemotron-Post-Training-Dataset-v2 dataset is a curated and filtered subset of NVIDIA’s [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

The original dataset was released by NVIDIA in August 2025 as part of the NVIDIA Nemotron Nano 9B model family and was designed to improve post-training alignment for reasoning, math, code, STEM, and chat capabilities across multiple languages. It contains synthetic SFT and RL data generated by a set of open models, supporting the training of open reasoning-enabled language models.

This derived version was created specifically for use in the **Puzzle algorithm**, a Neural Architecture Search (NAS) framework designed to **search and prune models** to achieve smaller and faster neural networks. Within the Puzzle workflow, this dataset is used during several critical stages, including pruning, scoring, knowledge distillation, and validation.

Because it is a standardized, reproducible, English-only dataset, it supports systematic model optimization in Puzzle without reasoning-trace interference, ensuring compatibility with lightweight, high-performance model objectives.

## Dataset Owner:
NVIDIA Corporation

## Dataset Creation Date:
10/15/2025

## License/Terms of Use

The dataset contains information about license type on a per-sample basis. The dataset is predominantly CC-BY-4.0, with a small subset of prompts from Wildchat having an ODC-BY license and a small subset of prompts from StackOverflow having a CC-BY-SA license.
This dataset contains synthetic data created using [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528), [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) and [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507). If this dataset is used to create, train, fine-tune, or otherwise improve an AI model, which is distributed or made available, such AI model may be subject to redistribution and use requirements in the [Qwen License Agreement](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507/blob/main/LICENSE) and the [DeepSeek License Agreement](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/LICENSE).

## Dataset Creation Process

To create **Puzzle-KD-Nemotron-Post-Training-Dataset-v2**, the following modifications were applied to the original Nemotron-Post-Training-Dataset-v2:

- All relevant subsets were concatenated: `code`, `math`, `stem`, and `chat`.
- All multilingual samples (Spanish, French, German, Italian, Japanese, etc.) were filtered out, retaining only entries in English.
- Samples where the metadata field `reasoning` was set to `"on"` were removed, preserving only those with reasoning traces turned off.
- The resulting dataset was split deterministically into:
  - 95% training set
  - 5% validation set

These steps produce a streamlined dataset optimized for fine-tuning, evaluation, and multi-stage optimization processes within the Puzzle NAS pipeline.
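
As a quick sanity check on the 95% / 5% split, the sketch below compares it with the sample counts reported in the Dataset Quantification section of this card (808775 train, 42568 validation). It is only an illustration, not part of the dataset's tooling: the helper `expected_split_sizes` is a hypothetical name, and the rule that the validation count is rounded up (with the remainder going to train) is an assumption about how the split was computed.

```python
import math

# Sample counts as reported in the Dataset Quantification table of this card.
REPORTED_TRAIN = 808_775
REPORTED_VALIDATION = 42_568
VALIDATION_FRACTION = 0.05


def expected_split_sizes(total: int, test_fraction: float) -> tuple[int, int]:
    """Hypothetical helper: return (n_train, n_test) for a fractional split.

    Assumption: the held-out count is rounded up and the rest goes to train.
    """
    n_test = math.ceil(total * test_fraction)
    return total - n_test, n_test


total = REPORTED_TRAIN + REPORTED_VALIDATION
n_train, n_val = expected_split_sizes(total, VALIDATION_FRACTION)

print(f"total samples:        {total}")
print(f"expected train/val:   {n_train} / {n_val}")
print(f"reported train/val:   {REPORTED_TRAIN} / {REPORTED_VALIDATION}")
print(f"validation fraction:  {REPORTED_VALIDATION / total:.4%}")
```

Under that rounding assumption, the expected sizes agree with the reported counts, which is consistent with a single 5% hold-out taken from the filtered data.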

Code used to produce this dataset:

```python
import datasets
import numpy as np

# Assumption: `dataset_path` is the Hub ID of the original dataset linked above.
dataset_path = "nvidia/Nemotron-Post-Training-Dataset-v2"

# Load the four relevant subsets and concatenate them.
ds = datasets.load_dataset(dataset_path, split=["code", "math", "stem", "chat"])
ds = datasets.concatenate_datasets(ds)

# Keep only samples with reasoning traces turned off.
ds = ds.filter(lambda x: x["reasoning"] == "off")

# Fixed seed so the train/validation split is deterministic.
seed = 408
generator = np.random.RandomState(seed=seed)
ds = ds.train_test_split(test_size=0.05, shuffle=True, generator=generator)
ds = datasets.DatasetDict({
    "train": ds["train"],
    "validation": ds["test"],
})
```

## Intended Use

The Puzzle-KD-Nemotron-Post-Training-Dataset-v2 dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate LLMs, and is also suitable for use within Neural Architecture Search or pruning algorithms such as Puzzle.

## Data Access and Loading Example

You can load the dataset directly using the Hugging Face `datasets` library as follows:

```python
from datasets import load_dataset

dataset = load_dataset("nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2")

# To access the deterministic split:
train_dataset = dataset["train"]
val_dataset = dataset["validation"]
```

## Dataset Quantification

| Subset     | Samples |
|------------|---------|
| train      | 808775  |
| validation | 42568   |

The total number of samples in the dataset is approximately 0.85M.

Storage size: 2.74 GB

### Release Date: <br>
11/13/2025 <br>

## Data Version
1.0 (10/16/2025)

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Data Opt-Out:

NVIDIA has undertaken legal review to ensure there is no confidential, PII or copyright materials. If, when reviewing or using this dataset, you identify issues with the data itself, such as those listed above, please contact nemotron-data@nvidia.com.

## Citation

If you found this dataset useful, please cite:

```
@software{PuzzleKDNemotronPostTraining-Datasetv2,
  author = {NVIDIA},
  title = {Puzzle-KD-Nemotron-Post-Training-Dataset-v2},
  year = {2025},
  month = nov,
  url={https://huggingface.co/datasets/nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2},
}
```
Provider:
nvidia