Nemotron-Pretraining-Dataset-sample

Name: Nemotron-Pretraining-Dataset-sample
Creator: maas
Published: 2026-01-08 13:01:28
License: No description available

ModelScope community. Updated 2026-01-08; indexed 2025-08-23.

Download link:

https://modelscope.cn/datasets/nv-community/Nemotron-Pretraining-Dataset-sample

Resource description:

# Nemotron-Pre-Training-Dataset-v1 Release

## Data Overview

This pretraining dataset for generative AI model training preserves high-value math and code data and enriches it with diverse multilingual Q&A, fueling the next generation of intelligent, globally-capable models.

This dataset supports [NVIDIA Nemotron Nano 2](https://huggingface.co/collections/nvidia/nvidia-nemotron-689f6d6e6ead8e77dd641615), a family of large language models (LLMs) that consists of the [NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2), [NVIDIA-Nemotron-Nano-9B-v2-Base](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2), and [NVIDIA-Nemotron-Nano-12B-v2-Base](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base) models. They are successors of [Nemotron-H-8B-Base-8K](https://huggingface.co/nvidia/Nemotron-H-8B-Base-8K) and [Nemotron-H-8B-Reasoning-128K](https://huggingface.co/nvidia/Nemotron-H-8B-Reasoning-128K), created with commercial use in mind. The NVIDIA-Nemotron-Nano-9B-v2 model is aligned for human chat preferences and tasks. All of the NVIDIA Nemotron Nano 2 models support a context length of 128K tokens.

Our release consists of a small sample plus 4 main categories:

- [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample) - A small sampled version for inspection and quick experimentation, with 10 representative subsets drawn from different components of the full SFT and pretraining corpora. These include diverse QA data (original and translated), high-quality and synthetic high-quality Common Crawl extractions, math-focused subsets, code metadata, SFT-style data across code, math, and general domains, and synthetic code.
- [nvidia/Nemotron-CC-Math-v1](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1) - 133B-token high-quality math pretraining dataset from Common Crawl, built with a novel Lynx + LLM pipeline that preserves equations and code, standardizes to LaTeX, and removes noise. It beats all previous math pretraining datasets on math benchmarks and also improves on code and reasoning benchmarks. We also regenerated the Nemotron-MIND dataset using Nemotron-cc-math-4plus, our high-quality subset, which yielded consistent gains over the previous Nemotron-MIND.
- [nvidia/Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) - Updated English web crawl dataset based on Nemotron-CC with eight additional Common Crawl snapshots (2024–2025), synthetic rephrasing using Qwen3-30B-A3B, filtered for English and globally deduplicated. Includes synthetic data generated with five different prompts. The synthetic Diverse QA data has also been translated into 15 languages.
- [nvidia/Nemotron-Pretraining-Code-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v1) - Large-scale curated source code dataset from GitHub, processed through multi-stage filtering including license-based removal (BigCode-inspired, with a stricter license set), exact and fuzzy deduplication, and heuristic quality filters from OpenCoder. All files are annotated with metadata to guide filtering and improve dataset quality. Additionally, we generate large-scale code question–answer data in 11 programming languages by prompting LLMs on curated code snippets, solving the generated problems, and filtering results for correctness, producing diverse natural language–code pairs for pretraining.
- [nvidia/Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) - Diverse synthetically generated and curated SFT-style dataset spanning STEM, multilingual, academic, and reasoning domains. STEM data was expanded from high-quality math and science seeds using multi-iteration generation with Qwen3 and DeepSeek models, producing varied, harder, and multiple-choice questions with solutions. Academic QA pairs were synthesized from complex undergraduate- and graduate-level texts. Additional SFT-style data covers code, math, MMLU-style general QA, and fundamental reasoning tasks, with billions of tokens generated using DeepSeek-V3 and Qwen3 for logical, analytical, and reading comprehension questions.

## Data distribution

The token counts per data category are as follows:

| Dataset Category | Tokens Count (B) |
|------------------|------------------|
| English Common Crawl | 3359.8 |
| English Synthetic CC | 1257.3 |
| Diverse QA | 692.9 |
| Translated Diverse QA | 558.2 |
| Math | 206.2 |
| Math SFT | 190.6 |
| Synthetic Code | 174.9 |
| Code SFT | 58.5 |
| General SFT | 87.5 |
| **TOTAL** | **6585.8** |

Additionally, we release metadata to reproduce a 747.4B token curated code dataset.

## Filtering the data

Users can download subsets of the data based on the metadata schema described above.
An example script for downloading the `4plus` subset of the math dataset:

```python
from datasets import load_dataset

# Stream the "4plus" configuration rather than downloading it in full
ds = load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus", streaming=True)
```

The models used in the creation of this dataset, per category, are as follows:

**nvidia/Nemotron-CC-Math-v1**

| Model | Token Count (B) |
|-------|-----------------|
| [phi-4](https://huggingface.co/microsoft/phi-4) | 206.2 |

**nvidia/Nemotron-CC-v2**

| Model | Token Count (B) |
|-------|-----------------|
| [Mistral-Nemo-12B-Instruct](https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct) | 1629.1 |
| [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | 879.1 |
| Without using LLM | 3359.8 |

**nvidia/Nemotron-Pretraining-Code-v1**

| Model | Token Count (B) |
|-------|-----------------|
| [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) | 174.9 |

**nvidia/Nemotron-Pretraining-SFT-v1**

| Model | Token Count |
|--------|-------------|
| [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) | 100.8 B |
| [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | 59.8 B |
| [Qwen2.5-Math-72B](https://huggingface.co/Qwen/Qwen2.5-Math-72B) | 55.7 B |
| [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | 41.6 B |
| [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) | 17.6 B |
| [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | 15.6 B |
| [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | 15.2 B |
| [Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) | 7.4 B |
| [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | 7.1 B |
| [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | 4.1 B |
| Nemotron 340B | 2.1 B |
| [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) | 2.1 B |
| [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) | 2.0 B |
| Nemotron 4 340B | 2.0 B |
| [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | 1.5 B |
| [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | 1.5 B |
| [Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) | 343.9 M |
| [Qwen2.5-72B](https://huggingface.co/Qwen/Qwen2.5-72B) | 75 M |
| [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) | 31.1 M |
| Without using LLM | 3.9 M |

## License/Terms of Use

[NVIDIA Open Data License Agreement](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample/raw/main/LICENSE.md)

This dataset contains synthetic data created using the following models: DeepSeek-R1, DeepSeek-R1-0528, DeepSeek-R1-Distill-Qwen-32B, DeepSeek-V3, DeepSeek-V3-0324, Mistral-Nemo-12B-Instruct, Mixtral 8x22B, Mixtral-8x22B-v0.1, Nemotron-4-340B-Instruct, Qwen2.5-0.5B-Instruct, Qwen2.5-32B-Instruct, Qwen2.5-72B-Instruct, Qwen2.5-Coder-32B-Instruct, Qwen2.5-Math-7B-Instruct, Qwen2.5-Math-72B, Qwen3-235B-A22B, Qwen3-30B-A3B.

If this dataset is used to create, train, fine-tune, or otherwise improve an AI model that is distributed or made available, that AI model may be subject to redistribution and use requirements in the [Qwen License Agreement](https://huggingface.co/Qwen/Qwen2.5-72B/blob/main/LICENSE) and the [DeepSeek License Agreement](https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/LICENSE-MODEL).

**Data Developer:** NVIDIA

### Use Case:

Developers training foundation LLMs.

### Release Date:

8/18/2025

## Data Version

1.0 (8/18/2025)

## Intended use

The Nemotron Pre-Training Dataset is intended to be used by the community to continue to improve open models. The data may be freely used for training and evaluation, subject to the user's agreement to the open data license.
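For a quick look at the data before committing to a training run, the streaming call from the example script can be paired with a small helper that pulls only the first few records. The sketch below is illustrative rather than official: `peek` and `stream_math_subset` are hypothetical helper names, and the code assumes that the `datasets` library is installed, that the split is named `train`, and that network access to the Hub is available for the commented-out usage.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def peek(records: Iterable[Any], n: int = 3) -> List[Any]:
    """Return the first n items of any iterable, reading no more than n."""
    return list(islice(records, n))


def stream_math_subset() -> Iterator[Any]:
    """Stream the '4plus' configuration used in the example script.

    Assumption: the split is called 'train'; adjust if the repository differs.
    """
    # Imported lazily so that peek() stays usable without the library
    from datasets import load_dataset

    return iter(load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus",
                             split="train", streaming=True))


if __name__ == "__main__":
    # Offline demonstration on a stand-in generator
    print(peek(({"id": i} for i in range(10)), n=2))
    # With network access: peek(stream_math_subset(), n=3)
```

Because `peek` accepts any iterable, the same helper works on a streamed dataset, a local file reader, or a plain list in tests.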
## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Data Opt-Out:

NVIDIA has undertaken legal review to ensure there is no confidential, PII, or copyrighted material. If, when reviewing or using this dataset, you identify issues with the data itself, such as those listed above, please contact nemotron-data@nvidia.com.

## Citation & Acknowledgment

If you use our dataset in your research, please cite our [NVIDIA Nemotron Nano 2 paper](https://arxiv.org/abs/2508.14444):

```bibtex
@misc{nvidia2025nvidianemotronnano2,
      title={NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model},
      author={NVIDIA and : and Aarti Basant and Abhijit Khairnar and Abhijit Paithankar and Abhinav Khattar and Adithya Renduchintala and Aditya Malte and Akhiad Bercovich and Akshay Hazare and Alejandra Rico and Aleksander Ficek and Alex Kondratenko and Alex Shaposhnikov and Alexander Bukharin and Ali Taghibakhshi and Amelia Barton and Ameya Sunil Mahabaleshwarkar and Amy Shen and Andrew Tao and Ann Guan and Anna Shors and Anubhav Mandarwal and Arham Mehta and Arun Venkatesan and Ashton Sharabiani and Ashwath Aithal and Ashwin Poojary and Ayush Dattagupta and Balaram Buddharaju and Banghua Zhu and Barnaby Simkin and Bilal Kartal and Bita Darvish Rouhani and Bobby Chen and Boris Ginsburg and Brandon Norick and Brian Yu and Bryan Catanzaro and Charles Wang and Charlie Truong and Chetan Mungekar and Chintan Patel and Chris Alexiuk and Christian Munley and Christopher Parisien and Dan Su and Daniel Afrimi and Daniel Korzekwa and Daniel Rohrer and Daria Gitman and David Mosallanezhad and Deepak Narayanan and Dima Rekesh and Dina Yared and Dmytro Pykhtar and Dong Ahn and Duncan Riach and Eileen Long and Elliott Ning and Eric Chung and Erick Galinkin and Evelina Bakhturina and Gargi Prasad and Gerald Shen and Haifeng Qian and Haim Elisha and Harsh Sharma and Hayley Ross and Helen Ngo and Herman Sahota and Hexin Wang and Hoo Chang Shin and Hua Huang and Iain Cunningham and Igor Gitman and Ivan Moshkov and Jaehun Jung and Jan Kautz and Jane Polak Scowcroft and Jared Casper and Jian Zhang and Jiaqi Zeng and Jimmy Zhang and Jinze Xue and Jocelyn Huang and Joey Conway and John Kamalu and Jonathan Cohen and Joseph Jennings and Julien Veron Vialard and Junkeun Yi and Jupinder Parmar and Kari Briski and Katherine Cheung and Katherine Luna and Keith Wyss and Keshav Santhanam and Kezhi Kong and Krzysztof Pawelec and Kumar Anik and Kunlun Li and Kushan Ahmadian and Lawrence McAfee and Laya Sleiman and Leon Derczynski and Luis Vega and Maer Rodrigues de Melo and Makesh Narsimhan Sreedhar and Marcin Chochowski and Mark Cai and Markus Kliegl and Marta Stepniewska-Dziubinska and Matvei Novikov and Mehrzad Samadi and Meredith Price and Meriem Boubdir and Michael Boone and Michael Evans and Michal Bien and Michal Zawalski and Miguel Martinez and Mike Chrzanowski and Mohammad Shoeybi and Mostofa Patwary and Namit Dhameja and Nave Assaf and Negar Habibi and Nidhi Bhatia and Nikki Pope and Nima Tajbakhsh and Nirmal Kumar Juluru and Oleg Rybakov and Oleksii Hrinchuk and Oleksii Kuchaiev and Oluwatobi Olabiyi and Pablo Ribalta and Padmavathy Subramanian and Parth Chadha and Pavlo Molchanov and Peter Dykas and Peter Jin and Piotr Bialecki and Piotr Januszewski and Pradeep Thalasta and Prashant Gaikwad and Prasoon Varshney and Pritam Gundecha and Przemek Tredak and Rabeeh Karimi Mahabadi and Rajen Patel and Ran El-Yaniv and Ranjit Rajan and Ria Cheruvu and Rima Shahbazyan and Ritika Borkar and Ritu Gala and Roger Waleffe and Ruoxi Zhang and Russell J. Hewett and Ryan Prenger and Sahil Jain and Samuel Kriman and Sanjeev Satheesh and Saori Kaji and Sarah Yurick and Saurav Muralidharan and Sean Narenthiran and Seonmyeong Bak and Sepehr Sameni and Seungju Han and Shanmugam Ramasamy and Shaona Ghosh and Sharath Turuvekere Sreenivas and Shelby Thomas and Shizhe Diao and Shreya Gopal and Shrimai Prabhumoye and Shubham Toshniwal and Shuoyang Ding and Siddharth Singh and Siddhartha Jain and Somshubra Majumdar and Soumye Singhal and Stefania Alborghetti and Syeda Nahida Akter and Terry Kong and Tim Moon and Tomasz Hliwiak and Tomer Asida and Tony Wang and Tugrul Konuk and Twinkle Vashishth and Tyler Poon and Udi Karpas and Vahid Noroozi and Venkat Srinivasan and Vijay Korthikanti and Vikram Fugro and Vineeth Kalluru and Vitaly Kurin and Vitaly Lavrukhin and Wasi Uddin Ahmad and Wei Du and Wonmin Byeon and Ximing Lu and Xin Dong and Yashaswi Karnati and Yejin Choi and Yian Zhang and Ying Lin and Yonggan Fu and Yoshi Suhara and Zhen Dong and Zhiyu Li and Zhongbo Zhu and Zijia Chen},
      year={2025},
      eprint={2508.14444},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14444},
}
```
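As a sanity check on the Data distribution table, the per-category token counts can be summed and converted to percentage shares. The sketch below copies the figures from that table; the names `TOKENS_B`, `STATED_TOTAL_B`, and `shares` are illustrative. The categories sum to 6585.9B, which agrees with the stated total of 6585.8B to within rounding of the individual entries.

```python
from typing import Dict

# Token counts in billions, copied from the Data distribution table
TOKENS_B: Dict[str, float] = {
    "English Common Crawl": 3359.8,
    "English Synthetic CC": 1257.3,
    "Diverse QA": 692.9,
    "Translated Diverse QA": 558.2,
    "Math": 206.2,
    "Math SFT": 190.6,
    "Synthetic Code": 174.9,
    "Code SFT": 58.5,
    "General SFT": 87.5,
}
STATED_TOTAL_B = 6585.8  # the TOTAL row of the same table


def shares(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert absolute token counts into percentage shares of their sum."""
    total = sum(counts.values())
    return {name: 100.0 * value / total for name, value in counts.items()}


if __name__ == "__main__":
    total = sum(TOKENS_B.values())
    print(f"sum of categories: {total:.1f}B (stated total: {STATED_TOTAL_B}B)")
    # List categories from largest to smallest share
    for name, pct in sorted(shares(TOKENS_B).items(), key=lambda kv: -kv[1]):
        print(f"{name:<24}{pct:5.1f}%")
```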


Provider:

maas

Created:

2025-08-19
