Nemotron-Post-Training-Dataset-v1
收藏Nemotron-Post-Training-Dataset-v1 数据集概述
数据集基本信息
- 所有者: NVIDIA Corporation
- 创建日期: 07/15/2025
- 发布日期: 7/31/2025
- 数据版本: 1 (7/31/2025)
- 许可证: Creative Commons Attribution 4.0 International License (CC BY 4.0)
- 下载大小: 203373185595 bytes
- 数据集大小: 510313687949 bytes
数据集特征
- 特征:
uuid: 字符串类型license: 字符串类型generator: 字符串类型version: 字符串类型category: 字符串类型reasoning: 字符串类型messages: 列表类型,包含role、content和tool_calls字段metadata: 字符串类型
数据分布
| 类别 | 样本数量 |
|---|---|
| chat | 746,622 |
| code | 1,896,395 |
| math | 2,044,407 |
| stem | 20,662,167 |
| tool_calling | 310,051 |
| 总计 | 25,659,642 |
数据来源与生成
- 数据收集方法: 合成
- 标注方法: 合成
- 生成模型:
- DeepSeek-R1-0528: 24,602,969 样本
- Qwen3-235B-A22B: 1,056,673 样本
推荐训练格式
- chat: 用于对话调优,输入字段代表用户的回合。
- code: 请求解释和完整代码块。
- math: 提供逐步解决方案和最终答案。
- stem: 提供详细的分步答案。
- tool_calling: 根据模型的工具调用模板格式化。
预期用途
- 用于改进开放模型,训练和评估AI代理系统、聊天机器人、RAG系统等AI应用。
伦理考虑
- NVIDIA已进行法律审查,确保数据不包含机密、个人身份信息或版权材料。
引用
bibtex @misc{bercovich2025llamanemotronefficientreasoningmodels, title={Llama-Nemotron: Efficient Reasoning Models}, author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and Michael Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk}, year={2025}, eprint={2505.00949}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2505.00949}, }




