xuejgy/Nemotron-Science-v1

Name: xuejgy/Nemotron-Science-v1
Creator: xuejgy
Published: 2026-03-17 09:05:15
License: No description available

Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/xuejgy/Nemotron-Science-v1

Resource description:

---
license: cc-by-4.0
language:
- en
configs:
- config_name: default
  data_files:
  - split: MCQ
    path: data/MCQ.jsonl
  - split: RQA
    path: data/RQA.jsonl
---

## Dataset Description:

Nemotron-Science-v1 is a synthetic science reasoning dataset with two subsets: an MCQA set that improves on the STEM portion of Nemotron-Post-Training-v1 using GPT-OSS-120B to generate GPQA-style questions and reasoning traces, and an RQA set of synthetic chemistry questions.

This dataset is ready for commercial use.

The Nemotron-Science-v1 dataset contains the following subsets:

### MCQA

This subset is an improvement of the STEM subset in [Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1) with reasoning traces generated with GPT-OSS-120B. The dataset consists of synthetic science questions designed to mimic GPQA-style topics and subtopics, generated to enhance large language model reasoning capabilities in scientific domains.

### RQA

This dataset consists of synthetic chemistry questions, generated to enhance large language model reasoning capabilities in scientific domains.

## Dataset Owner(s):

NVIDIA Corporation

## Dataset Creation Date:

Created on: Dec 3, 2025
Last Modified on: Dec 3, 2025

## License/Terms of Use:

This dataset is governed by the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

## Intended Usage:

This dataset is intended for LLM engineers and research teams developing and training large language models with a focus on improving scientific reasoning and problem-solving capabilities. It is suitable for supervised training and data augmentation in science-based model development pipelines.

## Dataset Characterization

**Data Collection Method**
Synthetic - LLM-generated scientific question and solution pairs

**Labeling Method**
Synthetic - Model-generated solutions and annotations

## Dataset Format

Modality: Text
Format: JSONL
Structure: Text + Metadata

## Dataset Quantification

| Subset | Samples |
|--------|---------|
| MCQA | 174,155 |
| RQA | 52,179 |
| Total | 226,334 |

Total Disk Size: ~2.5 GB

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer teams to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
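The card lists two JSONL splits (`MCQ` at `data/MCQ.jsonl`, `RQA` at `data/RQA.jsonl`) and a per-subset sample count. As a minimal sketch, assuming the repository has already been downloaded to a local directory, the files could be read and counted with the Python standard library alone. The helper names `iter_jsonl` and `count_samples` and the local directory layout are illustrative assumptions; the card does not document record field names, so the sketch does not rely on any.

```python
import json
from pathlib import Path

# Split names and relative paths as listed in the card's `configs` block.
SPLIT_FILES = {"MCQ": "data/MCQ.jsonl", "RQA": "data/RQA.jsonl"}

# Sample counts as reported in the card's quantification table.
REPORTED_COUNTS = {"MCQ": 174_155, "RQA": 52_179}
REPORTED_TOTAL = 226_334


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_samples(root):
    """Count records per split under a local copy of the repo (hypothetical path)."""
    root = Path(root)
    return {
        split: sum(1 for _ in iter_jsonl(root / rel_path))
        for split, rel_path in SPLIT_FILES.items()
    }


# The per-subset figures in the table add up to the stated total.
assert sum(REPORTED_COUNTS.values()) == REPORTED_TOTAL
```

Comparing `count_samples("<local copy>")` against `REPORTED_COUNTS` is one way to check that a download is complete; streaming line by line avoids loading a file of roughly 2.5 GB into memory at once.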

Provider:

xuejgy
