five

OpenScienceReasoning-2

收藏
魔搭社区2025-12-26 更新2025-08-02 收录
下载链接:
https://modelscope.cn/datasets/nv-community/OpenScienceReasoning-2
下载链接
链接失效反馈
官方服务:
资源简介:
## Dataset Description OpenScienceReasoning-2 is a multi-domain synthetic dataset designed to improve general-purpose reasoning in large language models (LLMs). The dataset contains multiple-choice and open-ended question-answer pairs with detailed reasoning traces and spans across diverse scientific domains, including STEM, law, economics, and humanities. OpenScience aims to boost accuracy on advanced benchmarks such as GPQA-Diamond, MMLU-Pro and HLE via supervised finetuning or reinforcement learning. This dataset is ready for commercial use. This dataset is an updated release of the OpenScience dataset (https://huggingface.co/datasets/nvidia/OpenScience). It includes newly generated questions produced with model DeepSeek-R1-0528, and replaces the original solutions with those generated by DeepSeek-R1-0528 for consistency and improved quality. ## Dataset Owner(s) NVIDIA Corporation ## Dataset Creation Date 20/06/2025 ## License/Terms of Use Governing Terms: This dataset is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/legalcode.en). ## Intended Usage OpenScienceReasoning-2 is intended to be used by the community to continue to improve open models. The data may be freely used to train and evaluate. <br> ## Data Version - v2 ## Dataset Characterization Data Collection Method: *Synthetic <br> Labeling Method: *Synthetic <br> ## Dataset Format Text ## Dataset Quantification Record Count: 1.6M question-answer pairs Total Data Storage: 35GB ## Ethical Considerations: NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## 数据集说明 OpenScienceReasoning-2 是一款多领域合成数据集,旨在提升大语言模型(Large Language Model, LLM)的通用推理能力。该数据集包含选择题与开放式问答对,并附带详细推理链条,涵盖科学、技术、工程、数学(STEM)、法学、经济学与人文科学等多元科学领域。本数据集旨在通过监督微调或强化学习,提升模型在 GPQA-Diamond、MMLU-Pro 及 HLE 等高级基准测试中的表现准确率。 本数据集可用于商业用途。 本数据集是 OpenScience 数据集(https://huggingface.co/datasets/nvidia/OpenScience)的更新版本,新增了由 DeepSeek-R1-0528 模型生成的全新问题,并将原始解答替换为 DeepSeek-R1-0528 生成的内容,以保证一致性并提升内容质量。 ## 数据集所有者 英伟达公司(NVIDIA Corporation) ## 数据集创建日期 2025年6月20日 ## 许可/使用条款 管辖条款:本数据集采用知识共享署名4.0国际许可协议(Creative Commons Attribution 4.0 International License,https://creativecommons.org/licenses/by/4.0/legalcode.en)。 ## 预期用途 OpenScienceReasoning-2 旨在供社区用于持续优化开源模型,相关数据可免费用于模型训练与评估。 ## 数据版本 - v2 ## 数据集特征 数据收集方法: * 合成生成 <br> 标注方法: * 合成生成 <br> ## 数据集格式 文本 ## 数据集量化指标 记录数量:160万条问答对 总数据存储量:35GB ## 伦理考量 英伟达(NVIDIA)认为可信人工智能是一项共同责任,我们已建立相关政策与实践规范,以支撑各类人工智能应用的开发。开发者在遵循服务条款下载或使用本数据集时,应与其内部模型团队协作,确保所开发的模型符合相关行业与应用场景的要求,并应对可能出现的产品滥用问题。 请[在此](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)提交安全漏洞或英伟达人工智能相关问题。
提供机构:
maas
创建时间:
2025-07-31
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作