MegaScience
Source: ModelScope community. Updated 2025-12-26; indexed 2025-08-02.
Download link:
https://modelscope.cn/datasets/MegaScience/MegaScience
# [MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning](https://arxiv.org/abs/2507.16812)
**Code:** https://github.com/GAIR-NLP/MegaScience
**Project Page:** https://huggingface.co/MegaScience
MegaScience is a large-scale mixture of high-quality open-source datasets consisting of 1.25 million instances. We first collect multiple public datasets, then conduct comprehensive ablation studies across different data selection methods to identify the optimal approach for each dataset, thereby contributing high-quality subsets. Furthermore, we annotate step-by-step solutions for all datasets except TextbookReasoning.
Llama3.1, Qwen2.5, and Qwen3 series base models trained on MegaScience outperform the corresponding official instruct models in average scientific reasoning performance, advancing the frontiers of the open-source community in the science domain. MegaScience is also more effective for larger and stronger models, which suggests a scaling benefit for scientific instruction tuning.

## Sample Usage
To use the MegaScience dataset, you can clone the repository with Git LFS:
```bash
git lfs install
git clone https://huggingface.co/datasets/MegaScience/MegaScience
```
For more detailed information on data processing, supervised fine-tuning, and evaluation, please refer to the comprehensive guides on the [MegaScience GitHub repository](https://github.com/GAIR-NLP/MegaScience).
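As an alternative to cloning, the dataset can probably be read with the Hugging Face `datasets` library. The following is a minimal sketch, not an official loader: the split name `"train"` is an assumption, and `column_coverage` is a hypothetical helper added here only to inspect which columns are populated, since this card does not list the column names.

```python
from collections import Counter


def load_megascience(split="train", streaming=True):
    """Load MegaScience from the Hugging Face Hub.

    Requires `pip install datasets` and network access.
    The split name "train" is an assumption, not confirmed by the card.
    """
    # Imported lazily so the helper below can be used without the library.
    from datasets import load_dataset

    return load_dataset("MegaScience/MegaScience", split=split, streaming=streaming)


def column_coverage(rows, limit=1000):
    """Count how often each column is non-empty in the first `limit` rows."""
    counts = Counter()
    for index, row in enumerate(rows):
        if index >= limit:
            break
        for key, value in row.items():
            if value not in (None, ""):
                counts[key] += 1
    return counts
```

Calling `column_coverage(load_megascience())` streams the first 1,000 rows and reports the populated columns, which is a quick way to check the schema before writing a fine-tuning pipeline.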
## The Curation of MegaScience

**Step 1**: Curate source data from NaturalReasoning, Nemotron-Science, and TextbookReasoning.
**Step 2**: Perform question deduplication and LLM-based decontamination.
**Step 3**: Conduct comprehensive ablation studies across different data selection methods to identify the optimal approach for each dataset, thereby contributing high-quality subsets.
**Step 4**: Annotate step-by-step solutions for NaturalReasoning and Nemotron-Science using DeepSeek-V3.
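The question deduplication in Step 2 can be illustrated with a small sketch. It uses exact matching after text normalization, which is a simple stand-in and not necessarily the authors' method. The field name `question` and both function names are assumptions made for illustration. The LLM-based decontamination is not covered here.

```python
import re
import unicodedata


def normalize_question(text):
    """Canonicalize a question so trivially different copies compare equal."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def deduplicate(records, key="question"):
    """Keep the first record for each normalized question, preserving order."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = normalize_question(record[key])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique
```

Exact matching only catches near-identical strings. Paraphrased duplicates would need a fuzzier method such as MinHash or embedding similarity.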
## Demonstration of Data Quality
Models trained on MegaScience significantly outperform their respective official Instruct counterparts on scientific reasoning tasks. Notably, this holds even for the state-of-the-art Qwen3 series, where MegaScience-trained models consistently surpass the strong Qwen3-Instruct baselines. MegaScience also scales well: the performance gains grow as the base model size increases.
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="instruct_megascience_comparsion.png" alt="Results" style="width:80%;">
</div>
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="push_results.png" alt="Results" style="width:80%;">
</div>
## Citation
Check out our [paper](https://arxiv.org/abs/2507.16812) for more details. If you use our dataset or find our work useful, please cite:
```bibtex
@article{fan2025megascience,
  title={MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning},
  author={Fan, Run-Ze and Wang, Zengzhi and Liu, Pengfei},
  year={2025},
  journal={arXiv preprint arXiv:2507.16812},
  url={https://arxiv.org/abs/2507.16812}
}
```
Provider:
maas
Created:
2025-07-31



