
MegaScience/MegaScience

Source: Hugging Face | Updated: 2025-07-24 | Indexed: 2025-08-09
Download link:
https://hf-mirror.com/datasets/MegaScience/MegaScience
Overview:
MegaScience is a large-scale mixture of high-quality open-source datasets comprising 1.25 million instances. It was built by collecting data from multiple public datasets and running comprehensive ablation studies over different data selection methods to identify the best method for each dataset, so that each contributes a high-quality subset. Step-by-step solutions are annotated for every dataset except TextbookReasoning. Llama3.1, Qwen2.5, and Qwen3 series base models trained on MegaScience outperform the corresponding official instruct models in scientific reasoning, advancing the open-source community's frontier in the science domain. MegaScience is also more effective for larger and stronger models, which suggests a scaling benefit for scientific instruction tuning.
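
With 1.25 million instances, it can be convenient to look at a few records before downloading everything. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` library is installed and that the repository exposes a `train` split. The helper name `peek` and its parameters are hypothetical, and the listing does not state the field names, so the sketch only returns raw records for inspection.

```python
import os
from itertools import islice

REPO_ID = "MegaScience/MegaScience"  # repository id taken from the listing


def peek(n=3, endpoint=None):
    """Return the first n records of the train split via streaming."""
    if endpoint is not None:
        # Assumption: HF_ENDPOINT only takes effect if it is set before the
        # Hugging Face libraries are imported for the first time.
        os.environ["HF_ENDPOINT"] = endpoint
    from datasets import load_dataset  # third-party: pip install datasets

    # Assumption: the dataset has a "train" split.
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `peek(2, endpoint="https://hf-mirror.com")` would route the request through the mirror given in the download link. Inspecting `rows[0].keys()` on the result is a simple way to discover the actual field names.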
Provided by:
MegaScience