
TextbookReasoning

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/MegaScience/TextbookReasoning
Resource description:
# [MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning](https://arxiv.org/abs/2507.16812)

## Dataset Description

Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets.

To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics.

Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.

## Links

- **Paper:** [MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning](https://arxiv.org/abs/2507.16812)
- **GitHub Repository:** [https://github.com/GAIR-NLP/MegaScience](https://github.com/GAIR-NLP/MegaScience)

![](main_figure.png)

## The Curation of TextbookReasoning

![](textbook_reasoning_overall_figure.png)

**Step 1**: Collect a diverse set of university-level science textbooks and convert the PDF documents into machine-readable text using the olmOCR pipeline.

**Step 2**: Use Llama3.3-70B-Instruct to extract Q-A pairs automatically from the processed textbook content.

**Step 3**: Perform question deduplication.

**Step 4**: Refine the extracted Q-A pairs given the relevant source documents using DeepSeek-V3.

**Step 5**: Perform LLM-based question decontamination.

## Demonstration of Data Quality

Models trained on TextbookReasoning significantly outperform models trained on other science datasets on scientific reasoning tasks.

<div style="display: flex; justify-content: center; gap: 20px;">
  <img src="main_results.png" alt="Results" style="width:70%;">
</div>

## Sample Usage

You can load this dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the "TextbookReasoning" dataset
dataset = load_dataset("MegaScience/TextbookReasoning")

# Access the training split
train_data = dataset["train"]

# Print the first example
print(train_data[0])

# Example output structure:
# {
#   'question': 'What are the three main types of rocks?',
#   'answer': 'The three main types of rocks are igneous, sedimentary, and metamorphic.',
#   'subject': 'geology',
#   'reference_answer': 'Igneous, sedimentary, and metamorphic.'
# }
```

## Citation

Check out our [paper](https://arxiv.org/abs/2507.16812) for more details.
If you use our dataset or find our work useful, please cite:

```
@article{fan2025megascience,
  title={MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning},
  author={Fan, Run-Ze and Wang, Zengzhi and Liu, Pengfei},
  year={2025},
  journal={arXiv preprint arXiv:2507.16812},
  url={https://arxiv.org/abs/2507.16812}
}
```
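As a follow-up to the Sample Usage section, here is a minimal sketch of how records with the four fields shown in the example output (`question`, `answer`, `subject`, `reference_answer`) might be counted and filtered by subject. It runs on a small in-memory list, so it needs no download. Only the first record is taken from the card; the other two, and the helper names `count_by_subject` and `filter_by_subject`, are hypothetical and exist only for illustration. The actual subject labels in the dataset may differ.

```python
from collections import Counter

# Records mirroring the field layout in the card's example output.
# Only the first record appears in the card; the other two are invented.
records = [
    {
        "question": "What are the three main types of rocks?",
        "answer": "The three main types of rocks are igneous, sedimentary, and metamorphic.",
        "subject": "geology",
        "reference_answer": "Igneous, sedimentary, and metamorphic.",
    },
    {
        "question": "What is the SI unit of force?",
        "answer": "The SI unit of force is the newton.",
        "subject": "physics",
        "reference_answer": "The newton (N).",
    },
    {
        "question": "What is the chemical symbol for sodium?",
        "answer": "The chemical symbol for sodium is Na.",
        "subject": "chemistry",
        "reference_answer": "Na",
    },
]


def count_by_subject(rows):
    """Return a Counter mapping each subject label to its number of rows."""
    return Counter(row["subject"] for row in rows)


def filter_by_subject(rows, subject):
    """Return the rows whose 'subject' field equals the given label."""
    return [row for row in rows if row["subject"] == subject]


subject_counts = count_by_subject(records)
geology_rows = filter_by_subject(records, "geology")

print(subject_counts)
print(geology_rows[0]["reference_answer"])
```

Because iterating a loaded split yields one dictionary per row, the same helpers should also accept `train_data` from the snippet in Sample Usage, although a full pass over 650k rows in plain Python will be slow. The `datasets` library's own `Dataset.filter` method is likely the better choice at that scale.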

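Step 3 of the curation pipeline deduplicates questions, but this card does not say how. The sketch below is therefore not the authors' method. It shows the simplest possible variant, exact matching after text normalization, to make the idea concrete. The function names and sample rows are hypothetical, and a real pipeline would more likely use a fuzzy or hash-based near-duplicate method.

```python
import re


def normalize_question(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def deduplicate_questions(rows):
    """Keep the first row seen for each normalized question string."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = normalize_question(row["question"])
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Invented rows: the first two differ only in case and punctuation.
sample_rows = [
    {"question": "What are the three main types of rocks?"},
    {"question": "what are the three main types of rocks"},
    {"question": "What is the SI unit of force?"},
]

unique_rows = deduplicate_questions(sample_rows)
print(len(unique_rows))  # 2
```

Exact matching only catches trivially reformatted copies. Paraphrased duplicates pass through, which is one reason large-scale pipelines tend to prefer similarity-based approaches.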
Provider:
maas
Created:
2025-07-31