datashop-science-qa

Name: datashop-science-qa
Creator: maas
Published: 2025-12-05 11:31:59
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/marin-community/datashop-science-qa

Resource description:

# Dataset Card for Datashop Science QA

## Dataset Details

### Dataset Description

This science-focused dataset was curated by applying model-based filtering to the [DCLM Baseline](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) dataset, extracting around 40B Llama-3 tokens of data, which were later rewritten into QA-pair format by Llama-3.1-8B-Instruct. It yields strong out-of-the-box improvements in MMLU scores, particularly on the MMLU STEM subset. We observe a +4 point increase on the MMLU STEM subset compared to a model trained on just DCLM Baseline for the same number of tokens.

#### Stage 1: Extracting science content from DCLM Baseline

We first annotate around 500K documents with quality scores to identify educational science content. The annotation prompt was generated using the automatic LLM labeling process described in the MEDU paper. The prompt is:

```
The following document is being considered as training data for a Large Language Model.

First, provide a concise description of the document and an assessment of the quality of the text or code in the document.

Key Attributes to Mention
- Languages contained in the document
- The coherence of the document
- The skills the document demonstrates
- The topics the document contains facts and information about

Document:
'''
{example}
'''

Based on your reasoning, give me a concrete decision about the utility of the document as training data for the following benchmark.

## Test Type
This evaluation appears to be a comprehensive assessment of a Large Language Model's knowledge and understanding across various scientific disciplines, potentially from a professional certification exam, university course, or specialized examination.

## Required Languages, Skills, and Knowledge
The language model would need to possess English language proficiency, strong reasoning and problem-solving skills, and a broad and deep knowledge base in multiple scientific fields, including biology, chemistry, physics, medicine, astronomy, and anatomy. It should have skills in comprehensive knowledge retrieval, analysis of complex concepts, critical thinking, and mathematical understanding, as well as familiarity with technical and scientific vocabulary, precise language, and specialized terminology.

## Ideal Training Data
The ideal training data should be a vast, diverse, and extensive collection of texts from various disciplines, including scientific articles, textbooks, academic papers, practice exams, educational materials, and websites. The data should cover a wide range of topics, question types, and formats, such as explanatory texts, problem sets, and question-answer pairs, to prepare the language model for the breadth and depth of questions it may encounter in the evaluation, and should be carefully curated to ensure it is accurate and up-to-date.

Output your decision about the utility of the data as "Final Score:" followed by one of the following words Great/Good/Okay/Poor/Useless.
```

We then use the annotated documents to train a quality filter model, which is later used to score over 400B DCLM tokens, of which we keep the top 10%.

#### Stage 2: Rewriting into QA Pairs

Inspired by the Rephrase the Web paper, we then rewrite the filtered documents into QA format using that paper's QA prompt. This yields about 12.6B Llama-3 tokens, which is the final dataset.

### Dataset Sources

- **Repository:** https://github.com/marin-community/marin
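
The Stage 1 labeling and filtering steps can be illustrated with a minimal Python sketch. It is not the Marin implementation: the function names (`parse_final_score`, `keep_top_fraction`), the numeric values assigned to the five labels, and the choice to rank documents by a single score are all assumptions made for illustration. The card only states that the prompt ends with a "Final Score:" verdict and that the top 10% of data is kept.

```python
import re

# Hypothetical ordinal mapping for the five labels the prompt allows.
# The card does not say how labels are converted to numbers.
SCORE_MAP = {"great": 4, "good": 3, "okay": 2, "poor": 1, "useless": 0}

# Tolerate punctuation or markdown emphasis between the colon and the label.
_FINAL_SCORE_RE = re.compile(
    r"Final Score:\W*(Great|Good|Okay|Poor|Useless)\b", re.IGNORECASE
)


def parse_final_score(llm_output):
    """Return the numeric score for the last 'Final Score:' verdict, or None."""
    matches = _FINAL_SCORE_RE.findall(llm_output)
    if not matches:
        return None
    return SCORE_MAP[matches[-1].lower()]


def keep_top_fraction(scored_docs, fraction=0.10):
    """Keep the highest-scoring `fraction` of (doc_id, score) pairs."""
    if not scored_docs:
        return []
    k = max(1, int(len(scored_docs) * fraction))
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


if __name__ == "__main__":
    print(parse_final_score("The document is coherent. Final Score: Good"))
    scored = [(f"doc{i}", i / 10) for i in range(10)]
    print(keep_top_fraction(scored, fraction=0.10))
```

In this sketch the parsed labels would serve as training targets for the quality filter model, and `keep_top_fraction` stands in for the top-10% cut applied to that model's scores. As a rough consistency check on the card's own numbers, keeping 10% of roughly 400B tokens gives about 40B tokens, which matches the figure quoted in the dataset description.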

Provider:

maas

Created:

2025-10-30

Collection summary

Dataset introduction