SCP-116K
Source: ModelScope community. Updated 2026-01-06; indexed 2025-02-08.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/SCP-116K
Description:
# Dataset Card for SCP-116K
## **Recent Updates**
We have made significant updates to the dataset, which are summarized below:
1. **Expansion with Mathematics Data**:
Added over 150,000 new math-related problem-solution pairs, bringing the total number of examples to **274,166**. Despite this substantial expansion, we have retained the original dataset name (`SCP-116K`) to maintain continuity and avoid disruption for users who have already integrated the dataset into their workflows.
2. **Updated Responses and Reasoning**:
Removed the previous responses generated by `o1-mini` and `QwQ-32B-preview`. Instead, we now include responses and reasoning processes generated by the **DeepSeek-r1** model. These are stored in two new fields:
- `r1_response`: The solution generated by DeepSeek-r1.
- `r1_reasoning_content`: The detailed reasoning process provided by DeepSeek-r1.
Note that these new responses do not include information on whether they match the ground truth solutions extracted from the source material.
3. **Renaming of Fields**:
The field `matched_solution` has been renamed to `extracted_solution` to better reflect its nature as a solution extracted directly from the source documents, avoiding potential ambiguity.
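For code written against the earlier release, the rename can be absorbed with a small shim. The following is a minimal sketch, assuming each record is a plain Python dict; the helper name `upgrade_record` and the sample record are hypothetical and not part of the dataset tooling.

```python
def upgrade_record(record: dict) -> dict:
    """Return a copy of `record` that uses the current field name.

    Hypothetical helper: renames `matched_solution` to `extracted_solution`
    and leaves every other key untouched.
    """
    upgraded = dict(record)  # shallow copy so the caller's dict is not mutated
    if "matched_solution" in upgraded and "extracted_solution" not in upgraded:
        upgraded["extracted_solution"] = upgraded.pop("matched_solution")
    return upgraded


# Made-up record in the old layout, for illustration only.
old = {"domain": "physics", "problem": "Find v.", "matched_solution": "v = 3 m/s"}
new = upgrade_record(old)
print(sorted(new))
```

Records that already use `extracted_solution` pass through unchanged, so the shim can be applied unconditionally.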
### **Upcoming Updates**
We are actively working on further improvements, including:
1. **Improved OCR Pipeline**:
We have identified that **Qwen2.5-VL-72B** demonstrates superior OCR capabilities compared to the previously used GPT-4o. We will soon update the dataset extraction pipeline to incorporate this model for enhanced OCR performance.
2. **Addressing Solution Extraction Deficiency**:
A known issue where the number of extracted solutions is significantly lower than the number of extracted problems has been traced back to limitations in GPT-4o's capabilities. This issue will be resolved in the next version of the dataset.
---
## Dataset Description
### Paper
[SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain](https://arxiv.org/abs/2501.15587)
### Dataset Summary
SCP-116K is a large-scale dataset containing **274,166 high-quality scientific problem-solution pairs**, automatically extracted from web-crawled documents. The dataset covers multiple scientific disciplines, including physics, chemistry, biology, and now mathematics, targeting undergraduate to doctoral-level content. Each problem is paired with the solution extracted from the source material (`extracted_solution`), along with a response and reasoning process generated by the DeepSeek-r1 model.
GitHub: [https://github.com/AQA6666/SCP-116K-open/tree/main](https://github.com/AQA6666/SCP-116K-open/tree/main)
### Supported Tasks
The dataset supports several tasks:
- Scientific Question Answering
- Scientific Reasoning
- Model Evaluation
- Knowledge Distillation
### Languages
The dataset is in English.
### Dataset Structure
The dataset contains the following columns:
- `domain`: The scientific domain of the problem (e.g., physics, chemistry, biology, mathematics).
- `problem`: The original problem text.
- `extracted_solution`: The solution extracted from the source material (previously named `matched_solution`).
- `r1_response`: Solution generated by the DeepSeek-r1 model.
- `r1_reasoning_content`: Detailed reasoning process provided by the DeepSeek-r1 model.
### Data Fields
- `domain`: string
- `problem`: string
- `extracted_solution`: string
- `r1_response`: string
- `r1_reasoning_content`: string
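A quick way to confirm that a record follows this layout is a schema check. The sketch below assumes records are plain dicts; `EXPECTED_FIELDS`, `schema_errors`, and the sample record are illustrative names and data, not part of the dataset.

```python
from typing import List

# The five string-typed fields listed above.
EXPECTED_FIELDS = (
    "domain",
    "problem",
    "extracted_solution",
    "r1_response",
    "r1_reasoning_content",
)


def schema_errors(record: dict) -> List[str]:
    """Return human-readable problems found in `record`; empty list if none."""
    errors = []
    for name in EXPECTED_FIELDS:
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            errors.append(f"{name} is {type(record[name]).__name__}, expected str")
    return errors


# Made-up record, for illustration only.
sample = {
    "domain": "physics",
    "problem": "A 2 kg mass accelerates at 3 m/s^2. Find the net force.",
    "extracted_solution": "F = ma = 6 N",
    "r1_response": "The net force is 6 N.",
    "r1_reasoning_content": "Apply Newton's second law: F = ma = 2 * 3 = 6 N.",
}
print(schema_errors(sample))
```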
### Data Splits
The dataset is provided as a single split containing all **274,166** examples.
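One possible way to load the split and check how the examples are distributed across domains is sketched below. It rests on several assumptions: the `modelscope` package is installed, the single split is exposed under the name `"train"`, and iterating over the loaded dataset yields dict-like rows. The helper names are hypothetical, and the toy records stand in for real data so the counting logic can be tried offline.

```python
from collections import Counter
from typing import Iterable, Mapping


def load_scp116k():
    """Load the dataset from ModelScope (requires network access).

    Assumption: the single split is named "train".
    """
    # Imported lazily so the rest of this sketch runs without modelscope.
    from modelscope.msdatasets import MsDataset

    return MsDataset.load("AI-ModelScope/SCP-116K", split="train")


def domain_counts(records: Iterable[Mapping[str, str]]) -> Counter:
    """Count examples per value of the `domain` field."""
    return Counter(record["domain"] for record in records)


# Toy records, for illustration only.
toy = [
    {"domain": "physics"},
    {"domain": "mathematics"},
    {"domain": "physics"},
]
counts = domain_counts(toy)
print(counts.most_common())
```

With the real data, `domain_counts(load_scp116k())` should sum to the 274,166 examples stated above if the assumptions hold.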
---
## Dataset Creation
### Source Data
The dataset was created by processing over **6.69 million academic documents**, filtering for high-quality university-level content, and extracting problem-solution pairs with an automated pipeline. The pipeline's stages include document retrieval, unified preprocessing, content segmentation, structured extraction, quality filtering, and problem-solution matching.
### Annotations
The dataset includes solutions and reasoning processes generated by the **DeepSeek-r1** model. Each generated solution is provided without explicit validation against the ground truth solution extracted from the source material.
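Because no match information ships with the data, users who need it have to compute their own. The following is a rough heuristic sketch, not the authors' method: it assumes the final answer in `r1_response` is wrapped in `\boxed{...}` (which may not hold for every record) and checks whether that answer appears verbatim in `extracted_solution`. All function names are hypothetical.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding '$', '.', and spaces."""
    return re.sub(r"\s+", " ", text.strip().lower()).strip("$ .")


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


def rough_match(record: dict) -> Optional[bool]:
    """True/False if a boxed answer can be compared, None if there is none."""
    answer = last_boxed(record["r1_response"])
    if answer is None or not normalize(answer):
        return None
    return normalize(answer) in normalize(record["extracted_solution"])


# Made-up record, for illustration only.
example = {
    "r1_response": "Therefore the probability is \\boxed{\\frac{1}{2}}",
    "extracted_solution": "The answer is $\\frac{1}{2}$.",
}
print(rough_match(example))
```

Substring matching is deliberately crude: it misses equivalent forms such as `0.5` versus `\frac{1}{2}` and can report false positives for short answers, so the result is better treated as a triage signal than as a correctness label.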
---
## Considerations for Using the Data
### Social Impact of Dataset
This dataset aims to advance scientific reasoning capabilities in AI systems and provide high-quality training data for developing more capable models in STEM disciplines. It can help democratize access to advanced scientific problem-solving capabilities and support education in scientific fields.
### Discussion of Biases
While efforts have been made to ensure high quality and diversity in the dataset, users should be aware that:
- The dataset may reflect biases present in web-crawled documents.
- Coverage across different scientific domains may not be perfectly balanced.
- The difficulty level of problems varies across the dataset.
### Other Known Limitations
- Solutions may occasionally reference figures or equations not included in the text.
- Some problems may require specialized domain knowledge for full understanding.
- The dataset focuses primarily on theoretical problems rather than experimental ones.
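The first limitation can be partly screened for with a text heuristic. The sketch below flags solutions that cite a numbered figure or equation; it cannot tell whether the cited material is actually missing, and the pattern and function name are illustrative assumptions.

```python
import re

# Matches references such as "Fig. 3", "Figure 12", "Eq. (2.4)", "equations 5".
NUMBERED_REF = re.compile(
    r"\b(?:fig(?:ure)?s?|eqn?s?|equations?)\.?\s*\(?\d",
    re.IGNORECASE,
)


def cites_numbered_material(record: dict) -> bool:
    """Return True if `extracted_solution` cites a numbered figure or equation."""
    return bool(NUMBERED_REF.search(record["extracted_solution"]))


# Made-up records, for illustration only.
with_ref = {"extracted_solution": "From the circuit in Fig. 3, the current is 2 A."}
without_ref = {"extracted_solution": "By energy conservation, v = sqrt(2gh)."}
print(cites_numbered_material(with_ref), cites_numbered_material(without_ref))
```

Flagged records can then be reviewed manually or excluded, depending on how much noise the downstream task tolerates.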
---
## Additional Information
### Dataset Curators
The dataset was created as part of research work on improving scientific reasoning capabilities in language models.
### Licensing Information
This dataset is released under the **CC BY-NC-SA 4.0** license.
### Citation Information
If you use this dataset in your research, please cite:
```bibtex
@misc{lu2025scp116khighqualityproblemsolutiondataset,
title={SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain},
author={Dakuan Lu and Xiaoyu Tan and Rui Xu and Tianchu Yao and Chao Qu and Wei Chu and Yinghui Xu and Yuan Qi},
year={2025},
eprint={2501.15587},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.15587},
}
```
Provider:
maas
Created:
2025-02-04



