chempile-education
ModelScope community · updated 2025-12-05 · indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/jablonkagroup/chempile-education
# ChemPile-Education
<div align="center">

[Hugging Face dataset](https://huggingface.co/datasets/jablonkagroup/chempile-education)
[License: CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
[arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
[ChemPile website](https://chempile.lamalab.org/)
*A comprehensive chemistry education dataset containing 129.26M+ tokens from diverse educational sources*
</div>
## 📋 Dataset Summary
ChemPile-Education is a large-scale chemistry natural language dataset extracted from diverse educational resources, including open-source textbooks, lecture notes, transcripts, and other educational materials. This dataset captures fundamental chemistry knowledge and concepts as students would encounter them in textbooks and classroom settings.
### 📊 Dataset Statistics
| Subset | Tokens | Documents | Description |
|--------|--------|-----------|-------------|
| LibreText Chemistry | 114M | 58.9K | Open-source chemistry textbooks |
| MIT OCW Lecture Transcripts | 10M | 58K | University-level lecture transcripts |
| US Olympiad Problems | 260K | 1.22K | Competition chemistry problems with solutions |
| YouTube Transcripts | 5M | 5.09K | Educational chemistry video transcripts |
| **Total** | **~129.26M** | **~123K** | Chemical educational content |
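As a quick sanity check, the totals in the table can be reproduced by summing the per-subset figures. The snippet below is only an arithmetic sketch: the numbers are the rounded values from the table, and the variable names are illustrative.

```python
# Rounded figures copied from the statistics table.
# Tokens are in millions, documents in thousands.
tokens_millions = {
    "LibreText Chemistry": 114.0,
    "MIT OCW Lecture Transcripts": 10.0,
    "US Olympiad Problems": 0.26,
    "YouTube Transcripts": 5.0,
}
documents_thousands = {
    "LibreText Chemistry": 58.9,
    "MIT OCW Lecture Transcripts": 58.0,
    "US Olympiad Problems": 1.22,
    "YouTube Transcripts": 5.09,
}

# Round to two decimals to hide floating-point noise.
total_tokens_m = round(sum(tokens_millions.values()), 2)
total_docs_k = round(sum(documents_thousands.values()), 2)

print(f"Total tokens: ~{total_tokens_m}M")
print(f"Total documents: ~{total_docs_k}K")
```

The sums come out at about 129.26M tokens and about 123K documents, matching the **Total** row.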
## 🗂️ Dataset Configurations
The dataset includes four distinct subsets available as Hugging Face configurations:
- `LibreText_Chemistry-default`
- `mit-ocw-lecture-transcripts-default`
- `us-olympiad-problems-default`
- `youtube-transcripts-as-lectures-default`
## 📜 License
All content is released under the **CC BY-NC-SA 4.0** license, which allows for:
- ✅ Non-commercial use
- ✅ Sharing and redistribution
- ✅ Adaptation and modification
- ⚠️ Attribution required
- ⚠️ Share-alike (derivatives must use same license)
## 📖 Dataset Details
### 🧪 LibreText Chemistry
**Source**: [LibreTexts Chemistry](https://chem.libretexts.org/) - A comprehensive open-source chemistry education platform
**Coverage**: General chemistry, organic chemistry, inorganic chemistry, physical chemistry, and biochemistry
**Extraction Method**: Custom web scraper for HTML content extraction
**Fields**:
- `url`: Source URL for transparency and verification
- `text`: Educational content about chemistry concepts
**Statistics**: 114M tokens across 58,946 documents covering undergraduate to graduate-level chemistry topics
### 🎓 MIT OCW Lecture Transcripts
**Source**: [MIT OpenCourseWare](https://ocw.mit.edu/) - Free online course materials from MIT
**Coverage**: Chemistry, biology, chemical engineering, and physics courses
**Extraction Method**: Automated download using keyword-based course identification
**Fields**:
- `course`: Course name and identifier
- `url`: Original source URL for reference
- `topic`: Specific lecture topic
- `text`: Lecture transcript content
- `index`: Document identifier
**Statistics**: 10M tokens from approximately 500 university-level lectures
### 🏆 US Olympiad Problems
**Source**: [American Chemical Society (ACS)](https://www.acs.org/) US Chemistry Olympiad materials
**Coverage**: Competition-level chemistry problems with detailed solutions
**Extraction Method**: PDF processing with LLM-generated Q&A pairs using *Gemini 2.0 Flash Thinking Experimental 01-21*
**Quality Control**:
- Minimum 250 characters per answer
- Expert manual review by chemistry professionals
- Filtered for educational value
**Fields**:
- `metadata`: Problem details (year, number, topic)
- `problem_statement`: Original olympiad problem
- `options`: Multiple choice options
- `solution`: Detailed solution explanation
- `correct_answer`: Correct option identifier
- `text`: Combined problem and solution content
- `filter`: Quality control flag
- `index`: Document identifier
**Statistics**: 260K+ tokens from ~400 competition problems with comprehensive solutions
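The minimum-length rule from the quality-control list can be illustrated with a small filter. This is a minimal sketch, not the authors' actual pipeline: it assumes that "answer" refers to the `solution` field and that surrounding whitespace is ignored, and the records shown are made up for illustration.

```python
MIN_ANSWER_CHARS = 250  # threshold stated in the quality-control list


def passes_length_check(record: dict, min_chars: int = MIN_ANSWER_CHARS) -> bool:
    """Return True if the record's `solution` has at least `min_chars` characters.

    Assumption: the length rule applies to the `solution` field after
    stripping leading and trailing whitespace.
    """
    return len(record.get("solution", "").strip()) >= min_chars


# Hypothetical records, not taken from the dataset.
records = [
    {"index": 0, "solution": "The answer is B."},
    {"index": 1, "solution": "Step-by-step reasoning. " * 20},
]

kept = [r for r in records if passes_length_check(r)]
print([r["index"] for r in kept])
```

Only the second record, whose solution is longer than 250 characters, survives the filter.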
### 📺 YouTube Transcripts as Lectures
**Source**: Educational chemistry videos from YouTube with permissive CC licenses
**Coverage**: Diverse chemistry topics from educational content creators
**Extraction Method**:
- LLM-generated search keywords for educational chemistry content
- Automated transcript extraction from CC-licensed videos
- Content cleaning and structuring using *GPT-4.1*
**Fields**:
- `id`: Unique YouTube video identifier
- `title`: Video title
- `text`: Cleaned and structured transcript content
- `index`: Document identifier
**Statistics**: 5M+ tokens from ~1,000 educational chemistry videos
## 🚀 Quick Start
```python
from datasets import load_dataset, get_dataset_config_names

# List all available configurations
configs = get_dataset_config_names("jablonkagroup/chempile-education")
print(f"Available configs: {configs}")
# ['LibreText_Chemistry-default', 'mit-ocw-lecture-transcripts-default',
#  'us-olympiad-problems-default', 'youtube-transcripts-as-lectures-default']

# Load a specific subset
dataset = load_dataset("jablonkagroup/chempile-education", name="LibreText_Chemistry-default")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['url', 'text'],
#         num_rows: 53051
#     })
#     test: Dataset({
#         features: ['url', 'text'],
#         num_rows: 2948
#     })
#     val: Dataset({
#         features: ['url', 'text'],
#         num_rows: 2947
#     })
# })

# Access a sample
sample = dataset['train'][0]
print(f"Sample URL: {sample['url'][:50]}...")
print(f"Sample text: {sample['text'][:200]}...")
```
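The row counts in the printed `DatasetDict` can be cross-checked against the LibreText document count given earlier (58,946). The following arithmetic sketch uses only the numbers from the Quick Start output; the variable names are illustrative.

```python
# Row counts copied from the DatasetDict printed in the Quick Start output.
split_rows = {"train": 53051, "test": 2948, "val": 2947}

total_rows = sum(split_rows.values())
shares = {name: rows / total_rows for name, rows in split_rows.items()}

print(f"Total rows: {total_rows}")
for name, rows in split_rows.items():
    print(f"{name}: {rows} rows ({shares[name]:.1%})")
```

The three splits add up to 58,946 rows, which corresponds to a split of roughly 90% train, 5% test, and 5% validation for this subset.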
## 🎯 Use Cases
- **🤖 Language Model Training**: Pre-training or fine-tuning models for chemistry domain
- **📚 Educational AI**: Building chemistry tutoring and Q&A systems
- **🔍 Information Retrieval**: Chemistry knowledge base construction
- **📝 Content Generation**: Automated chemistry educational content creation
- **🧠 Domain Adaptation**: Adapting general models to chemistry domain
## ⚠️ Limitations & Considerations
- **Language**: English only (monolingual dataset)
- **Scope**: Focused on educational content; cutting-edge research may not be covered, and some areas of chemistry knowledge may be missing
- **Quality**: Variable quality across sources; some automated transcripts may contain errors
- **Bias**: Reflects biases present in educational materials and sources
- **License**: Non-commercial use only due to CC BY-NC-SA 4.0 license
## 🛠️ Data Processing Pipeline
1. **Collection**: Automated scraping and API-based collection from sources
2. **Extraction**: Text extraction from various formats (HTML, PDF, video transcripts)
3. **Cleaning**: LLM-assisted content cleaning and structuring
4. **Quality Control**: Expert review and filtering
5. **Standardization**: Consistent formatting and metadata addition
6. **Validation**: Split creation and final quality checks
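The card does not describe how the splits in step 6 were produced. As a hedged illustration only, the sketch below shows one common way to create a reproducible split of roughly 90/5/5, the proportions suggested by the Quick Start row counts. The function name `make_splits`, the fractions, and the seed are assumptions, not the authors' procedure.

```python
import random


def make_splits(items, train_frac=0.90, test_frac=0.05, seed=0):
    """Shuffle `items` reproducibly and cut them into train/test/val lists.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    Whatever remains after train and test goes to the validation split.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "val": shuffled[n_train + n_test:],
    }


# Stand-in document identifiers, not real dataset indices.
splits = make_splits(range(1000))
print({name: len(docs) for name, docs in splits.items()})
```

A seeded shuffle keeps the assignment stable across runs, so the same document always lands in the same split as long as the input order and seed are unchanged.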
## 🏗️ ChemPile Collection
This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences.
### Collection Overview
- **📊 Scale**: 75+ billion tokens across multiple modalities
- **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, and molecular images
- **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature
- **🔬 Curation**: Extensive expert curation and validation
- **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation
- **🌐 Availability**: Openly released via Hugging Face
## 📄 Citation
If you use this dataset in your research, please cite:
```bibtex
@article{mirza2025chempile0,
title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others},
year = {2025},
journal = {arXiv preprint arXiv:2505.12534}
}
```
## 👥 Contact & Support
- **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
- **Website**: [ChemPile Project](https://chempile.lamalab.org/)
- **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-education)
- **Issues**: Please report data issues or questions via the Hugging Face dataset page
---
<div align="center">

<i>Part of the ChemPile project - Advancing AI for Chemical Sciences</i>
</div>
Provider: maas
Created: 2025-05-29



