chempile-paper
Source: ModelScope community (updated 2025-12-05, indexed 2025-06-14)
Download link:
https://modelscope.cn/datasets/jablonkagroup/chempile-paper
Resource description:
# ChemPile-Paper
<div align="center">

[Hugging Face dataset](https://huggingface.co/datasets/jablonkagroup/chempile-paper)
[License: CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)
[arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
*A comprehensive collection of scientific literature spanning academic papers and preprints focused on chemistry and related fields*
</div>
## 📋 Dataset Summary
ChemPile-Paper serves as a resource for cutting-edge applications of chemical knowledge and reasoning, containing curated papers from diverse repositories. This dataset represents a comprehensive collection of scientific literature spanning academic papers and preprints, all focused on chemistry and related fields.
### 📊 Dataset Statistics
| Subset | Tokens | Documents | Description |
|--------|--------|-----------|-------------|
| ArXiv Cond-Mat Materials Science | 35.9M | 5.94K | Materials science papers |
| ArXiv Physics Chemical Physics | 62.3M | 6.75K | Chemical physics papers |
| bioRxiv | 82.6M | 60.3K | Biology preprints |
| medRxiv | 451M | 15.8K | Health sciences preprints |
| ChemRxiv | 210M | 28.9K | Chemistry community preprints |
| EuroPMC Chemistry Abstracts | 3.3B | 10.4M | Scientific literature abstracts |
| EuroPMC Chemistry Papers | 10B | 1.2M | Scientific literature full articles |
| **Total** | **~13.9B** | **~11.7M** | Chemical scientific literature |
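
As a rough way to compare subsets, the rounded figures in the table can be turned into mean tokens per document. The sketch below is illustrative only: the helper names `parse_count` and `tokens_per_document` are hypothetical, and the inputs are the rounded table values (document counts read with a decimal point), so the results are approximate.

```python
# Multipliers for the K/M/B suffixes used in the statistics table.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}

# (tokens, documents) per subset, copied from the rounded table values.
SUBSETS = {
    "ArXiv Cond-Mat Materials Science": ("35.9M", "5.94K"),
    "ArXiv Physics Chemical Physics": ("62.3M", "6.75K"),
    "bioRxiv": ("82.6M", "60.3K"),
    "medRxiv": ("451M", "15.8K"),
    "ChemRxiv": ("210M", "28.9K"),
    "EuroPMC Chemistry Abstracts": ("3.3B", "10.4M"),
    "EuroPMC Chemistry Papers": ("10B", "1.2M"),
}


def parse_count(value: str) -> float:
    """Convert a string such as '35.9M' or '5.94K' into a number."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in SUFFIXES:
        return float(value[:-1]) * SUFFIXES[suffix]
    return float(value)


def tokens_per_document(subsets: dict) -> dict:
    """Mean tokens per document for each subset (approximate, from rounded inputs)."""
    return {
        name: parse_count(tokens) / parse_count(docs)
        for name, (tokens, docs) in subsets.items()
    }
```

Calling `tokens_per_document(SUBSETS)` returns one approximate average per subset, which makes the difference between abstract-only and full-text subsets easy to see.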
## 🗂️ Dataset Configurations
The dataset includes different subsets available as Hugging Face configurations:
- `arxiv-cond-mat.mtrl-sci_processed-default`
- `arxiv-physics.chem-ph_processed-default`
- `biorxiv_processed-default`
- `chemrxiv_processed-default`
- `euro_pmc_chemistry_abstracts-default`
- `euro_pmc_chemistry_papers-default`
- `medrxiv_processed-default`
## 📜 License
All content is released under the **CC BY-NC-ND 4.0** license. Its main terms are:
- ✅ Non-commercial use
- ✅ Sharing and redistribution
- ⚠️ Attribution required
- ❌ No derivatives allowed
## 📖 Dataset Details
### 📚 ArXiv Subsets
**Sources**:
- Condensed Matter > Materials Science (`arxiv-cond-mat.mtrl-sci_processed-default`)
- Physics > Chemical Physics (`arxiv-physics.chem-ph_processed-default`)
**Coverage**: Academic papers from ArXiv in materials science and chemical physics
**Extraction Method**: Articles filtered by field using PaperScraper package for PDF download and processing
**Fields**:
- `fn`: ArXiv identifier (e.g., 10.48550_arXiv.0708.1447)
- `text`: Parsed text of the article
- `doi`: DOI of the article (if available)
- `title`: Article title
- `authors`: Article authors
- `index`: Document identifier (the Quick Start output lists this column as `__index_level_0__`)
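
The example `fn` value looks like a DOI whose slash has been replaced by an underscore. Assuming that convention holds (it is an inference from the example, not something the dataset card states), a DOI-like string could be recovered as sketched below. The helper `fn_to_doi` is hypothetical; where the `doi` field is populated, it should be preferred.

```python
from typing import Optional


def fn_to_doi(fn: str) -> Optional[str]:
    """Turn an identifier like '10.48550_arXiv.0708.1447' into a DOI-like string.

    Assumption: the first underscore stands in for the DOI slash.
    Returns None for identifiers that do not start with a DOI prefix ('10.').
    """
    prefix, sep, suffix = fn.partition("_")
    if sep and prefix.startswith("10.") and suffix:
        return f"{prefix}/{suffix}"
    return None
```

Identifiers that do not follow this pattern, such as `014597_file10`, yield `None` rather than a guessed DOI.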
**Statistics**:
- Materials Science: 35.9M tokens across 5,940 documents
- Chemical Physics: 62.3M tokens across 6,750 documents
### 🧬 bioRxiv and medRxiv
**Sources**:
- [bioRxiv](https://www.biorxiv.org/) - Biology preprint repository
- [medRxiv](https://www.medrxiv.org/) - Health sciences preprint repository
**Coverage**: Preprints in biology and health sciences with chemistry relevance
**Extraction Method**: PaperScraper package for DOI-based retrieval, processed with Nougat for text extraction
**Fields**:
- `fn`: Unique identifier (e.g., 014597_file10)
- `text`: Full text content extracted via Nougat
**Statistics**:
- bioRxiv: 82.6M tokens across 60,300 documents
- medRxiv: 451M tokens across 15,800 documents
### ⚗️ ChemRxiv
**Source**: [ChemRxiv](https://chemrxiv.org/) - Preprint server for the global chemistry community
**Coverage**: Chemistry preprints from the community
**Extraction Method**: PaperScraper for DOI-based retrieval, processed with Nougat for text extraction
**Fields**:
- `fn`: Unique identifier (e.g., 10.26434_chemrxiv-2022-cgnf5)
- `text`: Full text content extracted via Nougat
- `doi`: DOI of the article (if available)
- `title`: Article title
- `authors`: Article authors
- `license`: Preprint license (e.g., CC BY-NC 4.0)
- `published_url`: Publication URL
- `index`: Document identifier
**Statistics**: 210M tokens across 28,900 documents
### 🔬 EuroPMC Filtered Papers
**Source**: [EuroPMC](https://europepmc.org/) - 27 million abstracts and 5 million full-text articles
**Coverage**: Chemistry-related scientific papers filtered from comprehensive medical literature
**Extraction Method**:
- BERT-based multilabel classifier trained on CAMEL datasets (20,000 examples per discipline)
- Validated against FineWebMath annotations (F1-score ~0.77 on 150 manually annotated entries)
- Analysis of first five 512-token chunks per document with 50-token overlaps
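
The chunking step described above (first five 512-token chunks with 50-token overlaps) can be sketched as a sliding window with a stride of 512 - 50 = 462 tokens. This is a minimal illustration, not the authors' code: the function name `first_chunks` is hypothetical, the input is assumed to be an already tokenized list, and how per-chunk predictions are combined is not covered by the description.

```python
from typing import List, Sequence


def first_chunks(
    tokens: Sequence,
    chunk_size: int = 512,
    overlap: int = 50,
    max_chunks: int = 5,
) -> List[Sequence]:
    """Return up to `max_chunks` windows of `chunk_size` tokens.

    Consecutive windows share `overlap` tokens, so each window starts
    `chunk_size - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens) and len(chunks) < max_chunks:
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
        start += stride
    return chunks
```

With the default settings, windows begin at token offsets 0, 462, 924, 1386, and 1848, so at most the first 2,360 tokens of a document are inspected.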
**Quality Control**:
- Postprocessing to remove non-chemical content (authors, acknowledgments, page numbers)
- Chemistry-specific content identification and filtering
**Fields**:
- `pmcid`: PubMed Central identifier
- `pmid`: PubMed identifier
- `topic`: Main classification topic (e.g., "Chemistry", "Physics", "Biology")
- `confidence`: Classification confidence score
- `class_distribution`: Multilabel classification distribution
- `text`: Full article text content
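
The `topic` and `confidence` fields make it possible to keep only high-confidence chemistry records. The sketch below assumes that `confidence` is numeric and that records behave like dictionaries; the function name `select_chemistry` and the 0.8 threshold are hypothetical choices for illustration, not values recommended by the dataset authors.

```python
from typing import Dict, Iterable, Iterator


def select_chemistry(
    records: Iterable[Dict],
    topic: str = "Chemistry",
    min_confidence: float = 0.8,  # hypothetical threshold, tune for your use case
) -> Iterator[Dict]:
    """Yield records whose main topic matches and whose confidence is high enough."""
    for record in records:
        confidence = record.get("confidence")
        if confidence is None:
            continue  # skip records without a confidence score
        if record.get("topic") == topic and float(confidence) >= min_confidence:
            yield record
```

The same predicate could be passed to `Dataset.filter` from the Hugging Face `datasets` library when working with a loaded subset instead of plain dictionaries.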
**Statistics**:
- Abstracts: 3.3B tokens across 10.4M documents
- Full Papers: 10B tokens across 1.2M documents
## 🚀 Quick Start
```python
from datasets import load_dataset, get_dataset_config_names

# List all available configurations
configs = get_dataset_config_names("jablonkagroup/chempile-paper")
print(f"Available configs: {configs}")
# ['arxiv-cond-mat.mtrl-sci_processed-default', 'arxiv-physics.chem-ph_processed-default',
#  'biorxiv_processed-default', 'chemrxiv_processed-default', 'euro_pmc_chemistry_abstracts-default',
#  'euro_pmc_chemistry_papers-default', 'medrxiv_processed-default']

# Load a specific subset
dataset = load_dataset("jablonkagroup/chempile-paper", name="arxiv-cond-mat.mtrl-sci_processed-default")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 5899
#     })
#     test: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 328
#     })
#     val: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 328
#     })
# })

# Access a sample
sample = dataset['train'][0]
print(f"Sample ID: {sample['fn']}")
print(f"Sample text: {sample['text'][:200]}...")
```
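
For the larger subsets, such as the EuroPMC full papers, downloading everything before inspecting a few records may be impractical. A possible alternative is the `streaming=True` option of `load_dataset`, sketched below. The helper names `stream_subset` and `preview` are hypothetical, and the sketch assumes every configuration exposes a `train` split like the one shown above.

```python
from itertools import islice
from typing import Dict, Iterable, List


def stream_subset(config: str, split: str = "train"):
    """Open a subset lazily; records are fetched as they are iterated."""
    # Imported here so that `preview` stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "jablonkagroup/chempile-paper",
        name=config,
        split=split,  # assumption: the split exists for this configuration
        streaming=True,
    )


def preview(examples: Iterable[Dict], n: int = 3, width: int = 200) -> List[str]:
    """Return the first `width` characters of `text` for the first `n` records."""
    return [example["text"][:width] for example in islice(examples, n)]


# Example usage (requires network access):
# stream = stream_subset("euro_pmc_chemistry_papers-default")
# for snippet in preview(stream):
#     print(snippet)
```

Because `preview` only relies on iteration, it works on both streamed and fully loaded splits.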
## 🎯 Use Cases
- **🤖 Language Model Training**: Pre-training or fine-tuning models for chemistry domain with cutting-edge research
- **🔬 Research Intelligence**: Building systems for scientific literature analysis and discovery
- **🔍 Information Retrieval**: Advanced chemistry knowledge base construction from research literature
- **📝 Content Generation**: Automated scientific writing and research synthesis
- **🧠 Domain Adaptation**: Adapting models to cutting-edge chemical research and terminology
## ⚠️ Limitations & Considerations
- **Language**: Primarily English (monolingual dataset)
- **Scope**: Focused on published research; may include technical jargon and advanced concepts
- **Quality**: Variable quality across sources; some OCR errors possible in older papers
- **Bias**: Reflects biases present in scientific publishing and academic literature
- **License**: No derivatives allowed due to CC BY-NC-ND 4.0 license
- **Recency**: Content reflects publication dates; work published after the data was collected is not included
## 🛠️ Data Processing Pipeline
1. **Collection**: Automated scraping from academic repositories and databases
2. **Filtering**: BERT-based classification for chemistry relevance
3. **Extraction**: PDF processing with PaperScraper and Nougat OCR
4. **Quality Control**: Automated filtering and expert validation
5. **Standardization**: Consistent formatting and metadata extraction
6. **Validation**: Train/validation/test splits and quality checks
## 🏗️ ChemPile Collection
This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences.
### Collection Overview
- **📊 Scale**: 75+ billion tokens across multiple modalities
- **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, and molecular images
- **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature
- **🔬 Curation**: Extensive expert curation and validation
- **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation
- **🌐 Availability**: Openly released via Hugging Face
## 📄 Citation
If you use this dataset in your research, please cite:
```bibtex
@article{mirza2025chempile0,
title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others},
year = {2025},
journal = {arXiv preprint arXiv:2505.12534}
}
```
## 👥 Contact & Support
- **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
- **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-paper)
- **Issues**: Please report data issues or questions via the Hugging Face dataset page
---
<div align="center">

<i>Part of the ChemPile project - Advancing AI for Chemical Sciences</i>
</div>
Provider: maas
Created: 2025-05-28