
chempile-paper

ModelScope community · Updated 2025-12-05 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/jablonkagroup/chempile-paper
Resource description:
# ChemPile-Paper

<div align="center">

![ChemPile Logo](CHEMPILE_LOGO.png)

[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/chempile-paper)
[![License: CC BY-NC-ND 4.0](https://img.shields.io/badge/License-CC%20BY--NC--ND%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-nd/4.0/)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2505.12534)

*A comprehensive collection of scientific literature spanning academic papers and preprints focused on chemistry and related fields*

</div>

## 📋 Dataset Summary

ChemPile-Paper serves as a resource for cutting-edge applications of chemical knowledge and reasoning. It contains curated academic papers and preprints from diverse repositories, all focused on chemistry and related fields.

### 📊 Dataset Statistics

| Subset | Tokens | Documents | Description |
|--------|--------|-----------|-------------|
| ArXiv Cond-Mat Materials Science | 35.9M | 5.94K | Materials science papers |
| ArXiv Physics Chemical Physics | 62.3M | 6.75K | Chemical physics papers |
| bioRxiv | 82.6M | 60.3K | Biology preprints |
| medRxiv | 451M | 15.8K | Health sciences preprints |
| ChemRxiv | 210M | 28.9K | Chemistry community preprints |
| EuroPMC Chemistry Abstracts | 3.3B | 10.4M | Scientific literature abstracts |
| EuroPMC Chemistry Papers | 10B | 1.2M | Scientific literature full articles |
| **Total** | **~13.9B** | **~11.7M** | Chemical scientific literature |

## 🗂️ Dataset Configurations

The dataset includes different subsets available as Hugging Face configurations:

- `arxiv-cond-mat.mtrl-sci_processed-default`
- `arxiv-physics.chem-ph_processed-default`
- `biorxiv_processed-default`
- `chemrxiv_processed-default`
- `euro_pmc_chemistry_abstracts-default`
- `euro_pmc_chemistry_papers-default`
- `medrxiv_processed-default`

## 📜 License

All content is released under the **CC BY-NC-ND 4.0** license, with the following terms:

- ✅ Non-commercial use
- ✅ Sharing and redistribution
- ⚠️ Attribution required
- ❌ No derivatives allowed

## 📖 Dataset Details

### 📚 ArXiv Subsets

**Sources**:

- Condensed Matter > Materials Science (`arxiv-cond-mat.mtrl-sci_processed-default`)
- Physics > Chemical Physics (`arxiv-physics.chem-ph_processed-default`)

**Coverage**: Academic papers from ArXiv in materials science and chemical physics

**Extraction Method**: Articles filtered by field, with the PaperScraper package used for PDF download and processing

**Fields**:

- `fn`: ArXiv identifier (e.g., 10.48550_arXiv.0708.1447)
- `text`: Parsed text of the article
- `doi`: DOI of the article (if available)
- `title`: Article title
- `authors`: Article authors
- `index`: Document identifier

**Statistics**:

- Materials Science: 35.9M tokens across 5,940 documents
- Chemical Physics: 62.3M tokens across 6,750 documents

### 🧬 bioRxiv and medRxiv

**Sources**:

- [bioRxiv](https://www.biorxiv.org/) - Biology preprint repository
- [medRxiv](https://www.medrxiv.org/) - Health sciences preprint repository

**Coverage**: Preprints in biology and health sciences with chemistry relevance

**Extraction Method**: PaperScraper package for DOI-based retrieval, processed with Nougat for text extraction

**Fields**:

- `fn`: Unique identifier (e.g., 014597_file10)
- `text`: Full text content extracted via Nougat

**Statistics**:

- bioRxiv: 82.6M tokens across 60,300 documents
- medRxiv: 451M tokens across 15,800 documents

### ⚗️ ChemRxiv

**Source**: [ChemRxiv](https://chemrxiv.org/) - Preprint server for the global chemistry community

**Coverage**: Chemistry preprints from the community

**Extraction Method**: PaperScraper for DOI-based retrieval, processed with Nougat for text extraction

**Fields**:

- `fn`: Unique identifier (e.g., 10.26434_chemrxiv-2022-cgnf5)
- `text`: Full text content extracted via Nougat
- `doi`: DOI of the article (if available)
- `title`: Article title
- `authors`: Article authors
- `license`: Preprint license (e.g., CC BY-NC 4.0)
- `published_url`: Publication URL
- `index`: Document identifier

**Statistics**: 210M tokens across 28,900 documents

### 🔬 EuroPMC Filtered Papers

**Source**: [EuroPMC](https://europepmc.org/) - 27 million abstracts and 5 million full-text articles

**Coverage**: Chemistry-related scientific papers filtered from comprehensive medical literature

**Extraction Method**:

- BERT-based multilabel classifier trained on CAMEL datasets (20,000 examples per discipline)
- Validated against FineWebMath annotations (F1-score ~0.77 on 150 manually annotated entries)
- Analysis of the first five 512-token chunks per document, with 50-token overlaps

**Quality Control**:

- Postprocessing to remove non-chemical content (authors, acknowledgments, page numbers)
- Chemistry-specific content identification and filtering

**Fields**:

- `pmcid`: PubMed Central identifier
- `pmid`: PubMed identifier
- `topic`: Main classification topic (e.g., "Chemistry", "Physics", "Biology")
- `confidence`: Classification confidence score
- `class_distribution`: Multilabel classification distribution
- `text`: Full article text content

**Statistics**:

- Abstracts: 3.3B tokens across 10.4M documents
- Full Papers: 10B tokens across 1.2M documents

## 🚀 Quick Start

```python
from datasets import load_dataset, get_dataset_config_names

# List all available configurations
configs = get_dataset_config_names("jablonkagroup/chempile-paper")
print(f"Available configs: {configs}")
# ['arxiv-cond-mat.mtrl-sci_processed-default', 'arxiv-physics.chem-ph_processed-default',
#  'biorxiv_processed-default', 'chemrxiv_processed-default', 'euro_pmc_chemistry_abstracts-default',
#  'euro_pmc_chemistry_papers-default', 'medrxiv_processed-default']

# Load a specific subset
dataset = load_dataset("jablonkagroup/chempile-paper", name="arxiv-cond-mat.mtrl-sci_processed-default")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 5899
#     })
#     test: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 328
#     })
#     val: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 328
#     })
# })

# Access a sample
sample = dataset['train'][0]
print(f"Sample ID: {sample['fn']}")
print(f"Sample text: {sample['text'][:200]}...")
```

## 🎯 Use Cases

- **🤖 Language Model Training**: Pre-training or fine-tuning models for the chemistry domain with cutting-edge research
- **🔬 Research Intelligence**: Building systems for scientific literature analysis and discovery
- **🔍 Information Retrieval**: Advanced chemistry knowledge base construction from research literature
- **📝 Content Generation**: Automated scientific writing and research synthesis
- **🧠 Domain Adaptation**: Adapting models to cutting-edge chemical research and terminology

## ⚠️ Limitations & Considerations

- **Language**: Primarily English (monolingual dataset)
- **Scope**: Focused on published research; may include technical jargon and advanced concepts
- **Quality**: Variable quality across sources; some OCR errors possible in older papers
- **Bias**: Reflects biases present in scientific publishing and academic literature
- **License**: No derivatives allowed due to the CC BY-NC-ND 4.0 license
- **Recency**: Content reflects publication dates; the most recent developments may not be included

## 🛠️ Data Processing Pipeline

1. **Collection**: Automated scraping from academic repositories and databases
2. **Filtering**: BERT-based classification for chemistry relevance
3. **Extraction**: PDF processing with PaperScraper and Nougat OCR
4. **Quality Control**: Automated filtering and expert validation
5. **Standardization**: Consistent formatting and metadata extraction
6. **Validation**: Train/validation/test splits and quality checks

## 🏗️ ChemPile Collection

This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences.

### Collection Overview

- **📊 Scale**: 75+ billion tokens across multiple modalities
- **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, and molecular images
- **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature
- **🔬 Curation**: Extensive expert curation and validation
- **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation
- **🌐 Availability**: Openly released via Hugging Face

## 📄 Citation

If you use this dataset in your research, please cite:

```bibtex
@article{mirza2025chempile0,
  title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
  author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others},
  year = {2025},
  journal = {arXiv preprint arXiv:2505.12534}
}
```

## 👥 Contact & Support

- **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
- **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-paper)
- **Issues**: Please report data issues or questions via the Hugging Face dataset page

---

<div align="center">

![ChemPile Logo](CHEMPILE_LOGO.png)

<i>Part of the ChemPile project - Advancing AI for Chemical Sciences</i>

</div>
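
The EuroPMC extraction method above classifies only the first five 512-token chunks of each document, with 50-token overlaps. The following is a minimal sketch of how such windows could be cut from a token sequence. It is not the authors' code: the function name `leading_chunks` is hypothetical, and it assumes that "overlap" means each window starts `chunk_size - overlap` tokens after the previous one.

```python
from typing import List, Sequence


def leading_chunks(
    tokens: Sequence[int],
    chunk_size: int = 512,
    overlap: int = 50,
    max_chunks: int = 5,
) -> List[List[int]]:
    """Return up to `max_chunks` overlapping windows from the start of `tokens`.

    Assumption: consecutive windows share `overlap` tokens, so the stride
    between window starts is `chunk_size - overlap`.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), stride):
        if len(chunks) == max_chunks:
            break  # only the leading part of the document is analysed
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Integers stand in for real tokenizer output.
token_ids = list(range(3000))
windows = leading_chunks(token_ids)
```

With the default settings, at most 5 × 512 − 4 × 50 = 2,360 leading tokens of a document are covered, so long full-text articles are classified from their opening sections only.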

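The EuroPMC subsets expose `topic` and `confidence` fields, which makes it possible to keep only records the classifier labelled confidently. The sketch below shows one way to do that over any iterable of records. The helper name `select_by_topic`, the 0.9 threshold, and the assumption that `confidence` is a float are all illustrative and not taken from the dataset documentation.

```python
from typing import Any, Dict, Iterable, Iterator


def select_by_topic(
    records: Iterable[Dict[str, Any]],
    topic: str = "Chemistry",
    min_confidence: float = 0.9,  # hypothetical threshold, tune for your use case
) -> Iterator[Dict[str, Any]]:
    """Yield records whose `topic` matches and whose `confidence` is high enough."""
    for record in records:
        confidence = record.get("confidence")
        if confidence is None:
            continue  # skip records without a usable score
        if record.get("topic") == topic and confidence >= min_confidence:
            yield record


# Made-up records that mirror the documented EuroPMC field names.
examples = [
    {"pmcid": "PMC-A", "topic": "Chemistry", "confidence": 0.97, "text": "..."},
    {"pmcid": "PMC-B", "topic": "Biology", "confidence": 0.99, "text": "..."},
    {"pmcid": "PMC-C", "topic": "Chemistry", "confidence": 0.55, "text": "..."},
    {"pmcid": "PMC-D", "topic": "Chemistry", "confidence": None, "text": "..."},
]
kept = list(select_by_topic(examples))

# To apply the same filter to the real data without downloading it in full,
# the records could come from a streamed split (the split name "train" is an
# assumption for this configuration):
#   from datasets import load_dataset
#   stream = load_dataset("jablonkagroup/chempile-paper",
#                         name="euro_pmc_chemistry_papers-default",
#                         split="train", streaming=True)
#   for record in select_by_topic(stream):
#       ...
```

Because the function is a generator, it works unchanged on in-memory lists and on streamed datasets, which matters for the multi-billion-token EuroPMC subsets.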
Provider:
maas
Created:
2025-05-28