chempile-paper

Name: chempile-paper
Creator: maas
Published: 2025-12-05 16:36:30
License: No description provided

ModelScope Community; updated 2025-12-05; indexed 2025-06-14

Download link:

https://modelscope.cn/datasets/jablonkagroup/chempile-paper


Resource description:

# ChemPile-Paper

<div align="center">

![ChemPile Logo](CHEMPILE_LOGO.png)

[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/chempile-paper)
[![License: CC BY-NC-ND 4.0](https://img.shields.io/badge/License-CC%20BY--NC--ND%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-nd/4.0/)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2505.12534)

*A comprehensive collection of scientific literature spanning academic papers and preprints focused on chemistry and related fields*

</div>

## 📋 Dataset Summary

ChemPile-Paper serves as a resource for cutting-edge applications of chemical knowledge and reasoning, containing curated papers from diverse repositories. This dataset represents a comprehensive collection of scientific literature spanning academic papers and preprints, all focused on chemistry and related fields.

### 📊 Dataset Statistics

| Subset | Tokens | Documents | Description |
|--------|--------|-----------|-------------|
| ArXiv Cond-Mat Materials Science | 35.9M | 5.94K | Materials science papers |
| ArXiv Physics Chemical Physics | 62.3M | 6.75K | Chemical physics papers |
| bioRxiv | 82.6M | 60.3K | Biology preprints |
| medRxiv | 451M | 15.8K | Health sciences preprints |
| ChemRxiv | 210M | 28.9K | Chemistry community preprints |
| EuroPMC Chemistry Abstracts | 3.3B | 10.4M | Scientific literature abstracts |
| EuroPMC Chemistry Papers | 10B | 1.2M | Scientific literature full articles |
| **Total** | **~13.9B** | **~11.7M** | Chemical scientific literature |

## 🗂️ Dataset Configurations

The dataset includes different subsets available as Hugging Face configurations:

- `arxiv-cond-mat.mtrl-sci_processed-default`
- `arxiv-physics.chem-ph_processed-default`
- `biorxiv_processed-default`
- `chemrxiv_processed-default`
- `euro_pmc_chemistry_abstracts-default`
- `euro_pmc_chemistry_papers-default`
- `medrxiv_processed-default`

## 📜 License

All content is released under the **CC BY-NC-ND 4.0** license, which allows for:

- ✅ Non-commercial use
- ✅ Sharing and redistribution
- ⚠️ Attribution required
- ❌ No derivatives allowed

## 📖 Dataset Details

### 📚 ArXiv Subsets

**Sources**:

- Condensed Matter > Materials Science (`arxiv-cond-mat.mtrl-sci_processed-default`)
- Physics > Chemical Physics (`arxiv-physics.chem-ph_processed-default`)

**Coverage**: Academic papers from ArXiv in materials science and chemical physics

**Extraction Method**: Articles filtered by field using PaperScraper package for PDF download and processing

**Fields**:

- `fn`: ArXiv identifier (e.g., 10.48550_arXiv.0708.1447)
- `text`: Parsed text of the article
- `doi`: DOI of the article (if available)
- `title`: Article title
- `authors`: Article authors
- `index`: Document identifier

**Statistics**:

- Materials Science: 35.9M tokens across 5,940 documents
- Chemical Physics: 62.3M tokens across 6,750 documents

### 🧬 bioRxiv and medRxiv

**Sources**:

- [bioRxiv](https://www.biorxiv.org/) - Biology preprint repository
- [medRxiv](https://www.medrxiv.org/) - Health sciences preprint repository

**Coverage**: Preprints in biology and health sciences with chemistry relevance

**Extraction Method**: PaperScraper package for DOI-based retrieval, processed with Nougat for text extraction

**Fields**:

- `fn`: Unique identifier (e.g., 014597_file10)
- `text`: Full text content extracted via Nougat

**Statistics**:

- bioRxiv: 82.6M tokens across 60,300 documents
- medRxiv: 451M tokens across 15,800 documents

### ⚗️ ChemRxiv

**Source**: [ChemRxiv](https://chemrxiv.org/) - Preprint server for the global chemistry community

**Coverage**: Chemistry preprints from the community

**Extraction Method**: PaperScraper for DOI-based retrieval, processed with Nougat for text extraction

**Fields**:

- `fn`: Unique identifier (e.g., 10.26434_chemrxiv-2022-cgnf5)
- `text`: Full text content extracted via Nougat
- `doi`: DOI of the article (if available)
- `title`: Article title
- `authors`: Article authors
- `license`: Preprint license (e.g., CC BY-NC 4.0)
- `published_url`: Publication URL
- `index`: Document identifier

**Statistics**: 210M tokens across 28,900 documents

### 🔬 EuroPMC Filtered Papers

**Source**: [EuroPMC](https://europepmc.org/) - 27 million abstracts and 5 million full-text articles

**Coverage**: Chemistry-related scientific papers filtered from comprehensive medical literature

**Extraction Method**:

- BERT-based multilabel classifier trained on CAMEL datasets (20,000 examples per discipline)
- Validated against FineWebMath annotations (F1-score ~0.77 on 150 manually annotated entries)
- Analysis of first five 512-token chunks per document with 50-token overlaps

**Quality Control**:

- Postprocessing to remove non-chemical content (authors, acknowledgments, page numbers)
- Chemistry-specific content identification and filtering

**Fields**:

- `pmcid`: PubMed Central identifier
- `pmid`: PubMed identifier
- `topic`: Main classification topic (e.g., "Chemistry", "Physics", "Biology")
- `confidence`: Classification confidence score
- `class_distribution`: Multilabel classification distribution
- `text`: Full article text content

**Statistics**:

- Abstracts: 3.3B tokens across 10.4M documents
- Full Papers: 10B tokens across 1.2M documents

## 🚀 Quick Start

```python
from datasets import load_dataset, get_dataset_config_names

# List all available configurations
configs = get_dataset_config_names("jablonkagroup/chempile-paper")
print(f"Available configs: {configs}")
# ['arxiv-cond-mat.mtrl-sci_processed-default', 'arxiv-physics.chem-ph_processed-default',
#  'biorxiv_processed-default', 'chemrxiv_processed-default', 'euro_pmc_chemistry_abstracts-default',
#  'euro_pmc_chemistry_papers-default', 'medrxiv_processed-default']

# Load a specific subset
dataset = load_dataset("jablonkagroup/chempile-paper", name="arxiv-cond-mat.mtrl-sci_processed-default")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 5899
#     })
#     test: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 328
#     })
#     val: Dataset({
#         features: ['fn', 'text', 'doi', 'title', 'authors', '__index_level_0__'],
#         num_rows: 328
#     })
# })

# Access a sample
sample = dataset['train'][0]
print(f"Sample ID: {sample['fn']}")
print(f"Sample text: {sample['text'][:200]}...")
```

## 🎯 Use Cases

- **🤖 Language Model Training**: Pre-training or fine-tuning models for chemistry domain with cutting-edge research
- **🔬 Research Intelligence**: Building systems for scientific literature analysis and discovery
- **🔍 Information Retrieval**: Advanced chemistry knowledge base construction from research literature
- **📝 Content Generation**: Automated scientific writing and research synthesis
- **🧠 Domain Adaptation**: Adapting models to cutting-edge chemical research and terminology

## ⚠️ Limitations & Considerations

- **Language**: Primarily English (monolingual dataset)
- **Scope**: Focused on published research; may include technical jargon and advanced concepts
- **Quality**: Variable quality across sources; some OCR errors possible in older papers
- **Bias**: Reflects biases present in scientific publishing and academic literature
- **License**: No derivatives allowed due to CC BY-NC-ND 4.0 license
- **Recency**: Content reflects publication dates; cutting-edge developments may not be included

## 🛠️ Data Processing Pipeline

1. **Collection**: Automated scraping from academic repositories and databases
2. **Filtering**: BERT-based classification for chemistry relevance
3. **Extraction**: PDF processing with PaperScraper and Nougat OCR
4. **Quality Control**: Automated filtering and expert validation
5. **Standardization**: Consistent formatting and metadata extraction
6. **Validation**: Train/validation/test splits and quality checks

## 🏗️ ChemPile Collection

This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences.

### Collection Overview

- **📊 Scale**: 75+ billion tokens across multiple modalities
- **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, and molecular images
- **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature
- **🔬 Curation**: Extensive expert curation and validation
- **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation
- **🌐 Availability**: Openly released via Hugging Face

## 📄 Citation

If you use this dataset in your research, please cite:

```bibtex
@article{mirza2025chempile0,
  title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
  author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others},
  year = {2025},
  journal = {arXiv preprint arXiv:2505.12534}
}
```

## 👥 Contact & Support

- **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
- **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-paper)
- **Issues**: Please report data issues or questions via the Hugging Face dataset page

---

<div align="center">

![ChemPile Logo](CHEMPILE_LOGO.png)

<i>Part of the ChemPile project - Advancing AI for Chemical Sciences</i>

</div>
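The EuroPMC extraction method mentions analysing the first five 512-token chunks of each document with 50-token overlaps. The card does not show how the chunks were cut, so the following is only a minimal sketch of one plausible reading: each chunk starts `chunk_size - overlap` tokens after the previous one. The function name `first_chunks` and the placeholder tokens are hypothetical, and the real pipeline presumably used the classifier's own tokenizer.

```python
from typing import List, Sequence


def first_chunks(
    tokens: Sequence[str],
    chunk_size: int = 512,
    overlap: int = 50,
    max_chunks: int = 5,
) -> List[List[str]]:
    """Return up to `max_chunks` leading windows of `tokens`.

    Assumption: consecutive windows share `overlap` tokens, so the
    stride between window starts is `chunk_size - overlap`.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        if len(chunks) == max_chunks:
            break
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Placeholder tokens standing in for a tokenized article (hypothetical data).
tokens = [f"tok{i}" for i in range(3000)]
chunks = first_chunks(tokens)
print(len(chunks), [len(c) for c in chunks])
```

Under this reading a long document contributes at most 5 × 512 tokens to the classifier, while a short one yields fewer and possibly shorter chunks.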

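The EuroPMC subsets expose a `topic` label and a `confidence` score per record, which suggests a simple post-filter for users who want only high-confidence chemistry documents. The sketch below works on any iterable of record dictionaries. The helper name `keep_confident`, the 0.9 threshold, and the demo records are illustrative assumptions, and it assumes `confidence` is numeric.

```python
from typing import Any, Dict, Iterable, Iterator


def keep_confident(
    records: Iterable[Dict[str, Any]],
    topic: str = "Chemistry",
    min_confidence: float = 0.9,  # hypothetical threshold, not from the dataset card
) -> Iterator[Dict[str, Any]]:
    """Yield records whose `topic` matches and whose `confidence` is high enough."""
    for record in records:
        confidence = record.get("confidence")
        if confidence is None:
            continue  # skip records without a score
        if record.get("topic") == topic and float(confidence) >= min_confidence:
            yield record


# Hand-made stand-in records using the documented field names (not real data).
demo = [
    {"pmcid": "PMC-demo-1", "topic": "Chemistry", "confidence": 0.97, "text": "..."},
    {"pmcid": "PMC-demo-2", "topic": "Biology", "confidence": 0.99, "text": "..."},
    {"pmcid": "PMC-demo-3", "topic": "Chemistry", "confidence": 0.55, "text": "..."},
]
kept = [r["pmcid"] for r in keep_confident(demo)]
print(kept)
```

With the real data, the same generator could be fed a split loaded through `load_dataset("jablonkagroup/chempile-paper", name="euro_pmc_chemistry_papers-default", split="train", streaming=True)`, assuming this configuration has a `train` split like the ArXiv example; streaming avoids downloading the full 10B-token subset before filtering.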

Provider: maas

Created: 2025-05-28
