
chempile-code

ModelScope community, updated 2025-11-27, indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/jablonkagroup/chempile-code
Description:
# ChemPile-Code

<div align="center">

![ChemPile Logo](CHEMPILE_LOGO.png)

[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/chempile-code)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2505.12534)
[![Website](https://img.shields.io/badge/🌐-Website-green)](https://chempile.lamalab.org/)

*A comprehensive collection of filtered scientific code from chemistry, biology, and materials science*

</div>

## 📋 Dataset Summary

ChemPile-Code includes filtered code from popular datasets such as the Stack and GitHub-code. It is designed to provide a rich source of scientific code from fields such as chemistry, biology, and materials science. The dataset is part of the ChemPile project, which aims to create a comprehensive collection of chemistry code for training language models.

The filtering process is keyword-based, focusing on packages and libraries relevant to chemistry, biology, and materials science. The keywords include simulation packages such as LAMMPS, GROMACS, and OpenMM, libraries such as RDKit, ASE, and MDTraj, and visualization programs such as VMD and PyMOL. To avoid duplicates, exact hash matching was used to filter out identical code snippets.
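The keyword filtering and exact-hash deduplication described in the summary can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the keyword list only contains the packages named above, the case-insensitive whole-word matching is an assumption, and SHA-256 is assumed as the hash function. The helper names `matched_keywords`, `text_hash`, and `filter_and_deduplicate` are hypothetical.

```python
import hashlib
import re

# Keywords named in the dataset summary; the real keyword list is assumed to be longer.
KEYWORDS = ["LAMMPS", "GROMACS", "OpenMM", "RDKit", "ASE", "MDTraj", "VMD", "PyMOL"]

# Assumption: keywords are matched case-insensitively as whole words.
_PATTERNS = {
    kw: re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE) for kw in KEYWORDS
}


def matched_keywords(text):
    """Return the keywords that occur in a code snippet, in KEYWORDS order."""
    return [kw for kw, pattern in _PATTERNS.items() if pattern.search(text)]


def text_hash(text):
    """Hash the exact snippet text (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_and_deduplicate(snippets):
    """Keep snippets that match at least one keyword and drop exact duplicates."""
    seen = set()
    kept = []
    for text in snippets:
        keywords = matched_keywords(text)
        if not keywords:
            continue  # not relevant to chemistry, biology, or materials science
        digest = text_hash(text)
        if digest in seen:
            continue  # byte-identical snippet already kept
        seen.add(digest)
        kept.append({"text": text, "keyword": keywords, "text_hash": digest})
    return kept


# Hypothetical input snippets for illustration only.
snippets = [
    "from rdkit import Chem\nmol = Chem.MolFromSmiles('CCO')\n",
    "import numpy as np\nprint(np.arange(3))\n",
    "from rdkit import Chem\nmol = Chem.MolFromSmiles('CCO')\n",
    "from ase import Atoms\nimport mdtraj\n",
]
records = filter_and_deduplicate(snippets)
for record in records:
    print(record["keyword"], record["text_hash"][:12])
```

Exact hashing only removes byte-identical snippets, so copies that differ in whitespace or comments would survive this step.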
### 📊 Dataset Statistics

| Subset | Tokens | Documents | Description |
|--------|--------|-----------|-------------|
| CodeParrot GitHub-Code Chemistry Python | 1.8B | 208K | Python code from GitHub repositories |
| StarCoder Chemistry | 16.1B | 2.06M | Python code from the Stack dataset |
| **Total** | **~17.9B** | **~2.27M** | Scientific code snippets |

## 🗂️ Dataset Configurations

The dataset includes different subsets available as Hugging Face configurations:

- `codeparrot_github-code-chemistry-python-default`
- `starcoder-chemistry-default`

## 📜 License

All content is released under the **AGPL-3.0** license, which allows for:

- ✅ Free use and distribution
- ✅ Commercial use
- ✅ Modification and derivatives
- ⚠️ Attribution required

However, the dataset combines code under different licenses. The config `codeparrot_github-code-chemistry-python-default` is designed so that it is possible to filter the dataset by license. This config contains code under the following licenses:

- MIT
- GPL-3.0
- BSD-3-Clause
- GPL-2.0
- Apache-2.0
- LGPL-2.1
- AGPL-2.0
- AGPL-3.0
- LGPL-3.0
- MPL-2.0
- BSD-2-Clause

## 📖 Dataset Details

### 📚 CodeParrot

**Source**: CodeParrot is a subset of GitHub code that we specifically filtered for chemistry-related content

**Coverage**: Python code from the GitHub Code dataset

**Extraction Method**: Keyword-based filtering focusing on chemistry, biology, and materials science packages and libraries

**Fields**:

- `text`: The code snippet
- `repo_name`: The name of the repository where the code snippet was found
- `path`: The path to the file within the repository
- `language`: The programming language of the code snippet
- `license`: The license of the repository
- `size`: The size of the code snippet in bytes
- `keyword`: A list of keywords that were used to filter the code snippet
- `text_hash`: A hash of the code snippet to avoid duplicates

**Statistics**: 208K code snippets with a total of over 1.8B tokens

### ⚗️ StarCoder
**Source**: StarCoder is a subset of the Stack dataset that we specifically filtered for chemistry-related content

**Coverage**: Python code from the Stack dataset

**Extraction Method**: Keyword-based filtering with exact hash matching to avoid duplicates

**Fields**:

- `text`: The code snippet
- `repo_name`: The name of the repository where the code snippet was found
- `keyword`: A list of keywords that were used to filter the code snippet
- `text_hash`: A hash of the code snippet to avoid duplicates

**Statistics**: 2.06M code snippets with a total of over 16.1B tokens

## 🚀 Quick Start

```python
from datasets import load_dataset, get_dataset_config_names

# Print available configs for the dataset
configs = get_dataset_config_names("jablonkagroup/chempile-code")
print(f"Available configs: {configs}")
# Available configs: ['codeparrot_github-code-chemistry-python-default', 'starcoder-chemistry-default']

dataset = load_dataset("jablonkagroup/chempile-code", name=configs[0])
# Loading config: codeparrot_github-code-chemistry-python-default

print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'repo_name', 'path', 'language', 'license', 'size', 'keyword', 'text_hash'],
#         num_rows: 186878
#     })
#     test: Dataset({
#         features: ['text', 'repo_name', 'path', 'language', 'license', 'size', 'keyword', 'text_hash'],
#         num_rows: 10383
#     })
#     val: Dataset({
#         features: ['text', 'repo_name', 'path', 'language', 'license', 'size', 'keyword', 'text_hash'],
#         num_rows: 10382
#     })
# })

split_name = list(dataset.keys())[0]
sample = dataset[split_name][0]
print(sample)
# {
#     'text': 'import moogli except Exception as e:...
#     'repo_name': 'BhallaLab/moose',
#     'path': 'moose-examples/paper-2015/Fig2_elecModels/Fig2C.py',
#     'language': 'Python',
#     'license': 'gpl-3.0',
#     'size': 14223,
#     'keyword': ['MOOSE', 'NEURON'],
#     'text_hash': '5eb6a5a439a675762a02c12cdff996e6a0d98f6ee874773cba2951727562aac5'
# }
```

## 🎯 Use Cases

- **🤖 Code Generation**: Training models for scientific code generation and completion
- **🔬 Scientific Computing**: Building systems for computational chemistry and materials science
- **🔍 Code Search**: Advanced scientific code repository search and analysis
- **📝 Documentation**: Automated code documentation and analysis for scientific software
- **🧠 Domain Adaptation**: Adapting models to scientific computing paradigms and libraries

## ⚠️ Limitations & Considerations

- **Language**: Primarily Python code (monolingual dataset)
- **Scope**: Focused on scientific computing; may include domain-specific jargon and advanced concepts
- **Quality**: Variable quality across sources; some code may be incomplete or contain errors
- **Bias**: Reflects biases present in open-source scientific software development
- **License**: Mixed licenses from source repositories; check the individual `license` field
- **Duplicates**: Hash-based deduplication applied, but some semantic duplicates may remain

## 🛠️ Data Processing Pipeline

1. **Collection**: Automated extraction from GitHub-code and Stack datasets
2. **Filtering**: Keyword-based filtering for chemistry, biology, and materials science relevance
3. **Deduplication**: Exact hash matching to remove identical code snippets
4. **Quality Control**: Automated filtering and validation
5. **Standardization**: Consistent formatting and metadata extraction
6. **Validation**: Train/validation/test splits and quality checks

## 🏗️ ChemPile Collection

This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences.

### Collection Overview

- **📊 Scale**: 75+ billion tokens across multiple modalities
- **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, and molecular images
- **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature
- **🔬 Curation**: Extensive expert curation and validation
- **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation
- **🌐 Availability**: Openly released via Hugging Face

## 📄 Citation

If you use this dataset in your research, please cite:

```bibtex
@article{mirza2025chempile0,
  title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
  author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others},
  year = {2025},
  journal = {arXiv preprint arXiv:2505.12534}
}
```

## 👥 Contact & Support

- **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
- **Website**: [ChemPile Project](https://chempile.lamalab.org/)
- **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-code)
- **Issues**: Please report data issues or questions via the Hugging Face dataset page

---

<div align="center">

![LamaLab logo](png-file.png)

<i>Advancing the evaluation of AI systems in chemistry and materials science</i>

</div>
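Because the `codeparrot_github-code-chemistry-python-default` config carries a per-row `license` field, license-based filtering could look roughly like the sketch below. It works on plain dictionaries so that it runs without downloading anything. The set of licenses treated as permissive is a policy choice made for this illustration only, the license strings are assumed to be lowercase identifiers like the `'gpl-3.0'` value in the sample record, and the helper names and example rows are hypothetical.

```python
# Assumption: which licenses count as "permissive" is a choice made for this sketch.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}


def is_permissive(license_name):
    """Return True if the license string is in the permissive set (case-insensitive)."""
    return license_name.strip().lower() in PERMISSIVE_LICENSES


def filter_by_license(rows):
    """Keep only rows whose `license` field is in the permissive set."""
    return [row for row in rows if is_permissive(row["license"])]


# Hypothetical rows that mimic the `repo_name` and `license` fields of the dataset.
rows = [
    {"repo_name": "example/a", "license": "mit"},
    {"repo_name": "example/b", "license": "gpl-3.0"},
    {"repo_name": "example/c", "license": "Apache-2.0"},
]
kept = filter_by_license(rows)
print([row["repo_name"] for row in kept])
```

With a split loaded through the `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `dataset["train"].filter(lambda row: is_permissive(row["license"]))`.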

Provider:
maas
Created:
2025-05-28