chempile-instruction
Source: ModelScope community · Updated 2025-12-05 · Indexed 2025-08-02
Download link: https://modelscope.cn/datasets/jablonkagroup/chempile-instruction
# ChemPile-Instruction
<div align="center">

[Dataset on Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-instruction)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
[Paper: arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
[Website](https://chempile.lamalab.org/)
*A comprehensive instruction tuning dataset for chemistry LLMs with multi-turn conversations and diverse reasoning tasks*
</div>
## 📋 Dataset Summary
ChemPile-Instruction is a text-only dataset designed for instruction tuning of Large Language Models (LLMs) in the field of chemistry. It contains high-quality multi-turn conversations, each rephrased from different educational, scientific, and reasoning sources using diverse prompting strategies. The conversations were generated using `gpt-4o-mini-2024-07-18` with various prompts to ensure a diverse range of topics and conversational styles.
### 📊 Dataset Statistics
| Configuration | Description | Content Source |
|---------------|-------------|----------------|
| chempile-education | Educational conversations | Textbooks and online courses |
| chempile-paper-100m | Research paper discussions | Academic papers and publications |
| chempile-reasoning | Chemistry reasoning tasks | Logic and deduction problems |
## 🗂️ Dataset Configurations
The dataset includes three distinct subsets available as Hugging Face configurations:
- `chempile-education`: Educational conversations suitable for learners at various levels
- `chempile-paper-100m`: Research-focused discussions for advanced learners and researchers
- `chempile-reasoning`: Reasoning tasks to improve logical thinking skills in chemistry
## 📜 License
All content is released under the **CC BY 4.0** license, which allows for:
- ✅ Commercial and non-commercial use
- ✅ Sharing and redistribution
- ✅ Adaptation and modification
- ⚠️ Attribution required
## 📖 Dataset Details
### 🎯 Data Fields
Each conversation in the dataset contains:
- **`messages`** (list): Conversation messages in [LiteLLM format](https://docs.litellm.ai/docs/completion/prompt_formatting)
  - `role` (str): Message sender role ("user" or "assistant")
  - `content` (str): Message content
- **`first_tag`** (list): Required skills for the conversation:
  - `requires-knowledge`: Factual or domain-specific knowledge
  - `requires-calculation`: Mathematical or computational reasoning
  - `requires-reasoning`: Logical or deductive reasoning
- **`second_tag`** (list): Chemistry subdomains covered:
  - `Analytical Chemistry`
  - `General Chemistry`
  - `Inorganic Chemistry`
  - `Materials Science`
  - `Organic Chemistry`
  - `Physical Chemistry`
  - `Technical Chemistry`
- **`origin`** (dict): Generation metadata:
  - `dataset`: Original source dataset name
  - `config`: Configuration used for generation
  - `split`: Original dataset split
  - `prompt_type`: Prompt type used (`engaging`, `hard`, `wiki`, `none`)
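The field layout above can be checked mechanically before training. The following is a minimal sketch, not part of the dataset tooling: `validate_record` is a hypothetical helper name, and the allowed tag values are copied from the lists above on the assumption that they are exhaustive.

```python
from typing import Any

# Allowed values, copied from the field description above (assumed exhaustive).
ALLOWED_ROLES = {"user", "assistant"}
SKILL_TAGS = {"requires-knowledge", "requires-calculation", "requires-reasoning"}
SUBDOMAIN_TAGS = {
    "Analytical Chemistry", "General Chemistry", "Inorganic Chemistry",
    "Materials Science", "Organic Chemistry", "Physical Chemistry",
    "Technical Chemistry",
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the record looks well-formed."""
    problems = []

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"messages[{i}] is not a dict")
                continue
            if msg.get("role") not in ALLOWED_ROLES:
                problems.append(f"messages[{i}] has unexpected role {msg.get('role')!r}")
            if not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}] content is not a string")

    # Tag lists may be empty, as in the sample record shown in the Quick Start.
    for tag in record.get("first_tag", []):
        if tag not in SKILL_TAGS:
            problems.append(f"unknown skill tag {tag!r}")
    for tag in record.get("second_tag", []):
        if tag not in SUBDOMAIN_TAGS:
            problems.append(f"unknown subdomain tag {tag!r}")

    if not isinstance(record.get("origin"), dict):
        problems.append("origin must be a dict")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to summarize issues across many records.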
### 🎨 Generation Strategy
The dataset uses four distinct prompting approaches based on [Pieler et al.](https://doi.org/10.48550/arXiv.2410.20796):
- **🎪 Engaging**: Elicits detailed explanations and insights in an engaging tone
- **🧠 Hard**: Enforces esoteric, complex vocabulary for advanced concepts
- **📚 Wiki**: Provides encyclopedic, factual responses similar to Wikipedia
- **🔄 None**: Natural model responses without specific style constraints
This diversity ensures comprehensive coverage of chemistry topics and conversational styles.
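To see how the four prompting styles are distributed in a subset, or to keep only one style, the `origin` metadata can be inspected per record. This is a rough sketch that works on plain dictionaries; the helper names are hypothetical, and it assumes the prompt style is stored under the `prompt_type` key, as it appears in the Quick Start sample record.

```python
from collections import Counter

PROMPT_TYPES = ("engaging", "hard", "wiki", "none")


def prompt_type_of(record):
    """Read the prompt style from a record's origin metadata (None if absent)."""
    # Assumption: the key is `prompt_type`, as in the Quick Start sample record.
    return record.get("origin", {}).get("prompt_type")


def count_prompt_types(records):
    """Count how many records were generated with each prompt style."""
    return Counter(prompt_type_of(r) for r in records)


def select_by_prompt_type(records, wanted):
    """Keep only the records generated with the given prompt style."""
    if wanted not in PROMPT_TYPES:
        raise ValueError(f"unknown prompt type: {wanted!r}")
    return [r for r in records if prompt_type_of(r) == wanted]
```

With a loaded split, the same selection can be written as `dataset["train"].filter(lambda r: r["origin"]["prompt_type"] == "wiki")`.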
### 📊 Quality Control
- Built on taxonomies from the [ChemBench](https://chembench.lamalab.org/) work
- Multi-turn conversation structure for contextual learning
- Diverse prompting strategies for varied response styles
- Comprehensive skill and subdomain tagging system
## 🚀 Quick Start
```python
from datasets import load_dataset, get_dataset_config_names

# Print available configs for the dataset
configs = get_dataset_config_names("jablonkagroup/chempile-instruction")
print(f"Available configs: {configs}")
# Available configs: ['chempile-education', 'chempile-paper-100m', 'chempile-reasoning']

# Load the first config (chempile-education)
dataset = load_dataset("jablonkagroup/chempile-instruction", name=configs[0])
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['first_tag', 'second_tag', 'origin', 'messages'],
#         num_rows: 60171
#     })
#     test: Dataset({
#         features: ['first_tag', 'second_tag', 'origin', 'messages'],
#         num_rows: 3343
#     })
#     val: Dataset({
#         features: ['first_tag', 'second_tag', 'origin', 'messages'],
#         num_rows: 3344
#     })
# })

# Take the first record of the first split
split_name = list(dataset.keys())[0]
sample = dataset[split_name][0]
# print(sample)
# {
#     'first_tag': [],
#     'second_tag': [],
#     'origin': {
#         'config': 'LibreText_Chemistry-default',
#         'dataset': 'jablonkagroup/chempile-education',
#         'prompt_type': 'engaging',
#         'split': 'train'
#     },
#     'messages': [
#         {
#             'content': 'Can you explain what a hydrogen bond is in chemistry?',
#             'role': 'user'
#         },
#         {
#             'content': 'Sure!...
#             'role': 'assistant'
#         }
#         ... more messages
#     ]
# }
```
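Once a sample is loaded, it is often easier to read as a transcript than as a nested dictionary. The helper below is an illustrative sketch rather than part of the dataset tooling: `render_conversation` and its `max_chars` parameter are hypothetical names, and the function expects the `messages` list of a record.

```python
def render_conversation(messages, max_chars=None):
    """Format a list of {'role', 'content'} messages as a readable transcript.

    If max_chars is given, each message body is truncated to that many
    characters and suffixed with '...'.
    """
    lines = []
    for msg in messages:
        content = msg["content"]
        if max_chars is not None and len(content) > max_chars:
            content = content[:max_chars] + "..."
        lines.append(f"{msg['role'].upper()}: {content}")
    return "\n".join(lines)
```

For the record loaded above, `print(render_conversation(sample["messages"], max_chars=200))` would print one line per turn, prefixed with `USER:` or `ASSISTANT:`.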
## 🎯 Use Cases
- **🤖 Instruction Tuning**: Fine-tuning LLMs for chemistry-specific conversations
- **💬 Conversational AI**: Building chemistry chatbots and virtual assistants
- **📚 Educational Systems**: Developing interactive chemistry tutoring platforms
- **🔬 Research Support**: Creating AI assistants for chemistry researchers
- **🧠 Reasoning Enhancement**: Training models for chemical problem-solving
- **📝 Multi-turn Dialogue**: Learning contextual conversation patterns
## ⚠️ Limitations & Considerations
- **Language**: English only (monolingual dataset)
- **Generation**: AI-generated content may contain inaccuracies
- **Scope**: Covers educational and research chemistry but not industrial applications
- **Bias**: May reflect biases from source materials and generation models
- **Context**: Multi-turn format requires proper conversation handling
- **Evaluation**: Generated content should be validated for factual accuracy
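One way to handle the multi-turn format is to check that roles alternate and then expand each conversation into one training example per assistant turn, with all earlier messages as context. The sketch below illustrates that idea under the assumption that conversations start with a user turn, as in the Quick Start sample; the function names are hypothetical, and other training setups may prefer to keep conversations whole.

```python
def roles_alternate(messages):
    """True if the conversation is non-empty, starts with 'user', and roles strictly alternate."""
    expected = "user"
    for msg in messages:
        if msg["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return bool(messages)


def to_training_pairs(messages):
    """Return (context, target) pairs, one per assistant turn.

    context is the list of all messages before the assistant turn;
    target is the assistant's reply text.
    """
    pairs = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            pairs.append((messages[:i], msg["content"]))
    return pairs
```

A four-message conversation (user, assistant, user, assistant) yields two pairs: the first with one message of context, the second with three.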
## 🛠️ Data Processing Pipeline
1. **Source Collection**: Gathering from educational, scientific, and reasoning sources
2. **Prompt Design**: Creating four distinct prompting strategies for diversity
3. **Generation**: Using `gpt-4o-mini-2024-07-18` for conversation creation
4. **Tagging**: Applying skill and subdomain classification systems
5. **Quality Control**: Filtering and validation of generated conversations
6. **Formatting**: Standardizing to LiteLLM conversation format
7. **Splitting**: Creating train/validation/test splits for evaluation
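As a quick sanity check on the splitting step, the row counts printed in the Quick Start output for `chempile-education` can be turned into fractions. The snippet below only does that arithmetic; how the splits were actually drawn is not described here, and the helper name is hypothetical.

```python
# Row counts as printed in the Quick Start output for `chempile-education`.
SPLIT_ROWS = {"train": 60171, "test": 3343, "val": 3344}


def split_fractions(rows):
    """Return each split's share of the total number of rows."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


fractions = split_fractions(SPLIT_ROWS)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

The counts sum to 66,858 rows, which works out to roughly a 90/5/5 train/test/validation split.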
## 🏗️ ChemPile Collection
This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences.
### Collection Overview
- **📊 Scale**: 75+ billion tokens across multiple modalities
- **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, and molecular images
- **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature
- **🔬 Curation**: Extensive expert curation and validation
- **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation
- **🌐 Availability**: Openly released via Hugging Face
## 📄 Citation
If you use this dataset in your research, please cite:
```bibtex
@article{mirza2025chempile0,
title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others},
year = {2025},
journal = {arXiv preprint arXiv:2505.12534}
}
```
## 👥 Contact & Support
- **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
- **Website**: [ChemPile Project](https://chempile.lamalab.org/)
- **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-instruction)
- **Issues**: Please report data issues or questions via the Hugging Face dataset page
---
<div align="center">

<i>Part of the ChemPile project - Advancing AI for Chemical Sciences</i>
</div>
Provider: maas
Created: 2025-07-27



