five

chempile-instruction

收藏
魔搭社区2025-12-05 更新2025-08-02 收录
下载链接:
https://modelscope.cn/datasets/jablonkagroup/chempile-instruction
下载链接
链接失效反馈
官方服务:
资源简介:
# ChemPile-Instruction <div align="center"> ![ChemPile Logo](CHEMPILE_LOGO.png) [![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/chempile-instruction) [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) [![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2505.12534) [![Website](https://img.shields.io/badge/🌐-Website-green)](https://chempile.lamalab.org/) *A comprehensive instruction tuning dataset for chemistry LLMs with multi-turn conversations and diverse reasoning tasks* </div> ## 📋 Dataset Summary ChemPile-Instruction is a text-only dataset designed for instruction tuning of Large Language Models (LLMs) in the field of chemistry. It contains high-quality multi-turn conversations, each rephrased from different educational, scientific, and reasoning sources using diverse prompting strategies. The conversations were generated using `gpt-4o-mini-2024-07-18` with various prompts to ensure a diverse range of topics and conversational styles. ### 📊 Dataset Statistics | Configuration | Description | Content Source | |---------------|-------------|----------------| | chempile-education | Educational conversations | Textbooks and online courses | | chempile-paper-100m | Research paper discussions | Academic papers and publications | | chempile-reasoning | Chemistry reasoning tasks | Logic and deduction problems | ## 🗂️ Dataset Configurations The dataset includes three distinct subsets available as Hugging Face configurations: - `chempile-education`: Educational conversations suitable for learners at various levels - `chempile-paper-100m`: Research-focused discussions for advanced learners and researchers - `chempile-reasoning`: Reasoning tasks to improve logical thinking skills in chemistry ## 📜 License All content is released under the **CC BY 4.0** license, which allows for: - ✅ Commercial and non-commercial use - ✅ Sharing and redistribution - ✅ Adaptation and modification - ⚠️ Attribution required ## 📖 Dataset Details ### 🎯 Data Fields Each conversation in the dataset contains: - **`messages`** (list): Conversation messages in [LiteLLM format](https://docs.litellm.ai/docs/completion/prompt_formatting) - `role` (str): Message sender role ("user" or "assistant") - `content` (str): Message content - **`first_tag`** (list): Required skills for the conversation: - `requires-knowledge`: Factual or domain-specific knowledge - `requires-calculation`: Mathematical or computational reasoning - `requires-reasoning`: Logical or deductive reasoning - **`second_tag`** (list): Chemistry subdomains covered: - `Analytical Chemistry` - `General Chemistry` - `Inorganic Chemistry` - `Materials Science` - `Organic Chemistry` - `Physical Chemistry` - `Technical Chemistry` - **`origin`** (dict): Generation metadata: - `dataset`: Original source dataset name - `config`: Configuration used for generation - `split`: Original dataset split - `prompt`: Prompt type used (`engaging`, `hard`, `wiki`, `none`) ### 🎨 Generation Strategy The dataset uses four distinct prompting approaches based on [Pieler et al.](https://doi.org/10.48550/arXiv.2410.20796): - **🎪 Engaging**: Elicits detailed explanations and insights with engaging tone - **🧠 Hard**: Enforces esoteric, complex vocabulary for advanced concepts - **📚 Wiki**: Provides encyclopedic, factual responses similar to Wikipedia - **🔄 None**: Natural model responses without specific style constraints This diversity ensures comprehensive coverage of chemistry topics and conversational styles. ### 📊 Quality Control - Based on [ChemBench work](https://chembench.lamalab.org/) taxonomies - Multi-turn conversation structure for contextual learning - Diverse prompting strategies for varied response styles - Comprehensive skill and subdomain tagging system ## 🚀 Quick Start ```python from datasets import load_dataset, get_dataset_config_names # Print available configs for the dataset configs = get_dataset_config_names("jablonkagroup/chempile-instruction") print(f"Available configs: {configs}") # Available configs: ['chempile-education', 'chempile-paper-100m', 'chempile-reasoning'] dataset = load_dataset("jablonkagroup/chempile-instruction", name=configs[0]) # Loading config: chempile-education print(dataset) # DatasetDict({ # train: Dataset({ # features: ['first_tag', 'second_tag', 'origin', 'messages'], # num_rows: 60171 # }) # test: Dataset({ # features: ['first_tag', 'second_tag', 'origin', 'messages'], # num_rows: 3343 # }) # val: Dataset({ # features: ['first_tag', 'second_tag', 'origin', 'messages'], # num_rows: 3344 # }) # }) split_name = list(dataset.keys())[0] sample = dataset[split_name][0] # print(sample) # { # 'first_tag': [], # 'second_tag': [], # 'origin': { # 'config': 'LibreText_Chemistry-default', # 'dataset': 'jablonkagroup/chempile-education', # 'prompt_type': 'engaging', # 'split': 'train' # }, # 'messages': [ # { # 'content': 'Can you explain what a hydrogen bond is in chemistry?', # 'role': 'user' # }, # { # 'content': 'Sure!... # 'role': 'assistant' # } # ... more messages # ] # } ``` ## 🎯 Use Cases - **🤖 Instruction Tuning**: Fine-tuning LLMs for chemistry-specific conversations - **💬 Conversational AI**: Building chemistry chatbots and virtual assistants - **📚 Educational Systems**: Developing interactive chemistry tutoring platforms - **🔬 Research Support**: Creating AI assistants for chemistry researchers - **🧠 Reasoning Enhancement**: Training models for chemical problem-solving - **📝 Multi-turn Dialogue**: Learning contextual conversation patterns ## ⚠️ Limitations & Considerations - **Language**: English only (monolingual dataset) - **Generation**: AI-generated content may contain inaccuracies - **Scope**: Covers educational and research chemistry but not industrial applications - **Bias**: May reflect biases from source materials and generation models - **Context**: Multi-turn format requires proper conversation handling - **Evaluation**: Generated content should be validated for factual accuracy ## 🛠️ Data Processing Pipeline 1. **Source Collection**: Gathering from educational, scientific, and reasoning sources 2. **Prompt Design**: Creating four distinct prompting strategies for diversity 3. **Generation**: Using `gpt-4o-mini-2024-07-18` for conversation creation 4. **Tagging**: Applying skill and subdomain classification systems 5. **Quality Control**: Filtering and validation of generated conversations 6. **Formatting**: Standardizing to LiteLLM conversation format 7. **Splitting**: Creating train/validation/test splits for evaluation ## 🏗️ ChemPile Collection This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences. ### Collection Overview - **📊 Scale**: 75+ billion tokens across multiple modalities - **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, and molecular images - **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature - **🔬 Curation**: Extensive expert curation and validation - **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation - **🌐 Availability**: Openly released via Hugging Face ## 📄 Citation If you use this dataset in your research, please cite: ```bibtex @article{mirza2025chempile0, title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models}, author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others}, year = {2025}, journal = {arXiv preprint arXiv:2505.12534} } ``` ## 👥 Contact & Support - **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534) - **Website**: [ChemPile Project](https://chempile.lamalab.org/) - **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-instruction) - **Issues**: Please report data issues or questions via the Hugging Face dataset page --- <div align="center"> ![LamaLab logo](png-file.png) <i>Part of the ChemPile project - Advancing AI for Chemical Sciences</i> </div>

# ChemPile-Instruction <div align="center"> ![ChemPile 标识](CHEMPILE_LOGO.png) [![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/chempile-instruction) [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) [![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2505.12534) [![Website](https://img.shields.io/badge/🌐-Website-green)](https://chempile.lamalab.org/) *面向化学领域大语言模型(Large Language Model, LLM)的多轮对话与多样化推理任务指令微调综合数据集* </div> ## 📋 数据集摘要 ChemPile-Instruction是专为化学领域大语言模型指令微调设计的纯文本数据集。该数据集包含高质量多轮对话,所有对话均通过多样化提示策略,从不同教育、科研及推理类源文本改写而来。对话由`gpt-4o-mini-2024-07-18`结合各类提示生成,以确保覆盖多样的主题与对话风格。 ### 📊 数据集统计 | 配置名称 | 描述 | 内容来源 | |---------------|-------------|----------------| | chempile-education | 教育类对话 | 教科书与在线课程 | | chempile-paper-100m | 科研论文讨论 | 学术论文与出版物 | | chempile-reasoning | 化学推理任务 | 逻辑与演绎问题 | ## 🗂️ 数据集配置 该数据集包含三个独立子集,可通过Hugging Face配置调用: - `chempile-education`:面向各水平学习者的教育类对话 - `chempile-paper-100m`:面向进阶学习者与科研人员的科研主题讨论 - `chempile-reasoning`:用于提升化学领域逻辑思维能力的推理任务 ## 📜 授权协议 所有内容采用**CC BY 4.0(知识共享署名4.0)**协议发布,允许: - ✅ 商业与非商业使用 - ✅ 分享与再分发 - ✅ 改编与修改 - ⚠️ 需注明原作者 ## 📖 数据集详情 ### 🎯 数据字段 每个对话包含以下字段: - **`messages`**(列表):采用[LiteLLM格式](https://docs.litellm.ai/docs/completion/prompt_formatting)的对话消息列表 - `role`(字符串):消息发送者角色("user"即用户或"assistant"即助手) - `content`(字符串):消息内容 - **`first_tag`**(列表):对话所需技能标签: - `requires-knowledge`:需事实性或领域专业知识 - `requires-calculation`:需数学或计算推理 - `requires-reasoning`:需逻辑或演绎推理 - **`second_tag`**(列表):覆盖的化学子领域: - `Analytical Chemistry`(分析化学) - `General Chemistry`(普通化学) - `Inorganic Chemistry`(无机化学) - `Materials Science`(材料科学) - `Organic Chemistry`(有机化学) - `Physical Chemistry`(物理化学) - `Technical Chemistry`(技术化学) - **`origin`**(字典):生成元数据: - `dataset`:原始源数据集名称 - `config`:生成所用配置 - `split`:原始数据集划分 - `prompt`:所用提示类型(`engaging`即互动式、`hard`即高阶难度、`wiki`即维基百科式、`none`即无约束) ### 🎨 生成策略 基于[Pieler等人的研究](https://doi.org/10.48550/arXiv.2410.20796),数据集采用四种差异化提示方法: - **🎪 互动式(Engaging)**:以互动语调引导详细解释与见解 - **🧠 高阶难度(Hard)**:针对进阶概念使用晦涩、专业的词汇 - **📚 维基百科式(Wiki)**:提供类似维基百科的百科式事实性回复 - **🔄 无约束(None)**:无特定风格限制的自然模型回复 这种多样性确保了化学主题与对话风格的全面覆盖。 ### 📊 质量控制 - 基于[ChemBench](https://chembench.lamalab.org/)分类体系 - 采用多轮对话结构以支持情境化学习 - 多样化提示策略实现多样化回复风格 - 完善的技能与子领域标签系统 ## 🚀 快速上手 python from datasets import load_dataset, get_dataset_config_names # 打印数据集可用的配置项 configs = get_dataset_config_names("jablonkagroup/chempile-instruction") print(f"可用配置项: {configs}") # 输出示例: 可用配置项: ['chempile-education', 'chempile-paper-100m', 'chempile-reasoning'] dataset = load_dataset("jablonkagroup/chempile-instruction", name=configs[0]) # 加载配置: chempile-education print(dataset) # 输出示例: # DatasetDict({ # train: Dataset({ # features: ['first_tag', 'second_tag', 'origin', 'messages'], # num_rows: 60171 # }) # test: Dataset({ # features: ['first_tag', 'second_tag', 'origin', 'messages'], # num_rows: 3343 # }) # val: Dataset({ # features: ['first_tag', 'second_tag', 'origin', 'messages'], # num_rows: 3344 # }) # }) split_name = list(dataset.keys())[0] sample = dataset[split_name][0] # print(sample) # 输出示例: # { # 'first_tag': [], # 'second_tag': [], # 'origin': { # 'config': 'LibreText_Chemistry-default', # 'dataset': 'jablonkagroup/chempile-education', # 'prompt_type': 'engaging', # 'split': 'train' # }, # 'messages': [ # { # 'content': 'Can you explain what a hydrogen bond is in chemistry?', # 'role': 'user' # }, # { # 'content': 'Sure!... # 'role': 'assistant' # } # ... 更多消息 # ] # } ## 🎯 应用场景 - **🤖 指令微调**:针对化学专属对话场景微调大语言模型 - **💬 对话式AI**:构建化学聊天机器人与虚拟助手 - **📚 教育系统**:开发交互式化学辅导平台 - **🔬 科研辅助**:为化学科研人员打造AI助手 - **🧠 推理能力提升**:训练模型完成化学问题求解 - **📝 多轮对话**:学习情境化对话模式 ## ⚠️ 局限性与注意事项 - **语言**:仅支持英语(单语种数据集) - **生成内容**:AI生成内容可能存在不准确之处 - **覆盖范围**:涵盖教育与科研类化学内容,但未涉及工业应用场景 - **偏差**:可能反映源材料与生成模型自带的偏差 - **上下文要求**:多轮对话格式需正确处理对话上下文 - **评估建议**:生成内容需经过事实准确性验证 ## 🛠️ 数据处理流程 1. **源数据收集**:从教育、科研及推理类数据源采集内容 2. **提示词设计**:构建四种差异化提示策略以实现多样性 3. **对话生成**:使用`gpt-4o-mini-2024-07-18`生成对话内容 4. **标签标注**:应用技能与子领域分类体系 5. **质量控制**:对生成的对话进行筛选与验证 6. **格式标准化**:统一转换为LiteLLM对话格式 7. **数据集划分**:创建训练/验证/测试划分以支持模型评估 ## 🏗️ ChemPile 数据集合集 本数据集属于**ChemPile**合集,这是一个开源综合数据集,包含超过750亿个经过精选的化学数据Token,用于训练与评估化学科学领域的通用模型。 ### 合集概览 - **📊 规模**:覆盖多模态的750亿+ Token - **🧬 模态**:结构化表示(SMILES、SELFIES、IUPAC、InChI)、科学文本、可执行代码与分子图像 - **🎯 设计理念**:融合基础教育知识与专业科研文献 - **🔬 精选处理**:经过严格的专家筛选与验证 - **📈 基准测试**:标准化的训练/验证/测试划分以支持可靠评估 - **🌐 开放获取**:通过Hugging Face平台公开发布 ## 📄 引用规范 若您在研究中使用该数据集,请引用以下文献: bibtex @article{mirza2025chempile0, title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models}, author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others}, year = {2025}, journal = {arXiv preprint arXiv:2505.12534} } ## 👥 联系与支持 - **学术论文**:[arXiv:2505.12534](https://arxiv.org/abs/2505.12534) - **项目官网**:[ChemPile项目页](https://chempile.lamalab.org/) - **数据集页面**:[Hugging Face数据集页](https://huggingface.co/datasets/jablonkagroup/chempile-instruction) - **问题反馈**:请通过Hugging Face数据集页面提交数据问题或咨询 --- <div align="center"> ![LamaLab 标识](png-file.png) <i>ChemPile项目的一部分——推动化学科学领域的AI发展</i> </div>
提供机构:
maas
创建时间:
2025-07-27
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作