chempile-reasoning
ModelScope community · Updated 2025-12-05 · Indexed 2025-06-14
Download link: https://modelscope.cn/datasets/jablonkagroup/chempile-reasoning
Resource description:
# ChemPile-Reasoning
<div align="center">

[Dataset](https://huggingface.co/datasets/jablonkagroup/chempile-reasoning)
[License: CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
[Paper](https://arxiv.org/abs/2505.12534)
[Website](https://chempile.lamalab.org/)
*A comprehensive collection of reasoning tasks for chemistry, spectral analysis, and scientific understanding*
</div>
## 📋 Dataset Summary
ChemPile-Reasoning is a dataset designed for reasoning tasks in the field of chemistry. It is part of the ChemPile project, which aims to create a comprehensive collection of chemistry-related data for training language models. This dataset includes a variety of reasoning tasks derived from scientific Stack Exchange platforms, as well as reasoning traces from state-of-the-art (SOTA) language models. The dataset is structured to facilitate the evaluation of reasoning capabilities in chemistry-related contexts.
The dataset is organized into subsets, exposed as Hugging Face configurations, each corresponding to one source of scientific material and one format:
- chemistry_stackexchange-completion_0
- chemistry_stackexchange-completion_1
- chemistry_stackexchange-instruction_0
- chemistry_stackexchange-instruction_1
- chemistry_stackexchange-instruction_2
- chemistry_stackexchange-raw_data
- claude-3.5-distilled-spectral-reasoning-default
- mattermodeling_stackexchange-completion_0
- mattermodeling_stackexchange-completion_1
- mattermodeling_stackexchange-instruction_0
- mattermodeling_stackexchange-instruction_1
- mattermodeling_stackexchange-instruction_2
- mattermodeling_stackexchange-raw_data
- physics_stackexchange-completion_0
- physics_stackexchange-completion_1
- physics_stackexchange-instruction_0
- physics_stackexchange-instruction_1
- physics_stackexchange-instruction_2
- physics_stackexchange-raw_data
- spectra_reasoning_deepseek-default
- spectra_reasoning_deepseek_mcq-default
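
The configuration names listed above appear to follow a `<source>-<variant>` pattern, where the variant is the part after the last hyphen. The sketch below groups them on that assumption; `group_configs` is a hypothetical helper and the naming pattern is inferred from the list, not documented by the dataset authors.

```python
from collections import defaultdict

# Configuration names copied from the list above.
CONFIGS = [
    "chemistry_stackexchange-completion_0",
    "chemistry_stackexchange-completion_1",
    "chemistry_stackexchange-instruction_0",
    "chemistry_stackexchange-instruction_1",
    "chemistry_stackexchange-instruction_2",
    "chemistry_stackexchange-raw_data",
    "claude-3.5-distilled-spectral-reasoning-default",
    "mattermodeling_stackexchange-completion_0",
    "mattermodeling_stackexchange-completion_1",
    "mattermodeling_stackexchange-instruction_0",
    "mattermodeling_stackexchange-instruction_1",
    "mattermodeling_stackexchange-instruction_2",
    "mattermodeling_stackexchange-raw_data",
    "physics_stackexchange-completion_0",
    "physics_stackexchange-completion_1",
    "physics_stackexchange-instruction_0",
    "physics_stackexchange-instruction_1",
    "physics_stackexchange-instruction_2",
    "physics_stackexchange-raw_data",
    "spectra_reasoning_deepseek-default",
    "spectra_reasoning_deepseek_mcq-default",
]


def group_configs(names):
    """Map each source to its variants (hypothetical helper).

    Splits on the LAST hyphen, because some source names
    (e.g. 'claude-3.5-distilled-spectral-reasoning') contain hyphens.
    """
    groups = defaultdict(list)
    for name in names:
        source, variant = name.rsplit("-", 1)
        groups[source].append(variant)
    return dict(groups)


groups = group_configs(CONFIGS)
for source, variants in groups.items():
    print(f"{source}: {variants}")
```

Grouping this way makes it easy to pick, for example, only the `raw_data` variant of every Stack Exchange source before calling `load_dataset`.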
All content is released under the CC BY-SA 4.0 license, which permits free use and redistribution with proper attribution.
### 📊 Dataset Statistics
| Subset | Examples | Tokens | Description |
|--------|----------|--------|-------------|
| StackExchange | 71,658 | 21.3B | Reasoning tasks from scientific Stack Exchange platforms |
| Spectra Reasoning | 1,070 | 2.16M | Spectral analysis reasoning traces from SOTA models |
| **Total** | **~72.7K** | **~21.3B** | Scientific reasoning tasks and traces |
## 🗂️ Dataset Configurations
### 🧪 Spectra Reasoning
The Spectra Reasoning subsets of ChemPile-Reasoning contain reasoning tasks derived from spectral data, focusing on the analysis and interpretation of spectral information. There are three configurations. Two are distilled from DeepSeek-R1 reasoning over a set of spectra (proton NMR, carbon NMR, and IR) for a single molecule: one is open-ended and the other is framed as multiple-choice questions (MCQ). The third is distilled from Claude-3.5-Sonnet and covers single-spectrum reasoning (proton NMR only). These subsets are designed to evaluate the reasoning capabilities of language models in spectral analysis.
**DeepSeek Configurations Fields**:
- `smiles`: The SMILES representation of the molecule associated with the spectral data
- `reasoning`: The reasoning trace or explanation provided by the model for the spectral analysis
- `response`: The model's response to the spectral reasoning task
- `response_smiles`: The SMILES representation of the molecule parsed from the model's response
- `correct`: Whether the model's response is correct, based on the spectral data
- `question`: The question or task related to the spectral data that the model is addressing
- `text`: The joined text of the question, reasoning, and response for the model's output
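
As an illustration of how the DeepSeek fields fit together, the sketch below computes the fraction of correct responses and keeps only the correct reasoning traces. The records are made-up placeholders, not rows from the dataset, and the code assumes `correct` holds a boolean-like value, which the field description does not state explicitly.

```python
def accuracy(records):
    """Fraction of records whose `correct` flag is truthy (0.0 if empty)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(bool(r["correct"]) for r in records) / len(records)


def keep_correct(records):
    """Keep only the records the dataset marks as correct."""
    return [r for r in records if r["correct"]]


# Hypothetical placeholder records using the documented field names.
records = [
    {"smiles": "CCO", "response_smiles": "CCO", "correct": True,
     "question": "...", "reasoning": "...", "response": "...", "text": "..."},
    {"smiles": "CC(=O)O", "response_smiles": "CCO", "correct": False,
     "question": "...", "reasoning": "...", "response": "...", "text": "..."},
    {"smiles": "c1ccccc1", "response_smiles": "c1ccccc1", "correct": True,
     "question": "...", "reasoning": "...", "response": "...", "text": "..."},
]

print(f"accuracy: {accuracy(records):.2f}")
print(f"correct traces: {len(keep_correct(records))}")
```

The same functions should work on a split loaded with `load_dataset`, since iterating a split yields one dictionary per example.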
**Claude-3.5-Sonnet Configuration Fields**:
- `prompt`: The prompt or question related to the spectral data
- `extracted_reasoning`: The reasoning trace or explanation with the final answer provided by the model for the spectral analysis
- `text`: The joined text of the prompt and extracted reasoning for the model's output
- `index`: The index of the example in the dataset
**Statistics**: 1.07K examples with a total of over 2.16M tokens
### 📚 StackExchange
The StackExchange subsets of ChemPile-Reasoning contain reasoning tasks derived from scientific Stack Exchange platforms, specifically from the chemistry, matter modeling, and physics sites. Each site is available in six configs: two in completion format, three in instruction format, and one with the raw data. Each format uses its own text templates to structure the data. The completion format is designed for tasks where the model generates a continuation of a given input, while the instruction format pairs an explicit instruction with the expected answer. The raw data config contains the original posts without any templating.
**Completion and Instruction Format Fields**:
- `text`: The original text from the Stack Exchange post
- `input`: The input text for the model, which may include the question or context
- `output`: The expected output or answer to the question
- `answer_choices`: A list of possible answer choices for the question
- `correct_output_index`: The index of the correct answer in the answer_choices list
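
The Quick Start sample later in this card shows an open-ended example with an empty `answer_choices` list and `correct_output_index` set to `None`. A minimal sketch for handling both that case and a multiple-choice case follows; `resolve_answer` is a hypothetical helper, and the multiple-choice record is invented for illustration.

```python
from typing import Optional


def resolve_answer(example: dict) -> Optional[str]:
    """Return the reference answer for an example, or None if there is none.

    Multiple-choice examples: look up `correct_output_index` in
    `answer_choices`. Open-ended examples: fall back to `output`.
    """
    choices = example.get("answer_choices") or []
    index = example.get("correct_output_index")
    if not choices or index is None:
        # Open-ended: an empty `output` string is treated as "no answer".
        return example.get("output") or None
    index = int(index)
    if not 0 <= index < len(choices):
        return None  # index does not point into the choice list
    return choices[index]


# Shaped like the Quick Start sample (open-ended, no choices).
open_ended = {"text": "...", "input": "...", "output": "",
              "answer_choices": [], "correct_output_index": None}

# Invented multiple-choice record for illustration only.
mcq = {"text": "...", "input": "...", "output": "B",
       "answer_choices": ["A", "B", "C"], "correct_output_index": 1}

print(resolve_answer(open_ended))
print(resolve_answer(mcq))
```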
**Raw Data Configuration Fields**:
- `title`: The title of the Stack Exchange post
- `q`: The question text from the Stack Exchange post
- `a`: The answer text from the Stack Exchange post
- `split`: The split of the dataset (train, test, or validation)
- `index`: The index of the post in the dataset
- `text`: The joined text of the title, question, and answer for the post
**Statistics**: 71,658 examples with a total of over 21.3B tokens
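
For the raw data configs, the `split` field can be used to partition records, and `text` is described as the join of title, question, and answer. The sketch below shows both on invented records; the blank-line separator in `join_post` is an assumption, since the card does not say how the parts are joined.

```python
from collections import Counter


def split_counts(records):
    """Count records per value of the `split` field."""
    return Counter(r["split"] for r in records)


def join_post(record, sep="\n\n"):
    """Join title, question, and answer; `sep` is an assumed separator."""
    parts = (record["title"], record["q"], record["a"])
    return sep.join(part for part in parts if part)


# Invented records using the documented raw-data field names.
posts = [
    {"title": "T1", "q": "Q1", "a": "A1", "split": "train", "index": 0},
    {"title": "T2", "q": "Q2", "a": "A2", "split": "train", "index": 1},
    {"title": "T3", "q": "Q3", "a": "A3", "split": "test", "index": 2},
]

print(split_counts(posts))
print(join_post(posts[0]))
```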
## License
All content is released under the **CC BY-SA 4.0** license, which allows for:
- ✅ Free use and distribution
- ✅ Commercial use
- ✅ Modification and derivatives
- ⚠️ Attribution required
- ⚠️ Share-alike requirements
## 🚀 Quick Start
```python
from datasets import load_dataset, get_dataset_config_names

# Print the available configs for the dataset
configs = get_dataset_config_names("jablonkagroup/chempile-reasoning")
print(f"Available configs: {configs}")
# Available configs: ['chemistry_stackexchange-completion_0', 'chemistry_stackexchang...

# Load the first config (chemistry_stackexchange-completion_0)
dataset = load_dataset("jablonkagroup/chempile-reasoning", name=configs[0])
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'input', 'output', 'answer_choices', 'correct_output_index'],
#         num_rows: 3207
#     })
#     test: Dataset({
#         features: ['text', 'input', 'output', 'answer_choices', 'correct_output_index'],
#         num_rows: 687
#     })
#     val: Dataset({
#         features: ['text', 'input', 'output', 'answer_choices', 'correct_output_index'],
#         num_rows: 687
#     })
# })

# Inspect the first example of the first split
split_name = list(dataset.keys())[0]
sample = dataset[split_name][0]
print(sample)
# {
#     'text': 'The answer to the query "We know that the...
#     'input': 'The answer to the query "We know that the...
#     'output': '',
#     'answer_choices': [],
#     'correct_output_index': None
# }
```
## 🎯 Use Cases
- **🧪 Scientific Reasoning**: Training models for complex chemical and physical reasoning tasks
- **📊 Spectral Analysis**: Building systems for automated spectral interpretation and structure elucidation
- **🔬 Educational AI**: Developing tutoring systems for chemistry and materials science education
- **❓ Question Answering**: Advanced scientific question-answering systems for research support
- **🤖 Research Assistance**: Automated analysis and interpretation of scientific problems
## ⚠️ Limitations & Considerations
- **Language**: Primarily English content (monolingual dataset)
- **Scope**: Focused on chemistry, physics, and materials science; specialized domain knowledge required
- **Quality**: Variable quality across sources; some reasoning traces may contain errors or inconsistencies
- **Bias**: Reflects biases present in Stack Exchange communities and model-generated content
- **Complexity**: Contains advanced scientific concepts that may require domain expertise to validate
## 🛠️ Data Processing Pipeline
1. **Collection**: Automated extraction from Stack Exchange platforms and model reasoning traces
2. **Filtering**: Domain-specific filtering for chemistry, physics, and materials science relevance
3. **Format Conversion**: Multiple formatting approaches (completion, instruction, raw data)
4. **Quality Control**: Expert validation and automated filtering
5. **Reasoning Extraction**: Parsing and structuring of model reasoning traces
6. **Standardization**: Consistent formatting and metadata extraction
7. **Validation**: Train/validation/test splits and quality checks
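
The format conversion step (step 3) can be pictured with the sketch below, which turns one raw record into a completion-style and an instruction-style example. Both templates are hypothetical: only the opening words of the completion template echo the Quick Start sample, and the actual templates used by the authors are not given in this card.

```python
# Hypothetical templates; the dataset's real templates are not documented here.
COMPLETION_TEMPLATE = 'The answer to the query "{q}" is "{a}".'
INSTRUCTION_TEMPLATE = "Answer the following question.\n\nQuestion: {q}"


def to_completion(record):
    """Render a raw record as one block of text (completion style).

    Mirrors the Quick Start sample, where `input` repeats `text`
    and `output` is empty.
    """
    text = COMPLETION_TEMPLATE.format(q=record["q"], a=record["a"])
    return {"text": text, "input": text, "output": "",
            "answer_choices": [], "correct_output_index": None}


def to_instruction(record):
    """Render a raw record as a prompt plus expected answer (instruction style)."""
    prompt = INSTRUCTION_TEMPLATE.format(q=record["q"])
    return {"text": prompt + "\n\n" + record["a"], "input": prompt,
            "output": record["a"],
            "answer_choices": [], "correct_output_index": None}


# Invented raw record for illustration only.
raw = {"title": "Boiling point", "q": "Why does salt raise the boiling point?",
       "a": "Dissolved ions lower the vapor pressure of the solvent.",
       "split": "train", "index": 0}

completion = to_completion(raw)
instruction = to_instruction(raw)
print(completion["text"])
print(instruction["input"])
```

Keeping the templates as module-level constants makes it straightforward to add further variants, which is one plausible reading of the numbered `completion_0`/`completion_1` and `instruction_0` to `instruction_2` configs.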
## 🏗️ ChemPile Collection
This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences.
### Collection Overview
- **📊 Scale**: 75+ billion tokens across multiple modalities
- **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, reasoning traces, and molecular images
- **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature
- **🔬 Curation**: Extensive expert curation and validation
- **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation
- **🌐 Availability**: Openly released via Hugging Face
## 📄 Citation
If you use this dataset in your research, please cite:
```bibtex
@article{mirza2025chempile0,
title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others},
year = {2025},
journal = {arXiv preprint arXiv:2505.12534}
}
```
## 👥 Contact & Support
- **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
- **Website**: [ChemPile Project](https://chempile.lamalab.org/)
- **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-reasoning)
- **Issues**: Please report data issues or questions via the Hugging Face dataset page
---
<div align="center">

<i>Part of the ChemPile project - Advancing AI for Chemical Sciences</i>
</div>
Provider: maas
Created: 2025-05-28



