chempile-caption
Source: ModelScope community. Updated 2025-12-05. Indexed 2025-06-28.
Download link:
https://modelscope.cn/datasets/jablonkagroup/chempile-caption
Resource description:
# ChemPile-Caption
<div align="center">

[Dataset](https://huggingface.co/datasets/jablonkagroup/chempile-caption)
[License: CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
[Paper](https://arxiv.org/abs/2505.12534)
[Website](https://chempile.lamalab.org/)
*A comprehensive collection of image-caption pairs for chemistry-related visual content*
</div>
ChemPile-Caption is a dataset designed for image captioning tasks in the field of chemistry. It is part of the ChemPile project, which aims to create a comprehensive collection of chemistry-related data for training language models. The dataset pairs images of chemical structures, reactions, and laboratory equipment, scraped from LibreTexts, with captions that describe their content.
All content is released under the CC BY-NC-SA 4.0 license, which allows non-commercial use and adaptation with proper attribution.
The dataset originates from the LibreTexts project, which provides a wealth of educational resources in chemistry. The images are sourced from various LibreTexts pages, covering a diverse range of chemical topics and visual representations. To obtain them, an in-house web scraping process was run over all the books in LibreTexts Chemistry. The images were downloaded and stored in a structured format, each associated with its caption and alt text. Images whose caption or alt text was not long enough were filtered out.
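The length filter described above might look roughly like the following. This is a minimal sketch, not the project's actual scraper: the `ScrapedImage` record, the `combined_text` and `keep_image` helpers, the 20-character threshold, and the choice to measure the combined alt text and caption are all assumptions made for illustration, since the card does not state the real cutoff or rule.

```python
from dataclasses import dataclass

# Hypothetical threshold: the dataset card does not state the real cutoff.
MIN_TEXT_CHARS = 20


@dataclass
class ScrapedImage:
    """Hypothetical record for one scraped image (not the project's schema)."""
    url: str
    alt_text: str
    caption: str


def combined_text(item: ScrapedImage) -> str:
    """Join alt text and caption with a space, skipping empty parts."""
    parts = [item.alt_text.strip(), item.caption.strip()]
    return " ".join(p for p in parts if p)


def keep_image(item: ScrapedImage, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Keep an image only if its combined description is long enough."""
    return len(combined_text(item)) >= min_chars


# Placeholder URLs; these are not real LibreTexts pages.
candidates = [
    ScrapedImage("https://example.invalid/a.png", "Beaker", ""),
    ScrapedImage(
        "https://example.invalid/b.png",
        "Structure of benzene",
        "Figure 1: Kekulé structure of benzene.",
    ),
]

kept = [c for c in candidates if keep_image(c)]
print([c.url for c in kept])
```

Here the first candidate is dropped because its only description is a single short word, while the second has enough text to be kept.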
The dataset has a single default configuration, which keeps loading and usage straightforward. The configuration includes the following fields:
- text: The alt text plus the caption of the image, providing a detailed description of the image content.
- image: The image, allowing users to access the visual content directly.
In total, ChemPile-Caption contains roughly 100K image-caption pairs.
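The exact total and the share of each split can be checked with a little arithmetic. The sketch below uses only the split sizes reported in the Usage section of this card. The variable names are illustrative.

```python
# Split sizes as reported in the Usage section of this card.
split_sizes = {"train": 90350, "validation": 5019, "test": 5020}

total = sum(split_sizes.values())
shares = {name: n / total for name, n in split_sizes.items()}

print(total)  # 100389
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The splits add up to 100,389 pairs, which corresponds to an approximately 90/5/5 train/validation/test division.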
## Usage
```python
from datasets import load_dataset

dataset = load_dataset("jablonkagroup/chempile-caption")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'image'],
#         num_rows: 90350
#     })
#     validation: Dataset({
#         features: ['text', 'image'],
#         num_rows: 5019
#     })
#     test: Dataset({
#         features: ['text', 'image'],
#         num_rows: 5020
#     })
# })

sample = dataset['train'][0]
print(f"Sample caption: {sample}")
# Sample caption: {'text': '2 drawings and a photograph, as described...', 'image': <PIL...}
```
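Beyond printing a single record, a common next step is to look at how long the captions are. The following is a minimal sketch that assumes each record is a mapping with a `text` field, as described above. `caption_word_counts` and `summarize` are hypothetical helpers, not part of the `datasets` library, and the toy records stand in for real rows so that the example runs without downloading anything.

```python
from statistics import mean, median
from typing import Iterable, Mapping


def caption_word_counts(records: Iterable[Mapping]) -> list[int]:
    """Count whitespace-separated words in each record's 'text' field."""
    return [len(record["text"].split()) for record in records]


def summarize(counts: list[int]) -> dict:
    """Return basic descriptive statistics for a list of word counts."""
    return {
        "n": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "median": median(counts),
    }


# Toy stand-ins for dataset rows; real rows carry a PIL image in 'image'.
toy_records = [
    {"text": "2 drawings and a photograph", "image": None},
    {"text": "Ball-and-stick model of a water molecule", "image": None},
]

stats = summarize(caption_word_counts(toy_records))
print(stats)
```

With the real dataset, the same helper could be applied to a text-only view of a split, for example `caption_word_counts(dataset["train"].select_columns(["text"]))`, which should avoid decoding every image while iterating.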
## 🏗️ ChemPile Collection
This dataset is part of the **ChemPile** collection, a comprehensive open dataset containing over 75 billion tokens of curated chemical data for training and evaluating general-purpose models in the chemical sciences.
### Collection Overview
- **📊 Scale**: 75+ billion tokens across multiple modalities
- **🧬 Modalities**: Structured representations (SMILES, SELFIES, IUPAC, InChI), scientific text, executable code, reasoning traces, and molecular images
- **🎯 Design**: Integrates foundational educational knowledge with specialized scientific literature
- **🔬 Curation**: Extensive expert curation and validation
- **📈 Benchmarking**: Standardized train/validation/test splits for robust evaluation
- **🌐 Availability**: Openly released via Hugging Face
## 📄 Citation
If you use this dataset in your research, please cite:
```bibtex
@article{mirza2025chempile0,
  title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
  author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and others},
  year = {2025},
  journal = {arXiv preprint arXiv:2505.12534}
}
```
## 👥 Contact & Support
- **Paper**: [arXiv:2505.12534](https://arxiv.org/abs/2505.12534)
- **Website**: [ChemPile Project](https://chempile.lamalab.org/)
- **Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/chempile-caption)
- **Issues**: Please report data issues or questions via the Hugging Face dataset page
---
<div align="center">

<i>Part of the ChemPile project - Advancing AI for Chemical Sciences</i>
</div>
Provider:
maas
Created:
2025-05-27



