Mirinterplay/MarkushGrapher-2-Datasets
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Mirinterplay/MarkushGrapher-2-Datasets
---
dataset_info:
- config_name: ip5-markush
  features:
  - name: id
    dtype: string
  - name: page_image_path
    dtype: string
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  - name: page_image
    dtype: image
  splits:
  - name: test
    num_bytes: 84715528
    num_examples: 878
  download_size: 84505277
  dataset_size: 84715528
- config_name: m2s
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: test
    num_bytes: 18029421
    num_examples: 103
  download_size: 17985733
  dataset_size: 18029421
- config_name: uspto-markush
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: test
    num_bytes: 5196740
    num_examples: 74
  download_size: 5179068
  dataset_size: 5196740
- config_name: uspto-mol-m-54k
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 2707938707
    num_examples: 54785
  - name: test
    num_bytes: 10645945
    num_examples: 200
  download_size: 2675522805
  dataset_size: 2718584652
configs:
- config_name: ip5-markush
  data_files:
  - split: test
    path: ip5-markush/test-*
- config_name: m2s
  data_files:
  - split: test
    path: m2s/test-*
- config_name: uspto-markush
  data_files:
  - split: test
    path: uspto-markush/test-*
- config_name: uspto-mol-m-54k
  data_files:
  - split: train
    path: uspto-mol-m-54k/train-*
  - split: test
    path: uspto-mol-m-54k/test-*
---
# MarkushGrapher 2 Datasets
Datasets for training and evaluating **MarkushGrapher 2**, a model for converting patent Markush structure images into CXSMILES representations.
## Dataset Subsets
| Subset | Train | Test | Description | OCR |
|---|---|---|---|---|
| `uspto-mol-m-54k` | 54,785 | 200 | USPTO-MOL-M Markush samples | ChemicalOCR predictions |
| `uspto-markush` | - | 74 | USPTO Markush structures benchmark | Ground-truth OCR |
| `m2s` | - | 103 | Mol2Smiles (M2S) benchmark | Ground-truth OCR |
| `ip5-markush` | - | 878 | IP5 Markush structures benchmark | Ground-truth OCR |
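
The split sizes in the table can double as a sanity check after a download. The sketch below is a minimal illustration, not part of the dataset or of the `datasets` library: `EXPECTED_SIZES` and `check_split_sizes` are hypothetical names, and the keys assume the config names declared in the card's metadata. The helper accepts any mapping from split name to a sized collection, which should include the `DatasetDict` returned by `load_dataset`.

```python
from typing import Mapping, Sized

# Expected number of examples per config and split, copied from the dataset card.
EXPECTED_SIZES = {
    "uspto-mol-m-54k": {"train": 54785, "test": 200},
    "uspto-markush": {"test": 74},
    "m2s": {"test": 103},
    "ip5-markush": {"test": 878},
}


def check_split_sizes(config: str, splits: Mapping[str, Sized]) -> list[str]:
    """Return human-readable mismatches; an empty list means all sizes match."""
    problems = []
    for split, expected in EXPECTED_SIZES[config].items():
        if split not in splits:
            problems.append(f"{config}/{split}: split is missing")
        elif len(splits[split]) != expected:
            problems.append(
                f"{config}/{split}: expected {expected} examples, "
                f"found {len(splits[split])}"
            )
    return problems
```

Passing the result of loading the `m2s` config as the second argument should return an empty list when the download is complete.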
## Features
Each sample contains:
- **`page_image`**: Input patent image (PIL Image, typically 1024×1024)
- **`cells`**: OCR text cells, each with a bounding box (`bbox`, a list of floats in normalized coordinates) and its recognized text (`text`)
- **`cxsmiles`**: Ground-truth CXSMILES representation
- **`cxsmiles_opt`**: Optimized (tokenizer-friendly) CXSMILES representation
- **`cxsmiles_dataset`**: Original CXSMILES from the source dataset
- **`annotation`**: Annotation metadata used to train the model
- **`image_name`**: Source image filename (the `ip5-markush` config provides `page_image_path` instead)
- **`id`**: Sample identifier (`int64`, except in `ip5-markush`, where it is a string)
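
Because `bbox` values are normalized, they usually need scaling before they can be drawn on `page_image`. The sketch below shows one way to do that. It rests on an assumption the card does not state: that each `bbox` is ordered `(x_min, y_min, x_max, y_max)` with values in `[0, 1]`. The function name `bbox_to_pixels` and the sample cell are made up for illustration.

```python
from typing import Sequence


def bbox_to_pixels(
    bbox: Sequence[float], width: int, height: int
) -> tuple[int, int, int, int]:
    """Scale a normalized box to integer pixel coordinates.

    ASSUMPTION (not stated in the card): `bbox` is ordered
    (x_min, y_min, x_max, y_max) and each value lies in [0, 1].
    """
    x_min, y_min, x_max, y_max = bbox
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )


# A hand-written cell in the documented shape: a `bbox` list and a `text` string.
cell = {"bbox": [0.25, 0.5, 0.75, 0.625], "text": "R1"}
print(bbox_to_pixels(cell["bbox"], width=1024, height=1024))
# -> (256, 512, 768, 640)
```

With a loaded sample, the width and height can be read from the image itself via `sample["page_image"].size`, which avoids hard-coding 1024. If the real boxes turn out to use a different ordering, only the unpacking line needs to change.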
## Usage
```python
from datasets import load_dataset
# Load a specific subset
dataset = load_dataset("docling-project/MarkushGrapher-2-Datasets", "uspto-mol-m-54k")
# Load a benchmark subset
benchmark = load_dataset("docling-project/MarkushGrapher-2-Datasets", "m2s")
```
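
Once a subset is loaded, individual samples behave like dictionaries keyed by the feature names. The following sketch pulls a few fields into a compact summary. `summarize_sample` and `first_test_sample` are hypothetical helpers, and the code assumes that `cells` comes back as a list of dictionaries, as the `list:` declaration in the metadata suggests. Fetching a sample requires network access and the `datasets` package.

```python
from typing import Any, Mapping


def summarize_sample(sample: Mapping[str, Any]) -> dict[str, Any]:
    """Collect a few fields of one sample into a compact, printable summary."""
    # ASSUMPTION: `cells` is a list of {"bbox": [...], "text": "..."} dicts.
    cells = sample.get("cells") or []
    return {
        "id": sample.get("id"),
        "cxsmiles": sample.get("cxsmiles"),
        "num_cells": len(cells),
        "cell_texts": [cell["text"] for cell in cells],
    }


def first_test_sample(config: str = "m2s") -> Mapping[str, Any]:
    """Fetch the first test sample of a config (requires network access)."""
    # Imported lazily so summarize_sample stays usable without `datasets`.
    from datasets import load_dataset

    dataset = load_dataset(
        "docling-project/MarkushGrapher-2-Datasets", config, split="test"
    )
    return dataset[0]
```

Calling `summarize_sample(first_test_sample())` gives a quick look at one benchmark entry without printing the image bytes.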
## Note
**MarkushGrapher-2** is also trained on the following datasets:
- **Phase 1:** 243k real-world image–SMILES pairs from [MolScribe](https://huggingface.co/yujieq/MolScribe)
- **Phase 2:**
  - 235k synthetically generated image–CXSMILES pairs from [MarkushGrapher-Datasets (v1)](https://huggingface.co/datasets/docling-project/MarkushGrapher-Datasets/viewer/markushgrapher-synthetic-training)
  - 91k samples from [MolParser Dataset](https://huggingface.co/datasets/UniParser/MolParser-7M/viewer/sft_real)
## Citation
If you use this dataset, please cite:
```bibtex
@inproceedings{strohmeyer2026markushgrapher2,
title = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
author = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
```
### License
This dataset is released under the Creative Commons Attribution 4.0 (CC BY 4.0) License.
Provider: Mirinterplay



