Mirinterplay/MarkushGrapher-2-Datasets

Name: Mirinterplay/MarkushGrapher-2-Datasets
Creator: Mirinterplay
Published: 2026-04-08 05:55:08
License: No description provided

Hugging Face | Updated 2026-04-08 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Mirinterplay/MarkushGrapher-2-Datasets


Resource description:

---
dataset_info:
- config_name: ip5-markush
  features:
  - name: id
    dtype: string
  - name: page_image_path
    dtype: string
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  - name: page_image
    dtype: image
  splits:
  - name: test
    num_bytes: 84715528
    num_examples: 878
  download_size: 84505277
  dataset_size: 84715528
- config_name: m2s
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: test
    num_bytes: 18029421
    num_examples: 103
  download_size: 17985733
  dataset_size: 18029421
- config_name: uspto-markush
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: test
    num_bytes: 5196740
    num_examples: 74
  download_size: 5179068
  dataset_size: 5196740
- config_name: uspto-mol-m-54k
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 2707938707
    num_examples: 54785
  - name: test
    num_bytes: 10645945
    num_examples: 200
  download_size: 2675522805
  dataset_size: 2718584652
configs:
- config_name: ip5-markush
  data_files:
  - split: test
    path: ip5-markush/test-*
- config_name: m2s
  data_files:
  - split: test
    path: m2s/test-*
- config_name: uspto-markush
  data_files:
  - split: test
    path: uspto-markush/test-*
- config_name: uspto-mol-m-54k
  data_files:
  - split: train
    path: uspto-mol-m-54k/train-*
  - split: test
    path: uspto-mol-m-54k/test-*
---

# MarkushGrapher 2 Datasets

Datasets for training and evaluating **MarkushGrapher 2**, a model that converts images of patent Markush structures into CXSMILES representations.

## Dataset Subsets

| Subset | Train | Test | Description | OCR |
|---|---|---|---|---|
| `uspto-mol-m-54k` | 54,785 | 200 | USPTO-MOL-M Markush samples | ChemicalOCR predictions |
| `uspto-markush` | - | 74 | USPTO Markush structures benchmark | Ground-truth OCR |
| `m2s` | - | 103 | Mol2Smiles (M2S) benchmark | Ground-truth OCR |
| `ip5-markush` | - | 878 | IP5 Markush structures benchmark | Ground-truth OCR |

Subset names are written as they appear under `config_name` in the metadata above, which is the form `load_dataset` expects.

## Features

Each sample contains:

- **`page_image`**: Input patent image (PIL Image, typically 1024×1024)
- **`cells`**: OCR-detected text cells, each with a bounding box (`bbox`, in normalized coordinates) and its `text`
- **`cxsmiles`**: Ground-truth CXSMILES representation
- **`cxsmiles_opt`**: Optimized (tokenizer-friendly) CXSMILES representation
- **`cxsmiles_dataset`**: Original CXSMILES from the source dataset
- **`annotation`**: Annotation metadata (used to train the model)
- **`image_name`**: Source image filename (per the metadata above, `ip5-markush` provides `page_image_path` instead)
- **`id`**: Sample identifier (a string in `ip5-markush`, an integer in the other subsets)

## Usage

```python
from datasets import load_dataset

# Load a specific subset
dataset = load_dataset("docling-project/MarkushGrapher-2-Datasets", "uspto-mol-m-54k")

# Load a benchmark subset
benchmark = load_dataset("docling-project/MarkushGrapher-2-Datasets", "m2s")
```

## Note

**MarkushGrapher-2** is also trained on the following datasets:

- **Phase 1:** 243k real-world image–SMILES pairs from [MolScribe](https://huggingface.co/yujieq/MolScribe)
- **Phase 2:**
  - 235k synthetically generated image–CXSMILES pairs from [MarkushGrapher-Datasets (v1)](https://huggingface.co/datasets/docling-project/MarkushGrapher-Datasets/viewer/markushgrapher-synthetic-training)
  - 91k samples from [MolParser Dataset](https://huggingface.co/datasets/UniParser/MolParser-7M/viewer/sft_real)

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{strohmeyer2026markushgrapher2,
  title     = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author    = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

### License

This dataset is released under the Creative Commons Attribution 4.0 License.
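
## Example: scaling `cells` boxes to pixels

The feature list describes each entry of `cells` as a `bbox` in normalized coordinates plus its `text`. To overlay the boxes on `page_image`, they first have to be scaled to pixel units. The sketch below shows one way to do that. It assumes each `bbox` is ordered `[x0, y0, x1, y1]` with values in the range 0 to 1; the dataset card does not state the ordering, so check a few samples before relying on it. The helper names `bbox_to_pixels` and `cells_to_pixels` are hypothetical and not part of any library.

```python
from typing import Dict, List, Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Scale one normalized box to integer pixel coordinates.

    Assumption (not confirmed by the dataset card): bbox is
    [x0, y0, x1, y1] with every value in [0, 1].
    """
    x0, y0, x1, y1 = bbox
    return (
        round(x0 * width),
        round(y0 * height),
        round(x1 * width),
        round(y1 * height),
    )


def cells_to_pixels(cells: List[Dict], width: int, height: int) -> List[Dict]:
    """Return copies of the OCR cells with their boxes in pixel units."""
    return [
        {"text": cell["text"], "bbox": bbox_to_pixels(cell["bbox"], width, height)}
        for cell in cells
    ]


# Hand-written cell in the same shape as the `cells` feature (not real data).
example_cells = [{"bbox": [0.25, 0.5, 0.5, 0.75], "text": "R1"}]
print(cells_to_pixels(example_cells, 1024, 1024))
# → [{'text': 'R1', 'bbox': (256, 512, 512, 768)}]
```

With a real sample, the width and height would come from the image itself, for example `sample["page_image"].size`, since the card only says images are typically 1024×1024.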


Provider:

Mirinterplay
