
Mirinterplay/MarkushGrapher-2-Datasets

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Mirinterplay/MarkushGrapher-2-Datasets
Description:
---
dataset_info:
- config_name: ip5-markush
  features:
  - name: id
    dtype: string
  - name: page_image_path
    dtype: string
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  - name: page_image
    dtype: image
  splits:
  - name: test
    num_bytes: 84715528
    num_examples: 878
  download_size: 84505277
  dataset_size: 84715528
- config_name: m2s
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: test
    num_bytes: 18029421
    num_examples: 103
  download_size: 17985733
  dataset_size: 18029421
- config_name: uspto-markush
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: test
    num_bytes: 5196740
    num_examples: 74
  download_size: 5179068
  dataset_size: 5196740
- config_name: uspto-mol-m-54k
  features:
  - name: id
    dtype: int64
  - name: image_name
    dtype: string
  - name: page_image
    dtype: image
  - name: annotation
    dtype: string
  - name: cxsmiles_dataset
    dtype: string
  - name: cxsmiles
    dtype: string
  - name: cxsmiles_opt
    dtype: string
  - name: cells
    list:
    - name: bbox
      list: float64
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 2707938707
    num_examples: 54785
  - name: test
    num_bytes: 10645945
    num_examples: 200
  download_size: 2675522805
  dataset_size: 2718584652
configs:
- config_name: ip5-markush
  data_files:
  - split: test
    path: ip5-markush/test-*
- config_name: m2s
  data_files:
  - split: test
    path: m2s/test-*
- config_name: uspto-markush
  data_files:
  - split: test
    path: uspto-markush/test-*
- config_name: uspto-mol-m-54k
  data_files:
  - split: train
    path: uspto-mol-m-54k/train-*
  - split: test
    path: uspto-mol-m-54k/test-*
---

# MarkushGrapher 2 Datasets

Datasets for training and evaluating **MarkushGrapher 2**, a model that converts images of Markush structures from patents into CXSMILES representations.

## Dataset Subsets

| Subset | Train | Test | Description | OCR |
|---|---|---|---|---|
| `uspto-mol-m-54k` | 54,785 | 200 | USPTO-MOL-M Markush samples | ChemicalOCR predictions |
| `uspto-markush` | n/a | 74 | USPTO Markush structures benchmark | Ground-truth OCR |
| `m2s` | n/a | 103 | Mol2Smiles (M2S) benchmark | Ground-truth OCR |
| `ip5-markush` | n/a | 878 | IP5 Markush structures benchmark | Ground-truth OCR |

Subset names are given as they appear in the `config_name` entries of the metadata above; these are the names to pass to `load_dataset`.

## Features

Each sample contains:

- **`page_image`**: Input patent image (PIL Image, typically 1024×1024)
- **`cells`**: OCR-detected text cells, each with a bounding box (`bbox`, in normalized coordinates) and its `text`
- **`cxsmiles`**: Ground-truth CXSMILES representation
- **`cxsmiles_opt`**: Optimized (tokenizer-friendly) CXSMILES representation
- **`cxsmiles_dataset`**: Original CXSMILES from the source dataset
- **`annotation`**: Annotation metadata (used to train the model)
- **`image_name`**: Source image filename (the `ip5-markush` subset provides `page_image_path` instead)
- **`id`**: Sample identifier (an integer in most subsets, a string in `ip5-markush`)

## Usage

```python
from datasets import load_dataset

# Load a specific subset
dataset = load_dataset("docling-project/MarkushGrapher-2-Datasets", "uspto-mol-m-54k")

# Load a benchmark subset
benchmark = load_dataset("docling-project/MarkushGrapher-2-Datasets", "m2s")
```

## Note

**MarkushGrapher-2** is also trained on the following datasets:

- **Phase 1:** 243k real-world image-SMILES pairs from [MolScribe](https://huggingface.co/yujieq/MolScribe)
- **Phase 2:**
  - 235k synthetically generated image-CXSMILES pairs from [MarkushGrapher-Datasets (v1)](https://huggingface.co/datasets/docling-project/MarkushGrapher-Datasets/viewer/markushgrapher-synthetic-training)
  - 91k samples from [MolParser Dataset](https://huggingface.co/datasets/UniParser/MolParser-7M/viewer/sft_real)

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{strohmeyer2026markushgrapher2,
  title     = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author    = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

### License

This dataset is released under the Creative Commons Attribution 4.0 License.
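The card describes each entry of `cells` as a `bbox` in normalized coordinates plus a `text` string, but does not spell out the box layout. The following is a minimal sketch of mapping those boxes onto image pixels, under two assumptions that should be checked against real samples: that `bbox` is ordered `[x_min, y_min, x_max, y_max]` with values in the range 0 to 1, and that `cells` arrives as a list of dicts. The helper name `cells_to_pixel_boxes` is hypothetical and not part of the dataset or of any library.

```python
from typing import Any


def cells_to_pixel_boxes(
    sample: dict[str, Any], width: int, height: int
) -> list[tuple[str, tuple[int, int, int, int]]]:
    """Scale each cell's normalized bbox to integer pixel coordinates.

    ASSUMPTION: bbox is [x_min, y_min, x_max, y_max] with values in [0, 1].
    The dataset card only says "normalized coordinates", so verify this
    against a few real samples before relying on it.
    """
    boxes = []
    for cell in sample["cells"]:
        x0, y0, x1, y1 = cell["bbox"]
        pixel_box = (
            round(x0 * width),
            round(y0 * height),
            round(x1 * width),
            round(y1 * height),
        )
        boxes.append((cell["text"], pixel_box))
    return boxes


# Synthetic sample shaped like one dataset row (not real data).
sample = {"cells": [{"bbox": [0.25, 0.5, 0.75, 1.0], "text": "R1"}]}
print(cells_to_pixel_boxes(sample, 1024, 1024))
# → [('R1', (256, 512, 768, 1024))]
```

With a real row, `width` and `height` would normally come from the image itself, for example `sample["page_image"].size`, rather than from the typical 1024×1024 size mentioned in the card.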

Provider:
Mirinterplay