MarkushGrapher-Datasets
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-04-26.
Download link:
https://modelscope.cn/datasets/ds4sd/MarkushGrapher-Datasets
Description:
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64d38f55f8082bf19b7339e0/V43x-_idEdiCQIfbm0eVM.jpeg" alt="Description" width="800">
</div>
This repository contains datasets introduced in [MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures](https://github.com/DS4SD/MarkushGrapher).
Training:
- **MarkushGrapher-Synthetic-Training**: This set contains the synthetic Markush structures used to train MarkushGrapher. Samples are generated in four steps: (1) SMILES to CXSMILES conversion using RDKit; (2) CXSMILES rendering using CDK; (3) text description generation using templates; and (4) text description augmentation with an LLM.
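To make steps (1) and (3) more concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline, which uses RDKit and CDK. It only illustrates the shape of the data: a CXSMILES string is a SMILES string followed by an extension block between `|` characters, in which `$...$` carries one semicolon-separated label field per atom. The helper names `cxsmiles_with_labels` and `describe_r_group`, the sentence template, and the example molecule are all hypothetical.

```python
from typing import Dict, List


def cxsmiles_with_labels(smiles: str, n_atoms: int, labels: Dict[int, str]) -> str:
    """Append a CXSMILES atom-label block to a plain SMILES string.

    `labels` maps a zero-based atom index to its label (e.g. "_R1").
    The caller supplies `n_atoms`; this sketch does not parse the SMILES.
    """
    # One field per atom, in SMILES order; unlabeled atoms get an empty field.
    fields = [labels.get(i, "") for i in range(n_atoms)]
    return f"{smiles} |${';'.join(fields)}$|"


def describe_r_group(label: str, substituents: List[str]) -> str:
    """Fill a fixed sentence template for one R-group definition (hypothetical template)."""
    if not substituents:
        raise ValueError("at least one substituent is required")
    if len(substituents) == 1:
        options = substituents[0]
    else:
        options = ", ".join(substituents[:-1]) + " or " + substituents[-1]
    return f"{label} is selected from {options}."


# A benzene ring with one variable substituent, labeled as R-group R1.
cx = cxsmiles_with_labels("*c1ccccc1", n_atoms=7, labels={0: "_R1"})
text = describe_r_group("R1", ["methyl", "ethyl", "phenyl"])
print(cx)    # *c1ccccc1 |$_R1;;;;;;$|
print(text)  # R1 is selected from methyl, ethyl or phenyl.
```

In the actual dataset, the rendering step (2) and the LLM augmentation step (4) would then turn such pairs into images and varied descriptions; those steps are not sketched here.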
Benchmarks:
- **M2S**: This set contains 103 real Markush structures from patent documents. Each sample pairs a crop of the Markush structure backbone image with a crop of its textual description. Samples are extracted from documents published by the USPTO, EPO, and WIPO.
- **USPTO-Markush**: This set contains 75 real Markush structure backbone images from patent documents. They are extracted from documents published by the USPTO.
- **MarkushGrapher-Synthetic**: This set contains 1000 synthetic Markush structures. Its images are sampled so that, overall, each Markush feature (R-groups, 'm' sections, and 'Sg' sections) is represented evenly.
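One simple way to obtain an evenly represented benchmark is to draw the same number of samples from each feature group. The sketch below shows that idea under loud assumptions: the card does not describe the authors' sampling procedure, the `feature` field and its values are hypothetical, and each record carries a single feature tag for simplicity, whereas a real Markush structure may combine several features.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def balanced_sample(records: Sequence[dict], key: str, per_group: int, seed: int = 0) -> List[dict]:
    """Draw the same number of records from every distinct value of `key`."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample: List[dict] = []
    for name in sorted(groups):  # sorted for a deterministic group order
        members = groups[name]
        if len(members) < per_group:
            raise ValueError(f"group {name!r} has only {len(members)} records")
        sample.extend(rng.sample(members, per_group))
    return sample


# Hypothetical pool: 30 records, 10 per feature tag.
pool = [
    {"id": i, "feature": f}
    for i, f in enumerate(["r_group", "m_section", "sg_section"] * 10)
]
subset = balanced_sample(pool, key="feature", per_group=3)
print(len(subset))  # 9
```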
An example of how to read the dataset is provided in [dataset_explorer.ipynb](https://huggingface.co/datasets/ds4sd/MarkushGrapher-Datasets/blob/main/dataset_explorer.ipynb).
Provider:
maas
Created:
2025-04-22



