
docling-project/MarkushGrapher-Datasets

Source: Hugging Face. Updated 2025-06-05. Indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/docling-project/MarkushGrapher-Datasets
Resource description:
---
license: cc-by-4.0
configs:
- config_name: m2s
  data_files:
  - split: test
    path: "m2s/test/*.arrow"
- config_name: uspto-markush
  data_files:
  - split: test
    path: "uspto-markush/test/*.arrow"
- config_name: markushgrapher-synthetic
  data_files:
  - split: test
    path: "markushgrapher-synthetic/test/*.arrow"
- config_name: markushgrapher-synthetic-training
  data_files:
  - split: train
    path: "markushgrapher-synthetic-training/train/*.arrow"
  - split: test
    path: "markushgrapher-synthetic-training/test/*.arrow"
---

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64d38f55f8082bf19b7339e0/V43x-_idEdiCQIfbm0eVM.jpeg" alt="Description" width="800">
</div>

This repository contains the datasets introduced in [MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures](https://github.com/DS4SD/MarkushGrapher).

Training:
- **MarkushGrapher-Synthetic-Training**: This set contains synthetic Markush structures used for training MarkushGrapher. Samples are generated in four steps: (1) SMILES to CXSMILES conversion using RDKit; (2) CXSMILES rendering using CDK; (3) text description generation using templates; and (4) text description augmentation with an LLM.

Benchmarks:
- **M2S**: This set contains 103 real Markush structures from patent documents. Each sample is a crop of a Markush structure backbone image together with its textual description. The samples are extracted from documents published by the USPTO, EPO and WIPO.
- **USPTO-Markush**: This set contains 75 real Markush structure backbone images from patent documents. They are extracted from documents published by the USPTO.
- **MarkushGrapher-Synthetic**: This set contains 1000 synthetic Markush structures. Its images are sampled so that, overall, each Markush feature (R-groups, 'm' and 'Sg' sections) is represented evenly.

An example of how to read the dataset is provided in [dataset_explorer.ipynb](https://huggingface.co/datasets/ds4sd/MarkushGrapher-Datasets/blob/main/dataset_explorer.ipynb).
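
As a complement to the notebook, here is a minimal sketch of loading one configuration with the Hugging Face `datasets` library. The configuration and split names are copied from the YAML header of the dataset card. The repository ID `docling-project/MarkushGrapher-Datasets` is taken from the download link on this page and is an assumption, since the notebook link uses the `ds4sd` namespace. The helper `load_markush` is a hypothetical name introduced here for illustration, not part of the repository.

```python
# Configurations and splits as declared in the dataset card's YAML header.
CONFIG_SPLITS = {
    "m2s": ["test"],
    "uspto-markush": ["test"],
    "markushgrapher-synthetic": ["test"],
    "markushgrapher-synthetic-training": ["train", "test"],
}

# Assumption: repository ID taken from this page's download link.
REPO_ID = "docling-project/MarkushGrapher-Datasets"


def load_markush(config: str, split: str):
    """Hypothetical helper: validate the names, then load one split."""
    if config not in CONFIG_SPLITS:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}"
        )
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(
            f"config {config!r} has no split {split!r}; "
            f"available: {CONFIG_SPLITS[config]}"
        )
    # Imported lazily so the name validation works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)


# Example call (downloads data from the Hub, so it is left commented out):
# m2s_test = load_markush("m2s", "test")
# print(m2s_test)
```

Validating the names before calling `load_dataset` turns a typo into an immediate local error rather than a failed download. The column names of each configuration are not listed on this page, so inspect the returned dataset (for example by printing it) before relying on specific fields.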
Provided by:
docling-project