USPTO-30K

Name: USPTO-30K
Creator: maas
Published: 2026-01-07 14:11:56
License: no description available

ModelScope community, updated 2026-01-07, indexed 2025-01-25

Download link:

https://modelscope.cn/datasets/ds4sd/USPTO-30K


Description:

# USPTO-30K

USPTO-30K is the benchmark dataset introduced in [MolGrapher: Graph-based Visual Recognition of Chemical Structures](https://github.com/DS4SD/MolGrapher).

Existing benchmarks for Optical Chemical Structure Recognition have some limitations. Because they were created from only a few documents, they contain batches of very similar molecules. For example, a patent typically displays a molecule together with all the substituents of one particular substructure, resulting in large batches of almost identical molecules. Additionally, the existing sets contain molecules of different kinds, including superatom groups and various Markush features, which should be evaluated independently. In practice, it is important to delimit the types of molecules to which a model can be applied.

We introduce USPTO-30K, a large-scale benchmark dataset of annotated molecule images that overcomes these limitations. It is built from the pairs of images and MolFiles provided by the United States Patent and Trademark Office (USPTO). Each molecule was independently selected among all the available documents from 2001 to 2020. The set consists of three subsets, which decouple the study of clean molecules, molecules with abbreviations, and large molecules:

- USPTO-10K contains 10,000 clean molecules, i.e. without any abbreviated groups.
- USPTO-10K-abb contains 10,000 molecules with superatom groups.
- USPTO-10K-L contains 10,000 clean molecules with more than 70 atoms.

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
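The three subset definitions can be illustrated with a minimal Python sketch that inspects a V2000 MolFile. This is not the authors' selection procedure, which the card does not describe. The function names (`parse_counts_line`, `has_superatom_group`, `guess_subset`) and the `ETHANOL_LIKE` sample are hypothetical, and the sketch assumes that "more than 70 atoms" refers to the atom count on the MolFile counts line and that superatom groups appear as `SUP` Sgroups on `M  STY` lines.

```python
def parse_counts_line(molfile: str) -> tuple[int, int]:
    """Return (atom_count, bond_count) from the counts line of a V2000 MolFile."""
    lines = molfile.splitlines()
    if len(lines) < 4:
        raise ValueError("a MolFile needs a 3-line header followed by a counts line")
    counts = lines[3]
    if "V2000" not in counts:
        raise ValueError("only V2000 MolFiles are handled in this sketch")
    # V2000 uses fixed-width fields: columns 1-3 are atoms, columns 4-6 are bonds.
    return int(counts[0:3]), int(counts[3:6])


def has_superatom_group(molfile: str) -> bool:
    """True if any 'M  STY' line declares an Sgroup of type SUP (superatom)."""
    for line in molfile.splitlines():
        if line.startswith("M  STY"):
            # Tokens after 'M', 'STY' and the entry count alternate: index, type.
            tokens = line.split()[3:]
            if "SUP" in tokens[1::2]:
                return True
    return False


def guess_subset(molfile: str, large_threshold: int = 70) -> str:
    """Assumed mapping from the subset descriptions; not the official rule."""
    atom_count, _ = parse_counts_line(molfile)
    if has_superatom_group(molfile):
        return "USPTO-10K-abb"
    if atom_count > large_threshold:
        return "USPTO-10K-L"
    return "USPTO-10K"


# Hypothetical three-atom sample used only to exercise the functions.
ETHANOL_LIKE = "\n".join([
    "sample",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])

print(guess_subset(ETHANOL_LIKE))
```

The sketch checks for abbreviations before size because the card describes USPTO-10K-L as containing clean molecules only. How the authors treat a large molecule that also contains abbreviations is not stated.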


Provider:

maas

Created:

2025-01-20

Collection summary

Dataset introduction