IDEA-AI4S/RxnPatent
Source: Hugging Face. Updated 2026-03-05; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/IDEA-AI4S/RxnPatent
---
license: cc-by-nc-nd-4.0
---
# RxnPatent Dataset
The RxnPatent dataset is built for chemical reaction flow diagram parsing. It comprises data extracted from World Intellectual Property Organization (WO) and United States (US) patents in chemistry, pharmaceuticals, and related fields.
## Key Features
* **Language Composition**: The textual data primarily consists of Chinese and English, with a smaller number of samples in Japanese and Korean.
* **Data Format**: Uses the same annotation format as the [RxnScribe](https://github.com/thomas0809/RxnScribe) project. Please refer to its documentation for details.
* **Data Characteristics**:
    * Includes both single-step and multi-step reaction flow diagram images.
    * The `diagram_type` of images is **not** annotated.
    * Contains **negative samples** (images with chemical structures that do *not* depict a reaction) to improve model discrimination.
* **Annotation Quality**: All data annotations were performed manually by professionals holding a Master’s degree in Chemistry, ensuring high accuracy and reliability.
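Because the dataset mixes reaction diagrams with negative samples, a common first step is to separate the two. The following is a minimal sketch, not the authors' loader. It assumes a RxnScribe-style JSON layout with a top-level `images` list whose records carry a `reactions` list; both key names are assumptions that should be checked against the RxnScribe documentation, and `split_by_reaction` and `load_annotations` are hypothetical helper names.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_annotations(path: str | Path) -> dict:
    """Read one annotation file such as train.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def split_by_reaction(annotations: dict) -> tuple[list[dict], list[dict]]:
    """Split image records into (with_reactions, negatives).

    Assumption: records live under a top-level "images" key and each
    may carry a "reactions" list. Neither key name is confirmed by
    the dataset card.
    """
    positives, negatives = [], []
    for record in annotations.get("images", []):
        # A missing or empty "reactions" list is treated as a negative sample.
        if record.get("reactions"):
            positives.append(record)
        else:
            negatives.append(record)
    return positives, negatives
```

A typical call would be `positives, negatives = split_by_reaction(load_annotations("RxnPatent/train.json"))`. Since `diagram_type` is not annotated, the sketch never reads that field.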
## Statistics
| Split | Total Images | Images with Reactions | Total Annotated Reactions |
| :---- | :----------: | :-------------------: | :----------------------: |
| Train | 2,281 | 1,084 | 3,918 |
| Test | 162 | 84 | 272 |
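The table implies a few derived quantities that are useful when balancing training batches. The short sketch below computes them from the published counts. It assumes that every image without an annotated reaction is a negative sample, so negatives equal total images minus images with reactions; `summarize` is a hypothetical helper name.

```python
# Counts copied from the statistics table above.
SPLITS = {
    "train": {"images": 2281, "with_reactions": 1084, "reactions": 3918},
    "test": {"images": 162, "with_reactions": 84, "reactions": 272},
}


def summarize(split: dict) -> dict:
    """Derive negative-sample counts and reaction density for one split."""
    # Assumption: images without reactions are exactly the negative samples.
    negatives = split["images"] - split["with_reactions"]
    return {
        "negatives": negatives,
        "negative_share": negatives / split["images"],
        "reactions_per_reaction_image": split["reactions"] / split["with_reactions"],
    }


for name, counts in SPLITS.items():
    stats = summarize(counts)
    print(
        f"{name}: {stats['negatives']} negatives "
        f"({stats['negative_share']:.1%}), "
        f"{stats['reactions_per_reaction_image']:.2f} reactions per reaction image"
    )
```

Under that assumption, roughly half of each split consists of negative samples, and images that do contain reactions carry a little over three annotated reactions on average.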
## File Structure
RxnPatent/
├── images/ # Directory containing all patent images
├── train.json # Annotation file for the training set
└── test.json # Annotation file for the test set
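After downloading, it can help to confirm that every image referenced by an annotation file is present under `images/`. This is a minimal sketch under stated assumptions: the `images` and `file_name` keys follow a RxnScribe-style layout and are not confirmed by the dataset card, file names are assumed to be relative to `images/`, and `missing_images` is a hypothetical helper name.

```python
from __future__ import annotations

import json
from pathlib import Path


def missing_images(root: str | Path, split: str = "train",
                   name_key: str = "file_name") -> list[str]:
    """Return referenced image names that are absent from <root>/images/.

    Assumption: <root>/<split>.json holds a top-level "images" list and
    each record stores its image name under `name_key`.
    """
    root = Path(root)
    with open(root / f"{split}.json", encoding="utf-8") as handle:
        annotations = json.load(handle)

    missing = []
    for record in annotations.get("images", []):
        name = record.get(name_key)
        # Skip records without a name; report names with no matching file.
        if name is not None and not (root / "images" / name).is_file():
            missing.append(name)
    return missing
```

Calling `missing_images("RxnPatent", split="test")` returns an empty list when the download is complete under these assumptions.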
Provider:
IDEA-AI4S



