ChEMU dataset for information extraction from chemical patents

Mendeley Data. Updated 2024-01-31; indexed 2024-06-26.

Download link:

https://data.mendeley.com/datasets/wy6745bjfj


Description:

The discovery of new chemical compounds and their synthesis processes is of great importance to the chemical industry. Patent documents contain critical and timely information about newly discovered chemical compounds, providing a rich resource for chemical research in both academia and industry. Chemical patents are often the initial venues where a new chemical compound is disclosed. Only a small proportion of chemical compounds are ever published in journals, and these publications can be delayed by up to 3 years after the patent disclosure. In addition, chemical patent documents usually contain unique information, such as reaction steps and experimental conditions for compound synthesis and mode of action. These details are crucial for understanding compound prior art, and they provide a means for novelty checking and validation. Due to the high volume of chemical patents, approaches that enable automatic information extraction from these patents are in demand. To support the development of natural language processing methods for large-scale mining of chemical information from patent texts, a corpus was created that provides chemical patent snippets together with annotated entities and reaction steps.

Created:

2024-01-31

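AI-collected summary

Dataset introduction

Background and challenges

Background overview

The ChEMU dataset is a text-mining resource dedicated to extracting information from chemical patents. It is intended to support the development of natural language processing methods for large-scale mining of chemical information from patent text. The dataset contains chemical patent snippets annotated with entities and reaction steps, and it suits information extraction tasks such as named entity recognition and reaction step analysis. It is released under the CC BY NC 3.0 license and includes training and development sets along with detailed annotation guidelines, making it suitable for both academic and industrial research.

The content above was collected and summarized by AI.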
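
The description says the corpus provides patent snippets with annotated entities and reaction steps. The listing does not state the file format, so the sketch below only shows how such entity annotations might be read, assuming they are stored as BRAT-style standoff lines (`ID<TAB>LABEL START END<TAB>TEXT`). The `Entity` class, the `parse_entities` helper, the sample sentence, and the labels `SOLVENT` and `TIME` are all hypothetical and invented for illustration; they are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    ident: str   # annotation id, e.g. "T1"
    label: str   # entity type (illustrative labels only)
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    text: str    # surface text covered by the span


def parse_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound ("T") lines of a BRAT-style standoff annotation.

    Assumption: each span is contiguous. Discontinuous spans
    (offsets joined by ';') are not handled in this sketch.
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip anything that is not a text-bound annotation
        ident, span, text = line.split("\t", 2)
        label, start, end = span.split(" ")
        entities.append(Entity(ident, label, int(start), int(end), text))
    return entities


# Hypothetical snippet and annotations, not taken from the corpus.
SNIPPET = "The mixture was stirred in methanol for 2 h."
ANN = "T1\tSOLVENT 27 35\tmethanol\nT2\tTIME 40 43\t2 h\n"

for entity in parse_entities(ANN):
    # Each offset pair should slice the snippet back to the annotated text.
    assert SNIPPET[entity.start:entity.end] == entity.text
    print(entity.label, repr(entity.text))
```

Checking each offset pair against the source text, as the loop does, is a cheap way to catch misaligned annotations before training a named entity recognition model.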
