ChEMU dataset for information extraction from chemical patents

Mendeley Data | Updated 2024-01-31 | Indexed 2024-06-26

Download link:

https://data.mendeley.com/datasets/wy6745bjfj

Description:

The discovery of new chemical compounds and their synthesis process is of great importance to the chemical industry. Patent documents contain critical and timely information about newly discovered chemical compounds, providing a rich resource for chemical research in both academia and industry. Chemical patents are often the initial venues where a new chemical compound is disclosed. Only a small proportion of chemical compounds are ever published in journals and these publications can be delayed by up to 3 years after the patent disclosure. In addition, chemical patent documents usually contain unique information, such as reaction steps and experimental conditions for compound synthesis and mode of action. These details are crucial for the understanding of compound prior art, and provide a means for novelty checking and validation. Due to the high volume of chemical patents, approaches that enable automatic information extraction from these patents are in demand. To develop natural language processing methods for large-scale mining of chemical information from patent texts, a corpus is created providing chemical patent snippets and annotated entities and reaction steps.

Created:

2024-01-31

Background overview

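The ChEMU dataset is a text-mining resource for information extraction from chemical patents. It is intended to support the development of natural language processing methods for large-scale mining of chemical information from patent text. The dataset contains chemical patent snippets annotated with entities and reaction steps, and suits information extraction tasks such as named entity recognition and reaction step analysis. It is released under the CC BY NC 3.0 license and includes training and development sets along with detailed annotation guidelines, for use in academic and industrial research.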

The summary above was collected and generated by 遇见数据集.
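
To make "patent snippets annotated with entities" more concrete, here is a minimal Python sketch of how character-offset entity annotations over a snippet could be represented and read back. Everything in it is an assumption for illustration: the `EntitySpan` class, the `span_of` and `extract_mentions` helpers, the example sentence, and the labels `SOLVENT`, `TEMPERATURE`, and `TIME` are hypothetical and are not taken from the dataset files or its annotation guidelines, which should be consulted for the actual format and label set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A hypothetical annotation: a labeled character range in a snippet."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def span_of(text: str, phrase: str, label: str) -> EntitySpan:
    """Build a span for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return EntitySpan(start, start + len(phrase), label)


def extract_mentions(text: str, spans: list[EntitySpan]) -> list[tuple[str, str]]:
    """Return (label, surface text) pairs ordered by position in the snippet."""
    ordered = sorted(spans, key=lambda s: s.start)
    return [(s.label, text[s.start:s.end]) for s in ordered]


# Invented example sentence and illustrative labels (not from the corpus).
snippet = "The mixture was stirred in methanol at room temperature for 2 hours."
spans = [
    span_of(snippet, "2 hours", "TIME"),
    span_of(snippet, "methanol", "SOLVENT"),
    span_of(snippet, "room temperature", "TEMPERATURE"),
]

for label, mention in extract_mentions(snippet, spans):
    print(f"{label}\t{mention}")
```

Storing offsets rather than copied strings keeps each annotation tied to its exact position in the snippet, which is what a named entity recognition model is typically trained and scored against.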
