ChemTables: dataset for table classification in chemical patents
Source: Mendeley Data. Updated: 2021-03-11. Indexed: 2026-04-09.
Download link:
https://data.mendeley.com/datasets/g7tjh7tbrj/2
Description:
Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Patent documents can contain many tables, and individual tables can be very large. Tables in patents also present various types of information, including spectroscopic and physical data, or the pharmacological use and effects of chemicals. Categorising tables by the nature of their content helps readers find tables containing key information, improving the accessibility of patent information that is highly relevant for new inventions. To enable research on methods for automatic table categorisation, we developed a new dataset, called ChemTables, which consists of 788 chemical patent tables labelled with their content type. We also provide a stratified 60:20:20 train/dev/test split, which can be used as a standard split for evaluating methods on the table categorisation task on this dataset.
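To illustrate what a stratified 60:20:20 split means, here is a minimal sketch in plain Python. It is not the procedure used to build the ChemTables split, which ships with the dataset and should be preferred for comparable results. The function name `stratified_split`, the placeholder labels `"A"`, `"B"`, `"C"`, and the class sizes are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split item indices into train/dev/test, preserving label proportions.

    Hypothetical helper for illustration; not part of the ChemTables release.
    """
    rng = random.Random(seed)

    # Group item indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_dev = round(len(indices) * ratios[1])
        # Whatever remains after train and dev goes to test.
        train.extend(indices[:n_train])
        dev.extend(indices[n_train:n_train + n_dev])
        test.extend(indices[n_train + n_dev:])
    return train, dev, test


# Placeholder labels (assumed, not the dataset's real categories or counts).
labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
train, dev, test = stratified_split(labels)
print(len(train), len(dev), len(test))
```

Because each class is split separately, every subset keeps roughly the same class proportions as the full collection, which matters when some table types are much rarer than others.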
Created:
2021-03-11



