SciBank: A Large Dataset of Annotated Scientific Paper Regions for Document Layout Analysis
Source: DataCite Commons. Updated 2024-11-06. Indexed 2025-04-16.
Download link:
https://ieee-dataport.org/open-access/scibank-large-dataset-annotated-scientific-paper-regions-document-layout-analysis
Resource description:
Document layout analysis (DLA) plays an important role in identifying and classifying the different regions of digital documents in Document Understanding tasks. To support this, SciBank provides a large collection of annotated regions drawn from 74,435 scientific article pages: text (abstract, text blocks, caption, keywords, reference, section, subsection, title), tables, figures, and equations (isolated equations and inline equations). Human curators validated that these 12 region types were properly labeled. Moreover, SciBank offers relevant information from unstructured data, which machine learning models might later use. We aim to complement currently available DLA datasets such as PubLayNet, TableBank, and DocBank. Unlike these publicly available datasets, our main contribution is the inclusion of annotated inline equation regions.
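As an illustration of the 12 region types listed in the description, the following is a minimal Python sketch of how the label taxonomy might be represented and tallied. The label strings, the `REGION_GROUPS` grouping, the `count_regions` helper, and the `{"label": ...}` annotation shape are all assumptions made for this example; the actual identifiers and file format used by SciBank may differ.

```python
from collections import Counter

# Hypothetical label names derived from the description above;
# SciBank's actual label strings may differ.
REGION_GROUPS = {
    "text": [
        "abstract", "text_block", "caption", "keywords",
        "reference", "section", "subsection", "title",
    ],
    "table": ["table"],
    "figure": ["figure"],
    "equation": ["isolated_equation", "inline_equation"],
}

# Flat list of all region labels (12 in total).
ALL_REGIONS = [label for labels in REGION_GROUPS.values() for label in labels]


def count_regions(annotations):
    """Count annotated regions per label, rejecting unknown labels.

    `annotations` is assumed to be an iterable of dicts with a "label" key.
    """
    counts = Counter()
    for ann in annotations:
        label = ann["label"]
        if label not in ALL_REGIONS:
            raise ValueError(f"unknown region label: {label!r}")
        counts[label] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"label": "title"},
        {"label": "inline_equation"},
        {"label": "inline_equation"},
    ]
    print(len(ALL_REGIONS))
    print(count_regions(sample))
```

A per-label tally of this kind is one simple way to check class balance, for example how often inline equations occur relative to isolated ones, before training a layout model.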
Provider:
IEEE DataPort
Created:
2022-03-14



