Public Affairs Layout (PAL) Database
arXiv | Updated 2023-08-08 | Indexed 2024-06-21
Download link:
https://github.com/BiDAlab/PALdb
Description:
The Public Affairs Layout (PAL) database is a dataset for document layout analysis in the public affairs domain, developed by the Biometrics and Data Analysis Laboratory at the Autonomous University of Madrid. It contains 37,910 documents drawn from 24 legislative sources of Spanish administrative institutions, totaling more than 441,000 pages and 8 million layout labels. The labels were produced by semi-automatic annotation of digital documents and cover 4 basic layout blocks and 4 text categories. Beyond layout analysis, PAL provides a large body of text in Spanish and four other official languages that can be used for natural language processing pre-training and domain adaptation. The dataset targets the problem of automatically processing digital PDF documents, in particular identifying the different components of a document and extracting the information they contain.
Provided by:
Biometrics and Data Analysis Laboratory (BiDA-Lab), Autonomous University of Madrid (UAM)
Created:
2023-06-12



