CTE
arXiv, updated 2023-02-14, indexed 2024-06-21
Download link:
https://github.com/AILab-UniFI/cte-dataset
Resource overview:
The CTE dataset was created by the Department of Information Engineering at the University of Florence. It provides comprehensive annotations for 75,000 pages of scientific papers containing more than 35,000 tables. The pages come from PubMed Central, and the annotations combine those of the PubTables-1M and PubLayNet datasets. The dataset supports several tasks: document layout analysis, table detection, structure recognition, and functional analysis. To build it, PyMuPDF was used to extract text and its position from the PDF files, and the extracted content was then classified according to the region annotations. CTE is particularly well suited to graph neural network approaches. It targets the automatic extraction and analysis of table information in scientific literature and supports the development of end-to-end processing systems.
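The classification step described above can be illustrated with a minimal sketch. It is not the dataset authors' code: it assumes words arrive as `(x0, y0, x1, y1, text)` tuples (PyMuPDF's `page.get_text("words")` returns tuples that begin with these five fields), and the label names, the `label_words` helper, and the 0.5 overlap threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]      # (x0, y0, x1, y1)
Word = Tuple[float, float, float, float, str]
Region = Tuple[str, Box]                     # (label, bounding box)


def overlap_fraction(inner: Box, outer: Box) -> float:
    """Return the fraction of `inner`'s area that lies inside `outer`."""
    ix0, iy0 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix1, iy1 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def label_words(
    words: List[Word],
    regions: List[Region],
    min_overlap: float = 0.5,   # assumed threshold, not taken from the source
    default: str = "other",     # hypothetical fallback label
) -> List[Tuple[str, str]]:
    """Assign each word the label of the region that covers it best."""
    labeled = []
    for x0, y0, x1, y1, text in words:
        best_label, best_score = default, 0.0
        for label, region_box in regions:
            score = overlap_fraction((x0, y0, x1, y1), region_box)
            if score > best_score:
                best_label, best_score = label, score
        # Fall back to the default label when no region covers enough of the word.
        labeled.append((text, best_label if best_score >= min_overlap else default))
    return labeled


# Toy page: two annotated regions and three words (all values are made up).
regions = [("table_caption", (0, 0, 100, 25)), ("table", (0, 40, 100, 120))]
words = [
    (10, 10, 40, 20, "Table"),
    (12, 50, 30, 60, "0.91"),
    (200, 300, 240, 310, "Conclusion"),
]
print(label_words(words, regions))
```

Labeling by overlap fraction rather than strict containment tolerates small misalignments between extracted word boxes and annotated regions, which is a common choice when merging annotations from different sources.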
Provider:
Department of Information Engineering, University of Florence
Created:
2023-02-03



