Table Detection Dataset for Visually rich Documents

Name: Table Detection Dataset for Visually rich Documents
Creator: Simsek, Murat; Kantarci, Burak; Xiao, Bin; Alkheir, Ala Abu
License: No description available

Source: IEEE; indexed 2026-04-17

Download link:

https://ieee-dataport.org/documents/table-detection-dataset-visually-rich-documents

Description:

Table detection has become a fundamental task for visually rich document understanding as the number of electronic documents surges. Several open datasets are widely used in existing studies. However, popular available datasets have inherent limitations, including noisy and inconsistent samples, a limited number of training samples, and a limited number of data sources. These limitations make the datasets unreliable for evaluating model performance, so they cannot reflect the actual capacity of models. Therefore, in this paper, we revisit some open datasets with high-quality annotations, identify and clean the noise, and align the annotation definitions of these datasets to merge them into a larger dataset, termed Open-Tables. Moreover, to enrich the data sources, we propose a new dataset, termed ICT-TD, built from the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets. To ensure label quality, we annotated the dataset manually following the guidance of a domain expert. The proposed dataset has a larger intra-variance and a smaller inter-variance, making it more challenging and representative of actual cases in the business context. We built strong baselines using various state-of-the-art object detection models, and also built baselines in the cross-domain setting. Our experimental results show that the domain differences among existing open datasets are small, even though they have different data sources. Our proposed Open-Tables and ICT-TD are more suitable for the cross-domain setting and can provide more reliable evaluation for models because of their high-quality and consistent annotations. We conduct experiments to discuss the side effects of noise in the open-source datasets. Our experimental results show that, in the cross-domain setting, benchmark models trained with the cleaned Open-Tables dataset can achieve 0.6%-2.6% higher weighted average F1 than the corresponding ones trained with the noisy version of Open-Tables, demonstrating the reliability of the proposed datasets.
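The description reports results as a weighted average F1 but does not spell out how it is computed. The sketch below shows one common convention for table detection, in which F1 is measured at several IoU thresholds and each score is weighted by its threshold. The thresholds (0.6 to 0.9), the threshold-as-weight scheme, the greedy matching, and all function names are assumptions made for illustration, not details confirmed by this listing.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_threshold(preds: List[Box], truths: List[Box], thr: float) -> float:
    """F1 with greedy one-to-one matching (hypothetical helper).

    A predicted table is a true positive when its best unmatched
    ground-truth table overlaps it with IoU >= thr.
    """
    unmatched = list(truths)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    fp = len(preds) - tp
    fn = len(truths) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_avg_f1(
    preds: List[Box],
    truths: List[Box],
    thresholds: Sequence[float] = (0.6, 0.7, 0.8, 0.9),  # assumed values
) -> float:
    """Average of per-threshold F1 scores, each weighted by its IoU threshold."""
    scores = [f1_at_threshold(preds, truths, t) for t in thresholds]
    return sum(t * s for t, s in zip(thresholds, scores)) / sum(thresholds)


# Toy page with two tables: one detected exactly, one detected with IoU 0.85.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 10), (20, 20, 30, 28.5)]
score = weighted_avg_f1(preds, truths)
```

Weighting by threshold rewards tightly localized detections more than loose ones, which matters for tables because a box that clips a row or column loses content downstream.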

Provided by:

Simsek, Murat; Kantarci, Burak; Xiao, Bin; Alkheir, Ala Abu
