Cluster-based Table Detection Dataset

Indexed by NIAID Data Ecosystem on 2026-03-11

Download link:

https://zenodo.org/record/3984913


Description:

This data set contains PDF segments (clusters) and document features, together with a label indicating whether a segment is part of a table. The features are:

file: Name of the corresponding PDF file
page: Page on which the cluster is located, starting at 0
bbox: Bounding box of the cluster, stored as (x_0, x_1, y_0, y_1)
text: Removed from the data set because it may be confidential
n_nodes: Number of layout elements in the cluster
approx_size: Approximate number of cells, assuming a tabular cluster structure
tabular_fill_score: Percentage of filled cells, calculated by laying an artificial grid over the cluster and counting the cells that would be filled, relative to the maximum possible number (approx_size)
loop_score: Percentage of loops present in a cluster, relative to the maximum possible number
rectangle_score: Percentage of unique rectangles in a cluster, relative to the maximum possible number (one rectangle per element)
x_sparsity_abs: Average length of horizontal edges in a cluster
x_sparsity_rel: Average length of horizontal edges in a cluster (x_sparsity_abs), relative to the horizontal sparsity of the same page
font_size_entropy: Shannon entropy of the font sizes in a cluster
font_name_entropy: Shannon entropy of the font names in a cluster
bold_pct: Percentage of bold text in a cluster
italic_pct: Percentage of italic text in a cluster
font_size_entropy_doc: Shannon entropy of the font sizes in a document
font_name_entropy_doc: Shannon entropy of the font names in a document
bold_pct_doc: Percentage of bold text in a document
italic_pct_doc: Percentage of italic text in a document
font_size_entropy_diff: Deviation of the cluster's font size entropy (font_size_entropy) from the document-level value (font_size_entropy_doc)
font_name_entropy_diff: Deviation of the cluster's font name entropy (font_name_entropy) from the document-level value (font_name_entropy_doc)
bold_pct_diff: Deviation of the cluster's percentage of bold text (bold_pct) from the document-level value (bold_pct_doc)
italic_pct_doc_diff: Deviation of the cluster's percentage of italic text (italic_pct) from the document-level value (italic_pct_doc)
is_table: Label that is 1 if a cluster contains pure table content and 0 otherwise
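To make the entropy features concrete, here is a minimal Python sketch of how values like font_size_entropy, font_size_entropy_doc and font_size_entropy_diff could be computed. The description does not state the logarithm base or the sign convention of the "deviation", so base 2 and "cluster minus document" are assumptions, and the function names shannon_entropy and entropy_diff are hypothetical; the data set authors may have computed these differently.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (base 2, an assumption) of the empirical
    distribution of the given values, e.g. font sizes or font names."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over distinct values of p * log2(1 / p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_diff(cluster_values, document_values):
    """Deviation of the cluster entropy from the document entropy.
    The sign convention (cluster minus document) is an assumption."""
    return shannon_entropy(cluster_values) - shannon_entropy(document_values)


# Illustrative font sizes only, not taken from the data set.
cluster_font_sizes = [10, 10, 12, 12]
document_font_sizes = [10, 10, 10, 10, 12, 12, 14, 14]

cluster_entropy = shannon_entropy(cluster_font_sizes)
document_entropy = shannon_entropy(document_font_sizes)
diff = entropy_diff(cluster_font_sizes, document_font_sizes)
print(cluster_entropy, document_entropy, diff)
```

A cluster that uses a single font size has entropy 0, while a cluster that mixes many sizes in equal proportion has high entropy. Under the assumed sign convention, a negative difference means the cluster is typographically more uniform than the document as a whole.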

Created:

2020-08-14
