Cluster-based Table Detection Dataset
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/3984913
Description:
This dataset contains PDF segments (clusters) and document features, each with a label indicating whether the segment is part of a table.
The features are:
file: Corresponding PDF file name
page: Page where the cluster is located, starting with 0
bbox: Bounding box of the cluster, stored as (x_0,x_1,y_0,y_1)
text: Removed from the published data because it may be confidential.
n_nodes: Number of layout elements in the cluster
approx_size: Approximate number of cells when assuming a tabular cluster structure
tabular_fill_score: Percentage of filled cells. This is calculated by building an artificial grid over the cluster and taking the number of cells that would be filled relative to the maximum possible number, namely approx_size.
loop_score: Percentage of loops present in a cluster, relative to the maximal possible number.
rectangle_score: Percentage of unique rectangles in a cluster, relative to the maximal possible number (which would be one rectangle per element)
x_sparsity_abs: Average length of horizontal edges in a cluster
x_sparsity_rel: Average length of horizontal edges in a cluster (i.e. x_sparsity_abs), relative to the horizontal sparsity of the same page
font_size_entropy: Shannon entropy of the font sizes in a cluster
font_name_entropy: Shannon entropy of the font names in a cluster
bold_pct: Percentage of bold texts in a cluster
italic_pct: Percentage of italic texts in a cluster
font_size_entropy_doc: Shannon entropy of the font sizes in a document
font_name_entropy_doc: Shannon entropy of the font names in a document
bold_pct_doc: Percentage of bold texts in a document
italic_pct_doc: Percentage of italic texts in a document
font_size_entropy_diff: Deviation of the font size entropy of a cluster (i.e. font_size_entropy) compared to the corresponding measurement on document-level (i.e. font_size_entropy_doc)
font_name_entropy_diff: Deviation of the font name entropy of a cluster (i.e. font_name_entropy) compared to the corresponding measurement on document-level (i.e. font_name_entropy_doc)
bold_pct_diff: Deviation of the percentage of bold texts in a cluster (i.e. bold_pct) compared to the corresponding measurement on document-level (i.e. bold_pct_doc)
italic_pct_doc_diff: Deviation of the percentage of italic texts in a cluster (i.e. italic_pct) compared to the corresponding measurement on document-level (i.e. italic_pct_doc)
is_table: Label that is 1 if the cluster contains pure table content and 0 otherwise
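
The entropy, difference, and fill-score features above can be illustrated with a minimal Python sketch. It is not the dataset authors' implementation. The function names, the base-2 logarithm, the signed difference (cluster minus document) used for the *_diff features, and the sample font sizes are all assumptions made for illustration; the description does not specify them.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy of the empirical distribution of `values`.

    Assumption: base-2 logarithm (bits). The dataset description does not
    state which base was used.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over distinct values of p * log2(1 / p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def tabular_fill_score(n_filled, approx_size):
    """Share of filled grid cells relative to approx_size.

    Assumption: returned as a fraction in [0, 1]; the dataset may store it
    on a different scale.
    """
    if approx_size <= 0:
        return 0.0
    return n_filled / approx_size


# Hypothetical font sizes, made up for this example.
cluster_font_sizes = [9.0, 9.0, 9.0, 9.0]
document_font_sizes = [9.0, 9.0, 12.0, 12.0]

font_size_entropy = shannon_entropy(cluster_font_sizes)       # 0.0
font_size_entropy_doc = shannon_entropy(document_font_sizes)  # 1.0

# Assumption: "deviation" is a signed difference, cluster minus document.
font_size_entropy_diff = font_size_entropy - font_size_entropy_doc  # -1.0

# A cluster whose artificial grid has 8 cells, 6 of them filled.
fill = tabular_fill_score(6, 8)  # 0.75
```

In this made-up example the cluster uses a single font size while the document uses two equally often, so the cluster entropy is lower than the document entropy and the difference is negative under the assumed sign convention.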
Created:
2020-08-14



