Document Visibility Graph Threshold Estimation Dataset
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/3984948
Description:
The aim of this dataset is to support the estimation of thresholds used in geometrical algorithms that create Visibility Graphs from document content.
For that purpose, the thresholds listed below are given as optimal values, i.e. the values leading to a maximal Area F1 in Table Region Detection tasks that are based on those thresholds.
Prediction target thresholds are:
x_eps: alignment epsilon for vertical edges in points
y_eps: alignment epsilon for horizontal edges in points
page_ratio_x: maximal relative horizontal distance of two nodes where an edge can be created
page_ratio_y: maximal relative vertical distance of two nodes where an edge can be created
threshold_page_width: maximal width of a node up to which the width is added as an edge condition
width_pct_eps: relative width difference of nodes as a condition for vertical edges
font_eps: font size difference between two nodes in points, also acting as an edge condition
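
The dataset description does not spell out how these thresholds enter the graph construction, so the following is only a minimal sketch of how such edge conditions might be combined. The `Node` and `Thresholds` classes, both function names, the use of the top-left corner for alignment, and the reading of threshold_page_width as a fraction of the page width are all assumptions made for illustration, not the dataset author's implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical text element: bounding box in points plus font size."""
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0


@dataclass
class Thresholds:
    """The seven prediction targets listed above."""
    x_eps: float
    y_eps: float
    page_ratio_x: float
    page_ratio_y: float
    threshold_page_width: float
    width_pct_eps: float
    font_eps: float


def vertical_edge_allowed(a: Node, b: Node, page_w: float, page_h: float,
                          t: Thresholds) -> bool:
    """Assumed conditions for an edge between vertically stacked nodes."""
    # Left borders must be aligned within x_eps points.
    if abs(a.x0 - b.x0) > t.x_eps:
        return False
    # Vertical distance, relative to the page height, must stay small.
    if abs(a.y0 - b.y0) / page_h > t.page_ratio_y:
        return False
    # Font sizes must be similar.
    if abs(a.font_size - b.font_size) > t.font_eps:
        return False
    # Assumption: the width condition only applies to nodes narrower than
    # threshold_page_width, interpreted here as a fraction of the page width.
    max_w = t.threshold_page_width * page_w
    if a.width <= max_w and b.width <= max_w:
        rel_diff = abs(a.width - b.width) / max(a.width, b.width, 1e-9)
        if rel_diff > t.width_pct_eps:
            return False
    return True


def horizontal_edge_allowed(a: Node, b: Node, page_w: float,
                            t: Thresholds) -> bool:
    """Assumed conditions for an edge between nodes on the same line."""
    # Top borders must be aligned within y_eps points.
    if abs(a.y0 - b.y0) > t.y_eps:
        return False
    # Horizontal distance, relative to the page width, must stay small.
    if abs(a.x0 - b.x0) / page_w > t.page_ratio_x:
        return False
    return abs(a.font_size - b.font_size) <= t.font_eps
```

In this sketch every condition acts as a veto, so tightening any single threshold can only remove edges from the graph. Whether the original algorithm combines the conditions this way is not stated in the description.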
Independent variables here are:
font_size_entropy: Shannon entropy of font sizes in a document, related to whether a comparison of font sizes would be meaningful or is frequently present
font_name_entropy: Shannon entropy of font names in a document
bold_pct: percentage of bold texts in a document
italic_pct: percentage of italic texts in a document
x_var: deviation of the coordinate-based horizontal differences between nodes
y_var: deviation of the coordinate-based vertical differences between nodes
avg_width: average width of textual elements
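
As an illustration of the independent variables above, the following sketch computes them from a list of text elements. The element layout (dictionaries with the keys `x0`, `y0`, `width`, `font_size`, `font_name`, `bold`, `italic`), the function names, the 0 to 100 percentage scale, and the reading of x_var and y_var as the population standard deviation of coordinate differences between consecutive elements are assumptions for illustration. The dataset description does not define these details.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def shannon_entropy(values) -> float:
    """Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def consecutive_diff_std(coords) -> float:
    """Population standard deviation of differences between neighbours."""
    diffs = [b - a for a, b in zip(coords, coords[1:])]
    return pstdev(diffs) if diffs else 0.0


def document_features(elements) -> dict:
    """Compute the seven independent variables for one document.

    `elements` is a hypothetical list of dicts, one per text element.
    """
    n = len(elements)
    if n == 0:
        raise ValueError("document has no text elements")
    return {
        "font_size_entropy": shannon_entropy(e["font_size"] for e in elements),
        "font_name_entropy": shannon_entropy(e["font_name"] for e in elements),
        # Assumption: share of elements (not characters), scaled to 0..100.
        "bold_pct": 100.0 * sum(bool(e["bold"]) for e in elements) / n,
        "italic_pct": 100.0 * sum(bool(e["italic"]) for e in elements) / n,
        "x_var": consecutive_diff_std([e["x0"] for e in elements]),
        "y_var": consecutive_diff_std([e["y0"] for e in elements]),
        "avg_width": mean(e["width"] for e in elements),
    }
```

A document set entirely in one font size yields a font_size_entropy of 0, which matches the description's point that comparing font sizes carries little information in that case.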
The corresponding PDF documents used will be referenced in upcoming versions.
Created:
2020-08-14



