
Document Visibility Graph Threshold Estimation Dataset

Indexed from NIAID Data Ecosystem on 2026-03-11
Download link:
https://zenodo.org/record/3984948
Description:
The aim of this dataset is to help estimate the thresholds used in geometrical algorithms that create Visibility Graphs from document content. The thresholds listed below are optimal values: they lead to a maximal Area F1 for Table Region Detection tasks that are based on those thresholds.

Prediction target thresholds are:

x_eps: alignment epsilon for vertical edges, in points
y_eps: alignment epsilon for horizontal edges, in points
page_ratio_x: maximal relative horizontal distance between two nodes at which an edge can be created
page_ratio_y: maximal relative vertical distance between two nodes at which an edge can be created
threshold_page_width: maximal node width up to which the width is added as an edge condition
width_pct_eps: relative width difference between nodes, used as a condition for vertical edges
font_eps: font size difference between two nodes, in points, also used as an edge condition

Independent variables are:

font_size_entropy: Shannon entropy of font sizes in a document, related to whether a comparison of font sizes would be meaningful or is frequently present
font_name_entropy: Shannon entropy of font names in a document
bold_pct: percentage of bold texts in a document
italic_pct: percentage of italic texts in a document
x_var: deviation of the coordinate-based horizontal differences between nodes
y_var: deviation of the coordinate-based vertical differences between nodes
avg_width: average width of textual elements

The corresponding PDF documents will be referenced in upcoming versions.
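To make the entropy and percentage variables described above more concrete, the following is a minimal sketch of how font_size_entropy, font_name_entropy, bold_pct and italic_pct could be computed from a list of text elements. It is not the dataset authors' code. The tuple layout of `elements`, the helper names `shannon_entropy` and `share`, the use of base-2 logarithms, and the choice to express percentages as fractions between 0 and 1 are all assumptions made for illustration; the dataset description does not specify them.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`.

    Assumption: base-2 logarithm; the dataset does not state the base.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def share(flags):
    """Fraction of truthy entries in `flags`; 0.0 for empty input."""
    flags = list(flags)
    return sum(1 for f in flags if f) / len(flags) if flags else 0.0


# Hypothetical text elements: (font_size, font_name, is_bold, is_italic).
elements = [
    (10.0, "Times", False, False),
    (10.0, "Times", False, False),
    (10.0, "Times", True, False),
    (14.0, "Helvetica", True, False),
]

features = {
    "font_size_entropy": shannon_entropy(e[0] for e in elements),
    "font_name_entropy": shannon_entropy(e[1] for e in elements),
    "bold_pct": share(e[2] for e in elements),      # assumed scale: 0..1
    "italic_pct": share(e[3] for e in elements),    # assumed scale: 0..1
}

for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

A document that uses a single font size yields an entropy of zero, in which case a font-size threshold such as font_eps carries little information; higher entropy indicates that font sizes vary enough for a comparison to matter.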
Created:
2020-08-14