TableGraph-24K
Alibaba Cloud Tianchi · Updated 2026-05-13 · Indexed 2024-03-07
Download link:
https://tianchi.aliyun.com/dataset/126227
Description:
Existing table recognition datasets usually contain only a small number of tables or lack logical position annotations for cells. To build a large-scale benchmark for the table graph reconstruction task, we collected data from the TABLE2LATEX-450K dataset and added table graph annotations, producing a new dataset of 350K tables named TableGraph-350K. We split TableGraph-350K into training, validation, and test sets following the split of the original dataset; these sets contain 343,988, 7,420, and 7,359 tables, respectively. The maximum row and column indices of tables in TableGraph-350K are 48 and 27, respectively. Because training a model on the full TableGraph-350K dataset takes considerable time and computational resources, we randomly selected 24K samples from TableGraph-350K to form a subset named TableGraph-24K, which makes the data easier for the academic community to use and compare on. The training, validation, and test sets of TableGraph-24K contain 20,000, 2,000, and 2,000 samples, respectively. The maximum row and column indices of tables in TableGraph-24K are 37 and 21, respectively.
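The random subset selection described above can be illustrated with a minimal Python sketch. The description only states that the 24K samples were chosen at random, so two things here are assumptions: that each subset split is drawn from the matching split of the full dataset, and that samples can be addressed by integer index. The function name `sample_subset` and the seed are hypothetical and are not part of the dataset's tooling.

```python
import random

# Split sizes as reported in the dataset description.
FULL_SPLITS = {"train": 343_988, "val": 7_420, "test": 7_359}
SUBSET_SPLITS = {"train": 20_000, "val": 2_000, "test": 2_000}


def sample_subset(full_splits, subset_splits, seed=0):
    """Draw a random subset of sample indices from each split.

    Assumption: every subset split is sampled from the matching full split.
    The source does not say how the 24K samples were actually drawn.
    """
    rng = random.Random(seed)
    subset = {}
    for name, full_size in full_splits.items():
        # random.sample draws without replacement, so indices are unique.
        picked = rng.sample(range(full_size), subset_splits[name])
        subset[name] = sorted(picked)
    return subset


subset = sample_subset(FULL_SPLITS, SUBSET_SPLITS)
print({name: len(ids) for name, ids in subset.items()})
print("subset total:", sum(SUBSET_SPLITS.values()))
```

Fixing the seed keeps the selection reproducible, which matters if the subset is meant to serve as a shared benchmark.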
Provider:
Alibaba Cloud Tianchi
Created:
2022-04-13
Collected Summary
Dataset Introduction

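Background and Challenges
Background Overview
TableGraph-24K is a dataset of 24,000 tables designed for the table graph reconstruction task, with logical position annotations for cells. It is split into 20,000 training, 2,000 validation, and 2,000 test samples for convenient use by the academic community.
The content above was collected and summarized by 遇见数据集.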



