TableGraph-24K

Name: TableGraph-24K
Creator: Alibaba Cloud Tianchi (阿里云天池)
Published: 2026-05-13 20:04:43
License: No description available

Alibaba Cloud Tianchi (阿里云天池) | Updated 2026-05-13 | Indexed 2024-03-07

Download link:

https://tianchi.aliyun.com/dataset/126227

Description:

Existing table recognition datasets usually contain only a small number of tables or lack logical-position annotations for cells. To build a large-scale benchmark for the table graph reconstruction task, we collected data from the TABLE2LATEX-450K dataset and added table graph annotations, producing a new dataset of 350K tables named TableGraph-350K. We split TableGraph-350K into training, validation, and test sets using the split of the original dataset; these sets contain 343,988, 7,420, and 7,359 tables, respectively. The maximum row and column indices of tables in TableGraph-350K are 48 and 27, respectively. Because training a model on the full TableGraph-350K dataset takes considerable time and computational resources, we randomly selected 24K samples from TableGraph-350K to form a subset, TableGraph-24K, that is easier for the academic community to work with. Its training, validation, and test sets contain 20,000, 2,000, and 2,000 samples, respectively. The maximum row and column indices of tables in TableGraph-24K are 37 and 21, respectively.
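The random subsampling described above can be illustrated with a minimal sketch. The split sizes are taken from the description; everything else is an assumption for illustration only: the description does not say whether each TableGraph-24K split was drawn from the matching TableGraph-350K split, and the function name `sample_subset_indices` and the seed are hypothetical.

```python
import random

# Split sizes as reported in the dataset description.
FULL_SPLITS = {"train": 343_988, "val": 7_420, "test": 7_359}
SUBSET_SPLITS = {"train": 20_000, "val": 2_000, "test": 2_000}


def sample_subset_indices(full_sizes, subset_sizes, seed=0):
    """Draw indices without replacement for each split.

    Assumption: every subset split is sampled from the split of the
    same name in the full dataset. The seed is hypothetical; the
    source does not give one.
    """
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(range(full_sizes[name]), subset_sizes[name]))
        for name in full_sizes
    }


subset = sample_subset_indices(FULL_SPLITS, SUBSET_SPLITS)
full_total = sum(FULL_SPLITS.values())      # total tables across the full splits
subset_total = sum(SUBSET_SPLITS.values())  # total samples across the subset splits
print(full_total, subset_total)  # prints "358767 24000"
```

The three reported full-split sizes sum to 358,767, so "350K" in the dataset name is a rounded figure, while the subset splits sum to exactly 24,000.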

Provider:

Alibaba Cloud Tianchi (阿里云天池)

Created:

2022-04-13

Collection summary

Dataset introduction