Microsoft Table Dataset: TableBank
Indexed by Payititi on 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-1880.html
Resource description:
TableBank is an image-based table detection and recognition dataset containing 417K high-quality labeled tables. It is built with a novel weak-supervision approach from Word and LaTeX documents collected from the Internet.
To address the need for a standard open-domain table benchmark, we propose a novel weak-supervision method to create TableBank automatically. The result is several orders of magnitude larger than existing human-labeled datasets for table analysis. Unlike traditional weakly supervised training sets, the data obtained this way is not only large-scale but also high-quality.
The Internet holds vast numbers of electronic documents, such as Microsoft Word (.docx) and LaTeX (.tex) files, and the source of these documents naturally contains markup tags for tables. Intuitively, we can manipulate this source to add bounding boxes using the markup language of each document. For Word documents, we can modify the internal Office XML, in which the borders of each table are identified. For LaTeX documents, we can likewise modify the .tex source so that the bounding boxes of tables can be recognized. In this way, high-quality labeled data can be created for a variety of domains, such as business documents, official forms, and research papers, which is highly beneficial for large-scale table analysis tasks.
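As a rough illustration of the Word case, the sketch below uses Python's standard xml.etree.ElementTree module to find every w:tbl element in a WordprocessingML fragment and give it a distinctly colored outer border that could later be located in the rendered page. This is not the authors' pipeline: the function name mark_tables, the red border color, and the sample XML are assumptions made for the example, and a real .docx file would first have to be unzipped to reach its word/document.xml part.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag: str) -> str:
    """Return the namespace-qualified name for a WordprocessingML tag."""
    return f"{{{W_NS}}}{tag}"


def mark_tables(document_xml: str, color: str = "FF0000") -> tuple[str, int]:
    """Give every table an outer border in `color`; return (new XML, table count).

    Hypothetical helper for illustration only. It appends missing elements
    without enforcing the schema's child-element order, which a production
    tool would need to respect.
    """
    root = ET.fromstring(document_xml)
    # Collect the tables first so the tree is not modified while iterating.
    tables = list(root.iter(w("tbl")))
    for tbl in tables:
        tbl_pr = tbl.find(w("tblPr"))
        if tbl_pr is None:
            # Table properties are the first child of w:tbl.
            tbl_pr = ET.Element(w("tblPr"))
            tbl.insert(0, tbl_pr)
        borders = tbl_pr.find(w("tblBorders"))
        if borders is None:
            borders = ET.SubElement(tbl_pr, w("tblBorders"))
        for side in ("top", "left", "bottom", "right"):
            edge = borders.find(w(side))
            if edge is None:
                edge = ET.SubElement(borders, w(side))
            edge.set(w("val"), "single")
            edge.set(w("sz"), "8")  # border width in eighths of a point
            edge.set(w("color"), color)
    return ET.tostring(root, encoding="unicode"), len(tables)


# Made-up document fragment with one paragraph and one single-cell table.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "</w:body></w:document>"
)

if __name__ == "__main__":
    marked_xml, n_tables = mark_tables(SAMPLE)
    print(n_tables, "table(s) marked")
```

Once the modified document is rendered to an image, the colored rectangle gives the table's bounding box without any manual labeling, which is the core idea behind the weak supervision described above.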
The TableBank dataset contains a total of 417,234 high-quality annotated tables along with their original source documents across various domains.
The goal of table detection is to locate the tables in a document using bounding boxes. Given a document page in image format, the task is to produce bounding boxes that mark the positions of the tables on that page.
The goal of table structure recognition is to identify the row and column layout of a table, especially in non-digital document formats such as scanned images. Given a table in image format, the task is to generate an HTML tag sequence that represents the arrangement of rows and columns as well as the type of each table cell.
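To make the target of structure recognition concrete, here is a minimal sketch of how a table could be flattened into such a tag sequence. The token names (<table>, <tr>, <td_filled>, <td_empty>) and the choice of "filled versus empty" as the cell type are assumptions for illustration; the actual TableBank vocabulary may differ.

```python
from typing import List, Optional


def structure_tokens(grid: List[List[Optional[str]]]) -> List[str]:
    """Encode a table's layout as structure tokens, ignoring the cell text."""
    tokens = ["<table>"]
    for row in grid:
        tokens.append("<tr>")
        for cell in row:
            has_text = cell is not None and cell.strip() != ""
            # Hypothetical cell-type tokens: one for filled cells, one for empty ones.
            tokens.append("<td_filled>" if has_text else "<td_empty>")
        tokens.append("</tr>")
    tokens.append("</table>")
    return tokens


if __name__ == "__main__":
    example = [["Name", "Score"], ["Ada", ""]]
    print(" ".join(structure_tokens(example)))
```

For the two-by-two example, the sequence is <table> <tr> <td_filled> <td_filled> </tr> <tr> <td_filled> <td_empty> </tr> </table>. An image-to-text model is trained to emit a sequence of this kind directly from the table image.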
To verify the effectiveness of TableBank, we built several strong baselines using state-of-the-art models with end-to-end deep neural networks. The table detection model is based on the Faster R-CNN [Ren et al., 2015] architecture with different settings. The table structure recognition model is based on an image-to-text encoder-decoder framework.
For evaluating table detection, we sampled 18,000 document images from Word and LaTeX documents, with 10,000 images used for validation and 8,000 for testing. Each sampled image contains at least one table. Meanwhile, we also evaluated our models on the ICDAR 2013 dataset to verify the effectiveness of TableBank. For evaluating table structure recognition, we sampled 15,000 table images from Word and LaTeX documents, with 10,000 images used for validation and 5,000 for testing.
For table detection, we compute precision, recall, and F1 as described in the paper: the metrics over all documents are obtained by summing the areas of overlap, prediction, and ground truth. For table structure recognition, we use the 4-gram BLEU score with a single reference as the evaluation metric.
We train our detection models on TableBank using the open-source framework Detectron2 [Wu et al., 2019], a high-quality, high-performance codebase for object detection research that supports many state-of-the-art algorithms. For this task, we use the Faster R-CNN algorithm with ResNeXt [Xie et al., 2016] as the backbone network, with parameters pre-trained on the ImageNet dataset. All baselines are trained with data-parallel synchronous SGD on 4 NVIDIA V100 GPUs, with a mini-batch size of 20 images. For other parameters, we use the Detectron2 defaults. During testing, the confidence threshold for generating bounding boxes is set to 90%.
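Both metrics can be sketched in a few lines of Python. The code below is an illustrative reading of the description, not the authors' evaluation script: detection_scores sums intersection, predicted, and ground-truth areas over all pages and assumes that boxes within one set do not overlap each other, and bleu4 is a plain sentence-level 4-gram BLEU with a brevity penalty and no smoothing. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def detection_scores(pages: Sequence[Tuple[List[Box], List[Box]]]) -> Dict[str, float]:
    """Area-based scores; each page is (predicted boxes, ground-truth boxes)."""
    overlap = pred_area = gt_area = 0.0
    for preds, gts in pages:
        # Assumes boxes inside each set are disjoint, so pairwise sums are exact.
        overlap += sum(intersection(p, g) for p in preds for g in gts)
        pred_area += sum(area(p) for p in preds)
        gt_area += sum(area(g) for g in gts)
    precision = overlap / pred_area if pred_area else 0.0
    recall = overlap / gt_area if gt_area else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Unsmoothed 4-gram BLEU of one candidate against a single reference."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty n-gram match gives zero
        log_sum += math.log(clipped / total) / 4
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(log_sum)


if __name__ == "__main__":
    print(detection_scores([([(0, 0, 10, 10)], [(0, 0, 10, 5)])]))
    tags = "<table> <tr> <td> <td> </tr> </table>".split()
    print(bleu4(tags, tags))
```

Summing areas before dividing means large tables weigh more than small ones, unlike a per-box IoU threshold. Translation toolkits usually report corpus-level BLEU, so scores from this sentence-level sketch are not directly comparable to published numbers.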
For table structure recognition, we use the open-source framework OpenNMT [Klein et al., 2017] to train the image-to-text model. OpenNMT is primarily designed for neural machine translation and supports many encoder-decoder frameworks. In this task, we use the image-to-text method in OpenNMT to train our model. This model is also trained on 4 NVIDIA V100 GPUs, with a learning rate of 1 and a batch size of 24. For other parameters, we use the default values in OpenNMT.
The trained models are available for download in the TableBank Model Zoo. **Please DO NOT re-distribute our data.** If you use the corpus in published work, please cite it as described in the "Paper and Citation" section. The annotations and original document images of the TableBank dataset can be downloaded from the TableBank dataset homepage. https://arxiv.org/abs/1903.01949
Provider:
Payititi
Collected summary
Dataset introduction

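Background and challenges
Background overview
TableBank is an image-based table detection and recognition dataset containing 417K high-quality labeled tables, built with a weakly supervised method from Word and LaTeX documents. It supports table detection and table structure recognition, and it covers many domains, such as business documents and research papers.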
The summary above was collected and generated by 遇见数据集.



