Table Extraction PDF Dataset
Source: universe.roboflow.com; updated 2022-11-04; indexed 2025-01-15
Download link:
https://universe.roboflow.com/mohamed-traore-2ekkp/table-extraction-pdf
Resource description:
The dataset comes from [Devashish Prasad](https://github.com/DevashishPrasad), [Ayan Gadpal](https://github.com/ayangadpal), [Kshitij Kapadni](https://github.com/kshitijkapadni), [Manish Visave](https://github.com/ManishDV), and Kavita Sultanpure - creators of [CascadeTabNet](https://github.com/DevashishPrasad/CascadeTabNet).
**Depending on the dataset version downloaded, the images will include annotations for *'borderless' tables*, *'bordered' tables*, and *'cells'*.** Borderless tables are those in which no cell has a border. Bordered tables are those in which every cell has a border and the table itself is bordered. Cells are the individual data points within the table.
A subset of the full dataset, the [ICDAR Table Cells Dataset](https://drive.google.com/drive/folders/19qMDNMWgw04T0HCQ_jADq1OvycF3zvuO), was extracted and imported into Roboflow to create this hosted version of the CascadeTabNet project. All the additional dataset components used in the full project are available here: [All Files](https://drive.google.com/drive/folders/1mNDbbhu-Ubz87oRDjdtLA4BwQwwNOO-G).
## Versions:
1. **Version 1, raw-images** : 342 raw images of tables. No augmentations were applied; the only preprocessing step is Auto-Orient.
2. **Version 2, tableBordersOnly-rawImages** : 342 raw images of tables. This version contains the same images as Version 1, but with [Modify Classes](https://docs.roboflow.com/image-transformations/image-preprocessing#modify-classes) applied to *omit the 'cell' class from all images*, which makes these images suitable for training a model that detects only 'borderless' and 'bordered' tables.
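The effect of omitting the 'cell' class can be approximated locally on an exported annotation file. The following is a minimal sketch, not Roboflow's implementation of Modify Classes: it assumes a COCO-style annotation dictionary and assumes the class names are `bordered`, `borderless`, and `cell`. The helper `drop_class` and the sample data are hypothetical.

```python
def drop_class(coco: dict, class_name: str) -> dict:
    """Return a copy of a COCO-style dict with one class and its boxes removed."""
    # Collect the ids of every category that matches the class to omit.
    drop_ids = {c["id"] for c in coco["categories"] if c["name"] == class_name}
    return {
        **coco,
        "categories": [c for c in coco["categories"] if c["id"] not in drop_ids],
        "annotations": [
            a for a in coco["annotations"] if a["category_id"] not in drop_ids
        ],
    }


# Hypothetical sample: class names, ids, and boxes are illustrative assumptions.
sample = {
    "images": [{"id": 1, "file_name": "page_001.jpg", "width": 800, "height": 600}],
    "categories": [
        {"id": 0, "name": "bordered"},
        {"id": 1, "name": "borderless"},
        {"id": 2, "name": "cell"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 0, "bbox": [50, 40, 700, 300]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [50, 40, 100, 30]},
        {"id": 12, "image_id": 1, "category_id": 2, "bbox": [150, 40, 100, 30]},
    ],
}

tables_only = drop_class(sample, "cell")
print([c["name"] for c in tables_only["categories"]])
print(len(tables_only["annotations"]))
```

The function returns a new dictionary and leaves the input untouched, so the full three-class annotations remain available for training a cell detector.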
For the versions below: a Resize preprocessing step (fit within 416x416, white edges) was added, along with augmentations, to make the images more uniform and to increase the size of the training set. Preprocessing applies to *all* images, whereas augmentations apply only to *training set images*.
3. **Version 3, augmented-FAST-model** : 818 images of tables, including the images generated by 3X augmentation. [Trained from scratch](https://www.loom.com/share/0c909764d6794fadb759b8a58c715323) ([no transfer learning](https://blog.roboflow.com/a-primer-on-transfer-learning/)) with the "Fast" model from [Roboflow Train](https://docs.roboflow.com/train).
4. **Version 4, augmented-ACCURATE-model** : 818 images of tables, including the images generated by 3X augmentation. Trained from scratch with the "Accurate" model from Roboflow Train.
5. **Version 5, tableBordersOnly-augmented-FAST-model** : 818 images of tables, including the images generated by 3X augmentation. 'Cell' class omitted with [Modify Classes](https://docs.roboflow.com/image-transformations/image-preprocessing#modify-classes). Trained from scratch with the "Fast" model from Roboflow Train.
6. **Version 6, tableBordersOnly-augmented-ACCURATE-model** : 818 images of tables, including the images generated by 3X augmentation. 'Cell' class omitted with [Modify Classes](https://docs.roboflow.com/image-transformations/image-preprocessing#modify-classes). Trained from scratch with the "Accurate" model from Roboflow Train.
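The Resize step used in Versions 3 to 6 scales each image to fit inside a 416x416 canvas while preserving its aspect ratio, then fills the remaining area with white. The geometry can be sketched as follows. This is an illustration only, not Roboflow's code: it assumes the scaled image is centered on the canvas and that boxes use the `[x, y, width, height]` convention. The helper names `fit_within` and `map_bbox` are hypothetical.

```python
def fit_within(width: int, height: int, target: int = 416):
    """Compute scale, resized size, and padding for a 'fit within' resize."""
    # A single scale factor preserves the aspect ratio.
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Assumption: leftover space is split evenly (image centered on the canvas).
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def map_bbox(bbox, scale: float, pad_x: int, pad_y: int):
    """Map an [x, y, w, h] box from the original image onto the padded canvas."""
    x, y, w, h = bbox
    return (x * scale + pad_x, y * scale + pad_y, w * scale, h * scale)


# Example: a landscape page of 832x416 pixels.
scale, new_w, new_h, pad_x, pad_y = fit_within(832, 416)
print(scale, new_w, new_h, pad_x, pad_y)  # 0.5 416 208 0 104
print(map_bbox((100, 50, 200, 100), scale, pad_x, pad_y))  # (50.0, 129.0, 100.0, 50.0)
```

Annotation boxes must be transformed with the same scale and offset as the pixels; otherwise the table and cell boxes no longer line up with the resized image.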
Example Image from the Dataset
Cascade TabNet in Action
CascadeTabNet is an automatic table recognition method for interpretation of tabular data in document images. We present an improved deep learning-based end-to-end approach for solving both problems of table detection and structure recognition using a single Convolutional Neural Network (CNN) model. CascadeTabNet is a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects the regions of tables and recognizes the structural body cells from the detected tables at the same time. We evaluate our results on the ICDAR 2013, ICDAR 2019, and TableBank public datasets. We achieved 3rd rank in ICDAR 2019 post-competition results for table detection while attaining the best accuracy results for the ICDAR 2013 and TableBank datasets. We also attain the highest accuracy results on the ICDAR 2019 table structure recognition dataset.
## From the Original Authors:
If you find this work useful for your research, please cite our paper:
@misc{ cascadetabnet2020,
title={CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents},
author={Devashish Prasad and Ayan Gadpal and Kshitij Kapadni and Manish Visave and Kavita Sultanpure},
year={2020},
eprint={2004.12629},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Provided by:
Roboflow



