ds4sd/PubTabNet_OTSL
Source: Hugging Face. Updated: 2023-08-31. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ds4sd/PubTabNet_OTSL
Resource description:
---
license: other
pretty_name: PubTabNet-OTSL
size_categories:
- 10K<n<100K
tags:
- table-structure-recognition
- table-understanding
- PDF
task_categories:
- object-detection
- table-to-text
---
# Dataset Card for PubTabNet_OTSL
## Dataset Description
- **Homepage:** https://ds4sd.github.io
- **Paper:** https://arxiv.org/pdf/2305.03393
### Dataset Summary
This dataset is a conversion of the original [PubTabNet](https://developer.ibm.com/exchanges/data/all/pubtabnet/) into the OTSL format presented in our paper "Optimized Table Tokenization for Table Structure Recognition". It includes the original annotations alongside the new additions.
### Dataset Structure
* cells: original dataset cell ground truth (content).
* otsl: new reduced table structure token format.
* html: original dataset ground truth HTML (structure).
* html_restored: HTML generated from OTSL.
* cols: grid column length.
* rows: grid row length.
* image: PIL image.
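The relationship between the `otsl`, `rows`, and `cols` fields can be illustrated with a minimal sketch. It assumes that `otsl` is a flat list of token strings in which every grid row is terminated by `"nl"`, and that `rows` and `cols` hold the number of grid rows and columns; neither assumption is stated explicitly in this card. The helper names `otsl_to_grid` and `shape_matches` and the sample record are hypothetical.

```python
from typing import Dict, List


def otsl_to_grid(tokens: List[str]) -> List[List[str]]:
    """Split a flat OTSL token list into rows, using "nl" as the row terminator."""
    grid: List[List[str]] = []
    row: List[str] = []
    for tok in tokens:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a missing trailing "nl"
        grid.append(row)
    return grid


def shape_matches(record: Dict) -> bool:
    """Check that the OTSL grid agrees with the record's `rows` and `cols` fields.

    Assumption: `rows` and `cols` are the grid's row and column counts.
    """
    grid = otsl_to_grid(record["otsl"])
    return len(grid) == record["rows"] and all(
        len(r) == record["cols"] for r in grid
    )


# Hypothetical record for illustration only; not taken from the dataset.
sample = {
    "otsl": ["fcel", "lcel", "nl", "fcel", "ecel", "nl"],
    "rows": 2,
    "cols": 2,
}
print(shape_matches(sample))
```

A check like this may be useful as a sanity test when iterating over records, since every OTSL row is expected to have the same number of tokens.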
### OTSL Vocabulary
**OTSL**: new reduced table structure token format.
More information on the OTSL table structure format and its concepts can be found in our paper.
The format used in this dataset extends the work presented in the paper and introduces slight modifications:
* "fcel" - cell that has content in it
* "ecel" - cell that is empty
* "lcel" - left-looking cell (to handle horizontally merged cells)
* "ucel" - up-looking cell (to handle vertically merged cells)
* "xcel" - 2D span cell; in this dataset it covers the entire area of a merged cell
* "nl" - new line token
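As a rough illustration of how these tokens encode merged cells, the sketch below recovers row and column spans from an OTSL sequence. It is not the authors' conversion code. It assumes that `"fcel"` and `"ecel"` mark the top-left origin of a cell, that `"lcel"` or `"xcel"` to the right of an origin extend its column span, and that `"ucel"` or `"xcel"` below an origin extend its row span. This reading may over-count when an unrelated 2D merged cell is adjacent, so the paper remains the reference for the exact rules. The function name `otsl_spans` is hypothetical.

```python
from typing import List, Tuple

ORIGIN = {"fcel", "ecel"}  # tokens that start a cell
RIGHT = {"lcel", "xcel"}   # assumed to continue a cell to the right
DOWN = {"ucel", "xcel"}    # assumed to continue a cell downward


def otsl_spans(tokens: List[str]) -> List[Tuple[int, int, int, int]]:
    """Return (row, col, rowspan, colspan) for every origin cell."""
    # Rebuild the 2D grid; each "nl" token closes one row.
    grid: List[List[str]] = []
    row: List[str] = []
    for tok in tokens:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:
        grid.append(row)

    spans: List[Tuple[int, int, int, int]] = []
    for r, line in enumerate(grid):
        for c, tok in enumerate(line):
            if tok not in ORIGIN:
                continue
            # Walk right over continuation tokens to get the column span.
            colspan = 1
            while c + colspan < len(line) and line[c + colspan] in RIGHT:
                colspan += 1
            # Walk down the same column to get the row span.
            rowspan = 1
            while (
                r + rowspan < len(grid)
                and c < len(grid[r + rowspan])
                and grid[r + rowspan][c] in DOWN
            ):
                rowspan += 1
            spans.append((r, c, rowspan, colspan))
    return spans


# Hypothetical 2x3 table: a horizontal merge in row 0 and a vertical merge in column 2.
example = ["fcel", "lcel", "fcel", "nl", "fcel", "fcel", "ucel", "nl"]
print(otsl_spans(example))
```

The resulting spans map directly onto the `rowspan` and `colspan` attributes of HTML table cells, which is the kind of information needed to regenerate HTML from OTSL.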
### Data Splits
The dataset card lists the following splits:
- `train`
- `val`
## Additional Information
### Dataset Curators
The dataset was converted by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
Curators:
- Maksym Lysak, [@maxmnemonic](https://github.com/maxmnemonic)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Nikos Livathinos, [@nikos-livathinos](https://github.com/nikos-livathinos)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)
### Citation Information
```bib
@misc{lysak2023optimized,
title={Optimized Table Tokenization for Table Structure Recognition},
author={Maksym Lysak and Ahmed Nassar and Nikolaos Livathinos and Christoph Auer and Peter Staar},
year={2023},
eprint={2305.03393},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
Provider:
ds4sd



