PubTabNet_OTSL
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-01-25.
Download link: https://modelscope.cn/datasets/ds4sd/PubTabNet_OTSL
# Dataset Card for PubTabNet_OTSL
## Dataset Description
- **Homepage:** https://ds4sd.github.io
- **Paper:** https://arxiv.org/pdf/2305.03393
### Dataset Summary
This dataset is a conversion of the original [PubTabNet](https://developer.ibm.com/exchanges/data/all/pubtabnet/) into the OTSL format presented in our paper "Optimized Table Tokenization for Table Structure Recognition". It includes the original annotations alongside new additions.
### Dataset Structure
* cells: original dataset cell ground truth (content).
* otsl: new reduced table structure token format.
* html: original dataset ground truth HTML (structure).
* html_restored: HTML generated from OTSL.
* cols: number of columns in the grid.
* rows: number of rows in the grid.
* image: PIL image
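As a quick sanity check on the fields above, the following minimal sketch derives the grid shape from an `otsl` token sequence and compares it with the `rows` and `cols` fields. It assumes that `otsl` is stored as a flat list of token strings in which every row ends with an `"nl"` token; that storage layout, the helper names, and the hand-written example record are assumptions for illustration, not something the card states.

```python
from typing import Dict, List, Tuple


def otsl_shape(tokens: List[str]) -> Tuple[int, int]:
    """Return (rows, cols) for a flat OTSL token list.

    Assumption: each grid row is terminated by an "nl" token.
    Raises ValueError if the rows do not all have the same width.
    """
    widths: List[int] = []
    width = 0
    for tok in tokens:
        if tok == "nl":
            widths.append(width)
            width = 0
        else:
            width += 1
    if width:  # tolerate a missing trailing "nl"
        widths.append(width)
    if len(set(widths)) > 1:
        raise ValueError(f"ragged grid, row widths: {widths}")
    return len(widths), (widths[0] if widths else 0)


def matches_declared_shape(record: Dict) -> bool:
    """Check that the OTSL grid agrees with the record's rows/cols fields."""
    try:
        return otsl_shape(record["otsl"]) == (record["rows"], record["cols"])
    except ValueError:
        return False


# Hypothetical record written by hand, not taken from the dataset.
example_record = {
    "otsl": ["fcel", "fcel", "nl", "fcel", "ecel", "nl"],
    "rows": 2,
    "cols": 2,
}
print(matches_declared_shape(example_record))
```

Because every grid position carries exactly one token, a rectangular grid is a cheap invariant to verify before using a record.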
### OTSL Vocabulary:
**OTSL**: new reduced table structure token format
More information on the OTSL table structure format and its concepts can be found in our paper.
The format used in this dataset extends the work presented in the paper and introduces slight modifications:
* "fcel" - cell that has content in it
* "ecel" - cell that is empty
* "lcel" - left-looking cell (to handle horizontally merged cells)
* "ucel" - up-looking cell (to handle vertically merged cells)
* "xcel" - 2D span cell; in this dataset, it covers the entire area of a merged cell
* "nl" - new line token
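To illustrate how these tokens encode merged cells, here is a minimal sketch that recovers the row and column span of each cell from a token sequence. It assumes the tokens arrive as a flat list of strings, that `fcel` and `ecel` mark the top-left corner of a cell, and that a merged region continues to the right with `lcel` or `xcel` and downward with `ucel` or `xcel`. Accepting `xcel` in both directions is a deliberate simplification so that the sketch tolerates either the layout described in the paper or one where `xcel` fills the whole merged area; it can misjudge spans when two merged regions touch, so results are best checked against `html_restored`. The function names are hypothetical.

```python
from typing import List, Tuple

ANCHORS = {"fcel", "ecel"}       # tokens that start a cell
RIGHT_CONT = {"lcel", "xcel"}    # assumed continuation tokens to the right
DOWN_CONT = {"ucel", "xcel"}     # assumed continuation tokens downward


def otsl_to_grid(tokens: List[str]) -> List[List[str]]:
    """Split a flat token list into rows at each "nl" token."""
    grid: List[List[str]] = []
    row: List[str] = []
    for tok in tokens:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a missing trailing "nl"
        grid.append(row)
    return grid


def cell_spans(tokens: List[str]) -> List[Tuple[int, int, int, int]]:
    """Return (row, col, rowspan, colspan) for every anchor cell."""
    grid = otsl_to_grid(tokens)
    spans = []
    for r, row in enumerate(grid):
        for c, tok in enumerate(row):
            if tok not in ANCHORS:
                continue
            # Count continuation tokens to the right of the anchor.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] in RIGHT_CONT:
                colspan += 1
            # Count continuation tokens below the anchor.
            rowspan = 1
            while (
                r + rowspan < len(grid)
                and c < len(grid[r + rowspan])
                and grid[r + rowspan][c] in DOWN_CONT
            ):
                rowspan += 1
            spans.append((r, c, rowspan, colspan))
    return spans


# Hand-written example: a horizontal merge in row 0 and a vertical merge in column 2.
tokens = ["fcel", "lcel", "fcel", "nl",
          "fcel", "fcel", "ucel", "nl"]
for span in cell_spans(tokens):
    print(span)
```

Each returned tuple maps directly onto the `rowspan` and `colspan` attributes of an HTML table cell, which is the information a conversion from OTSL back to HTML needs.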
### Data Splits
The dataset provides two splits:
- `train`
- `val`
## Additional Information
### Dataset Curators
The dataset was converted by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
Curators:
- Maksym Lysak, [@maxmnemonic](https://github.com/maxmnemonic)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Nikos Livathinos, [@nikos-livathinos](https://github.com/nikos-livathinos)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)
### Citation Information
```bib
@misc{lysak2023optimized,
  title={Optimized Table Tokenization for Table Structure Recognition},
  author={Maksym Lysak and Ahmed Nassar and Nikolaos Livathinos and Christoph Auer and Peter Staar},
  year={2023},
  eprint={2305.03393},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
Provider: maas
Created: 2025-01-20



