SynthTabNet_OTSL
Source: ModelScope community (魔搭社区). Updated 2025-12-12; indexed 2025-01-25.
Download link: https://modelscope.cn/datasets/ds4sd/SynthTabNet_OTSL
# Dataset Card for SynthTabNet_OTSL
## Dataset Description
- **Homepage:** https://ds4sd.github.io
- **Paper:** https://arxiv.org/pdf/2305.03393
### Dataset Summary
This dataset is a conversion of the original [SynthTabNet](https://github.com/IBM/SynthTabNet) into the OTSL format presented in our paper "Optimized Table Tokenization for Table Structure Recognition". It keeps the original annotations and adds new ones alongside them.
SynthTabNet is organized into 4 parts of 150k tables (600k in total). Each part contains tables with different appearances in regard to their size, structure, style and content. All parts are divided into Train, Test and Val splits.
| Appearance style | Records |
|------------------|---------|
| Fintabnet | 150k |
| Marketing | 150k |
| PubTabNet | 150k |
| Sparse | 150k |
### Dataset Structure
* cells: original dataset cell ground truth (content).
* otsl: new reduced table structure token format.
* html: original dataset ground-truth HTML (structure).
* html_restored: HTML generated from OTSL.
* cols: number of columns in the grid.
* rows: number of rows in the grid.
* image: PIL image.
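As a rough illustration of how these fields relate, the sketch below checks that the `otsl` token grid of a record agrees with its `rows` and `cols` values. It assumes that `otsl` is stored as a flat list of token strings with rows terminated by the `nl` token, and that `rows` and `cols` are integers; the card does not state the field types. The names `split_rows`, `check_record`, and the sample record are hypothetical.

```python
from typing import Any, Dict, List


def split_rows(otsl: List[str]) -> List[List[str]]:
    """Split a flat OTSL token list into rows at each "nl" token."""
    rows: List[List[str]] = []
    current: List[str] = []
    for token in otsl:
        if token == "nl":
            rows.append(current)
            current = []
        else:
            current.append(token)
    if current:  # tolerate a missing trailing "nl"
        rows.append(current)
    return rows


def check_record(record: Dict[str, Any]) -> bool:
    """Return True if the OTSL grid matches the record's rows/cols fields."""
    grid = split_rows(record["otsl"])
    return (
        len(grid) == record["rows"]
        and all(len(row) == record["cols"] for row in grid)
    )


# Hypothetical record: a 2 x 3 grid (image, html and cells omitted).
sample = {
    "otsl": ["fcel", "fcel", "fcel", "nl", "fcel", "ecel", "fcel", "nl"],
    "rows": 2,
    "cols": 3,
}
print(check_record(sample))
```

A check like this can be useful as a sanity pass before training, since every row of an OTSL grid is expected to have the same number of tokens.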
### OTSL Vocabulary
**OTSL**: new reduced table structure token format
More information on the OTSL table structure format and its concepts can be found in our paper.
The format used in this dataset extends the work presented in the paper and introduces slight modifications:
* "fcel" - cell that has content in it
* "ecel" - cell that is empty
* "lcel" - left-looking cell (to handle horizontally merged cells)
* "ucel" - up-looking cell (to handle vertically merged cells)
* "xcel" - 2D span cell; in this dataset it covers the entire area of a merged cell
* "nl" - new line token
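To make the role of the merge tokens concrete, here is a minimal sketch that recovers row and column spans from an OTSL token sequence. It is not the authors' converter. It assumes the layout convention described in the paper: the origin cell (`fcel` or `ecel`) sits at the top-left of a merged region, `lcel` tokens continue it to the right, `ucel` tokens continue it downward, and `xcel` fills the interior of a 2D span. Because this card notes that the dataset modifies how `xcel` is used, 2D spans in the actual data may need different handling. The function names are hypothetical.

```python
from typing import List, Tuple


def to_grid(tokens: List[str]) -> List[List[str]]:
    """Split a flat OTSL token list into rows at each "nl" token."""
    grid: List[List[str]] = []
    row: List[str] = []
    for token in tokens:
        if token == "nl":
            grid.append(row)
            row = []
        else:
            row.append(token)
    if row:  # tolerate a missing trailing "nl"
        grid.append(row)
    return grid


def cell_spans(grid: List[List[str]]) -> List[Tuple[int, int, int, int]]:
    """Return (row, col, rowspan, colspan) for every fcel/ecel origin cell.

    Assumes the paper's convention: lcel runs extend the origin to the
    right, ucel runs extend it downward, xcel only fills the interior.
    """
    spans = []
    for r, row in enumerate(grid):
        for c, token in enumerate(row):
            if token not in ("fcel", "ecel"):
                continue  # lcel/ucel/xcel belong to an earlier origin cell
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "lcel":
                colspan += 1
            rowspan = 1
            while r + rowspan < len(grid) and grid[r + rowspan][c] == "ucel":
                rowspan += 1
            spans.append((r, c, rowspan, colspan))
    return spans


# A 3 x 3 grid whose top-left cell spans 2 rows and 2 columns.
tokens = [
    "fcel", "lcel", "fcel", "nl",
    "ucel", "xcel", "fcel", "nl",
    "fcel", "ecel", "ecel", "nl",
]
for span in cell_spans(to_grid(tokens)):
    print(span)
```

The spans map directly onto the `rowspan` and `colspan` attributes of HTML table cells, which is the kind of information needed to regenerate HTML from OTSL.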
### Data Splits
The dataset provides three splits:
- `train`
- `val`
- `test`
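A split could be loaded along the following lines. This is only a sketch: it assumes the dataset is also available on the Hugging Face Hub under the identifier `ds4sd/SynthTabNet_OTSL` and that no configuration name is required, neither of which is confirmed by this card. The helper `load_split` is hypothetical; `load_dataset` with `split` and `streaming` is the standard `datasets` library call.

```python
SPLITS = ("train", "val", "test")


def load_split(split: str, repo_id: str = "ds4sd/SynthTabNet_OTSL"):
    """Stream one split of the dataset.

    repo_id is an assumption (a Hugging Face Hub mirror of the dataset),
    not something stated in the dataset card.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module loads without the optional dependency.
    from datasets import load_dataset

    # streaming=True avoids downloading all 600k tables up front.
    return load_dataset(repo_id, split=split, streaming=True)
```

For example, `next(iter(load_split("val")))` would fetch a single record without materializing the whole split, which is convenient for inspecting the fields listed above.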
## Additional Information
### Dataset Curators
The dataset was converted by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
Curators:
- Maksym Lysak, [@maxmnemonic](https://github.com/maxmnemonic)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Nikos Livathinos, [@nikos-livathinos](https://github.com/nikos-livathinos)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)
### Citation Information
```bib
@misc{lysak2023optimized,
title={Optimized Table Tokenization for Table Structure Recognition},
author={Maksym Lysak and Ahmed Nassar and Nikolaos Livathinos and Christoph Auer and Peter Staar},
year={2023},
eprint={2305.03393},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
Provider: maas
Created: 2025-01-20
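Background overview
SynthTabNet_OTSL is a table structure recognition dataset containing 600k tables of varying styles and structures, with table structure annotated in the OTSL token format. It is intended for research on table structure recognition. The dataset was converted and is maintained by a team at IBM Research, and it provides train, val, and test splits.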