PubTables-1M_OTSL
Source: ModelScope community (updated 2025-12-05, indexed 2025-01-25)
Download link:
https://modelscope.cn/datasets/ds4sd/PubTables-1M_OTSL
# Dataset Card for PubTables-1M_OTSL
## Dataset Description
- **Homepage:** https://ds4sd.github.io
- **Paper:** https://arxiv.org/pdf/2305.03393
### Dataset Summary
This dataset enables the evaluation of both object detection models and image-to-text methods.
[PubTables-1M](https://github.com/microsoft/table-transformer) was introduced in the publication *"PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents"* by Smock et al. The conversion to the HF (Hugging Face) format and the addition of the OTSL (Optimized Table Structure Language) format are presented in our paper *"Optimized Table Tokenization for Table Structure Recognition"* by Lysak et al. The dataset includes the original annotations alongside the new additions.
### Dataset Structure
* cells: original dataset cell ground truth (content).
* table_bbox: original dataset table detection ground truth.
* otsl: new reduced table structure token format.
* html: HTML generated for PubTables-1M to match the PubTabNet, FinTabNet, and SynthTabNet format.
* html_restored: HTML generated from OTSL.
* cols: number of columns in the table grid.
* rows: number of rows in the table grid.
* image: PIL image.
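The field list above can be checked against a real record. The following is a minimal sketch, not an official loader: it assumes the dataset is also published on the Hugging Face Hub under the ID `ds4sd/PubTables-1M_OTSL` and that the `datasets` library is installed. The helper names `missing_fields` and `first_sample` are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Field names exactly as listed in the dataset card.
EXPECTED_FIELDS = (
    "cells", "table_bbox", "otsl", "html",
    "html_restored", "cols", "rows", "image",
)


def missing_fields(sample: Dict[str, Any],
                   expected: Iterable[str] = EXPECTED_FIELDS) -> List[str]:
    """Return the expected field names that are absent from one sample."""
    return [name for name in expected if name not in sample]


def first_sample(split: str = "val") -> Dict[str, Any]:
    """Stream a single record (needs network access and `pip install datasets`)."""
    # Imported lazily so that missing_fields() has no third-party dependency.
    from datasets import load_dataset

    # Assumption: this Hub ID mirrors the ModelScope dataset linked above.
    stream = load_dataset("ds4sd/PubTables-1M_OTSL", split=split, streaming=True)
    return next(iter(stream))
```

If the card matches the published data, `missing_fields(first_sample("val"))` should return an empty list. Streaming avoids downloading a full split just to inspect one record.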
### OTSL Vocabulary
**OTSL**: new reduced table structure token format.
More information on the OTSL table structure format and its concepts can be found in our paper.
The format used in this dataset extends the work presented in the paper and introduces slight modifications:
* "fcel" - cell that has content in it
* "ecel" - cell that is empty
* "lcel" - left-looking cell (to handle horizontally merged cells)
* "ucel" - up-looking cell (to handle vertically merged cells)
* "xcel" - 2D span cell; in this dataset it covers the entire area of a merged cell
* "nl" - new line token
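The vocabulary above can be turned into explicit row and column spans. The following is a minimal sketch, not the authors' conversion code. It assumes the `otsl` field is a flat list of token strings. Because the exact placement of `xcel` is only loosely described here, it counts both `lcel` and `xcel` to the right of a cell origin as a column span, and both `ucel` and `xcel` below it as a row span. All function names are hypothetical, and the generated HTML carries structure only, so it is not guaranteed to match the `html_restored` field.

```python
from typing import List, Tuple

ORIGIN_TOKENS = {"fcel", "ecel"}   # tokens that start a cell
RIGHT_TOKENS = {"lcel", "xcel"}    # continuation of a span to the right
DOWN_TOKENS = {"ucel", "xcel"}     # continuation of a span downward


def otsl_to_grid(tokens: List[str]) -> List[List[str]]:
    """Split a flat OTSL token sequence into rows at each "nl" token."""
    grid: List[List[str]] = []
    row: List[str] = []
    for tok in tokens:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a missing trailing "nl"
        grid.append(row)
    return grid


def cell_spans(grid: List[List[str]]) -> List[Tuple[int, int, int, int]]:
    """Return (row, col, rowspan, colspan) for every cell origin in the grid."""
    spans = []
    for r, row in enumerate(grid):
        for c, tok in enumerate(row):
            if tok not in ORIGIN_TOKENS:
                continue
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] in RIGHT_TOKENS:
                colspan += 1
            rowspan = 1
            while (r + rowspan < len(grid)
                   and c < len(grid[r + rowspan])
                   and grid[r + rowspan][c] in DOWN_TOKENS):
                rowspan += 1
            spans.append((r, c, rowspan, colspan))
    return spans


def otsl_to_html(tokens: List[str]) -> str:
    """Render structure-only HTML (empty <td> elements) from OTSL tokens."""
    grid = otsl_to_grid(tokens)
    rows_html = ["" for _ in grid]
    for r, _c, rowspan, colspan in cell_spans(grid):
        attrs = ""
        if rowspan > 1:
            attrs += f' rowspan="{rowspan}"'
        if colspan > 1:
            attrs += f' colspan="{colspan}"'
        rows_html[r] += f"<td{attrs}></td>"
    return "<table>" + "".join(f"<tr>{cells}</tr>" for cells in rows_html) + "</table>"


example = ["fcel", "lcel", "fcel", "nl", "fcel", "fcel", "ecel", "nl"]
print(otsl_to_html(example))
# <table><tr><td colspan="2"></td><td></td></tr><tr><td></td><td></td><td></td></tr></table>
```

Under the same assumptions, `len(grid)` and `len(grid[0])` of the parsed grid would be expected to agree with the `rows` and `cols` fields of a sample.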
### Data Splits
The dataset provides three splits:
- `train`
- `val`
- `test`
## Additional Information
### Dataset Curators
The dataset was converted by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
Curators:
- Maksym Lysak, [@maxmnemonic](https://github.com/maxmnemonic)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Nikos Livathinos, [@nikos-livathinos](https://github.com/nikos-livathinos)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)
### Citation Information
**Citation to OTSL Paper:**
@article{lysak2023optimized,
title={Optimized Table Tokenization for Table Structure Recognition},
author={Maksym Lysak and Ahmed Nassar and Nikolaos Livathinos and Christoph Auer and Peter Staar},
year={2023},
eprint={2305.03393},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
**Citation to PubTables-1M creators:**
@inproceedings{smock2022pubtables,
title={Pub{T}ables-1{M}: Towards comprehensive table extraction from unstructured documents},
author={Smock, Brandon and Pesala, Rohith and Abraham, Robin},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={4634-4642},
year={2022},
month={June}
}
Provider: maas
Created: 2025-01-20



