PubTables-1M_OTSL-v1.1

Name: PubTables-1M_OTSL-v1.1
Creator: maas
Published: 2025-12-05 16:22:42
License: No description available

ModelScope community; updated 2025-12-05; indexed 2025-02-08

Download link:

https://modelscope.cn/datasets/ds4sd/PubTables-1M_OTSL-v1.1


Resource description:

# Dataset Card for PubTables-1M_OTSL

## Dataset Description

- **Homepage:** https://ds4sd.github.io
- **Paper:** https://arxiv.org/pdf/2305.03393

### Dataset Summary

**This dataset contains tables enriched with header information. It is a filtered version of the original PubTables-1M, with fewer samples.**

This dataset enables the evaluation of both object detection models and image-to-text methods.

[PubTables-1M](https://github.com/microsoft/table-transformer) is introduced in the publication *"PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents"* by Smock et al. The conversion into HF (Hugging Face) format and the addition of the OTSL (Optimized Table Structure Language) format are presented in our paper *"Optimized Table Tokenization for Table Structure Recognition"* by Lysak et al. The dataset includes the original annotations alongside the new additions.

### Dataset Structure

* cells: original dataset cell ground truth (content).
* table_bbox: original dataset table detection ground truth.
* otsl: new reduced table structure token format.
* html: HTML generated for PubTables-1M to match the PubTabNet, FinTabNet, and SynthTabNet format.
* html_restored: HTML generated from OTSL.
* cols: grid column length.
* rows: grid row length.
* html_with_text: list of HTML table structure tags together with cell content text.
* image: PIL image.

### OTSL Vocabulary:

**OTSL**: new reduced table structure token format.

More information on the OTSL table structure format and its concepts can be found in our paper. The format used in this dataset extends the work presented in the paper and introduces slight modifications:

* "fcel" - cell that has content in it
* "ecel" - cell that is empty
* "lcel" - left-looking cell (to handle horizontally merged cells)
* "ucel" - up-looking cell (to handle vertically merged cells)
* "xcel" - 2D span cell; in this dataset it covers the entire area of a merged cell
* "nl" - new line token
* "ched" - cell that belongs to the column header
* "rhed" - cell that belongs to the row header
* "srow" - cell that belongs to a section row (header-like separator within the table)

### Data Splits

The dataset provides three splits:

- `train`
- `val`
- `test`

## Additional Information

### Dataset Curators

The dataset was converted by the [Deep Search team](https://ds4sd.github.io/) at IBM Research. You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Maksym Lysak, [@maxmnemonic](https://github.com/maxmnemonic)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Nikos Livathinos, [@nikos-livathinos](https://github.com/nikos-livathinos)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Citation Information

**Citation to OTSL Paper:**

    @article{lysak2023optimized,
      title={Optimized Table Tokenization for Table Structure Recognition},
      author={Maksym Lysak and Ahmed Nassar and Nikolaos Livathinos and Christoph Auer and Peter Staar},
      year={2023},
      eprint={2305.03393},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }

**Citation to PubTables-1M creators:**

    @inproceedings{smock2022pubtables,
      title={Pub{T}ables-1{M}: Towards comprehensive table extraction from unstructured documents},
      author={Smock, Brandon and Pesala, Rohith and Abraham, Robin},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      pages={4634-4642},
      year={2022},
      month={June}
    }
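The OTSL vocabulary described in this card can be illustrated with a small sketch that turns a flat token sequence into a grid and derives row and column spans for each cell. This is not the curators' conversion code. It assumes the `otsl` field is available as a list of token strings, that `ched`, `rhed`, and `srow` mark the origin of a cell in the same way `fcel` and `ecel` do, and that an `xcel` to the right of or below an origin extends that origin's span. The helper names `otsl_to_grid` and `cell_spans` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: header tokens start a new cell, just like "fcel" and "ecel".
ORIGIN_TOKENS = {"fcel", "ecel", "ched", "rhed", "srow"}


def otsl_to_grid(tokens: List[str]) -> List[List[str]]:
    """Split a flat OTSL token sequence into rows at each "nl" token."""
    grid: List[List[str]] = []
    row: List[str] = []
    for tok in tokens:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a missing trailing "nl"
        grid.append(row)
    if len({len(r) for r in grid}) > 1:
        raise ValueError("OTSL rows must all have the same number of tokens")
    return grid


def cell_spans(grid: List[List[str]]) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map each origin cell (row, col) to its (rowspan, colspan)."""
    spans: Dict[Tuple[int, int], Tuple[int, int]] = {}
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] not in ORIGIN_TOKENS:
                continue
            # Left-looking (or 2D span) tokens to the right widen the cell.
            colspan = 1
            while c + colspan < n_cols and grid[r][c + colspan] in ("lcel", "xcel"):
                colspan += 1
            # Up-looking (or 2D span) tokens below make the cell taller.
            rowspan = 1
            while r + rowspan < n_rows and grid[r + rowspan][c] in ("ucel", "xcel"):
                rowspan += 1
            spans[(r, c)] = (rowspan, colspan)
    return spans


# A 3x3 example: a column header spanning two columns and a cell spanning two rows.
tokens = [
    "ched", "lcel", "ched", "nl",
    "fcel", "fcel", "fcel", "nl",
    "fcel", "ucel", "ecel", "nl",
]
grid = otsl_to_grid(tokens)
spans = cell_spans(grid)
print(spans[(0, 0)])  # (1, 2)
print(spans[(1, 1)])  # (2, 1)
```

Under the same assumptions, `len(grid)` and `len(grid[0])` would be expected to match the `rows` and `cols` fields of a sample, which gives a cheap consistency check before generating HTML from the spans.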


Provider:

maas

Created:

2025-02-07

Collection summary

Dataset introduction