
FinTabNet_OTSL-v1.1

Source: ModelScope community. Updated 2025-12-12; indexed 2025-02-08.
Download link:
https://modelscope.cn/datasets/ds4sd/FinTabNet_OTSL-v1.1
Resource description:
# Dataset Card for FinTabNet_OTSL

## Dataset Description

- **Homepage:** https://ds4sd.github.io
- **Paper:** https://arxiv.org/pdf/2305.03393

### Dataset Summary

**This dataset contains tables enriched with information about headers. It is a filtered version of the original FinTabNet, with fewer samples.**

This dataset is a conversion of the original [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/) into the OTSL format presented in our paper "Optimized Table Tokenization for Table Structure Recognition". The dataset includes the original annotations alongside new additions.

Version 1.1 adds an extended set of OTSL instructions that also describe column headers, row headers, and section rows.

### Dataset Structure

* cells: original dataset cell ground truth (content).
* otsl: new reduced table structure token format.
* html: original dataset ground-truth HTML (structure).
* html_restored: HTML generated from OTSL.
* cols: grid column length.
* rows: grid row length.
* html_with_text: list of HTML table structure tags together with cell content text.
* image: PIL image.

### OTSL Vocabulary:

**OTSL**: new reduced table structure token format.

More information on the OTSL table structure format and its concepts can be found in our paper. The format of this dataset extends the work presented in the paper and introduces slight modifications:

* "fcel" - cell that has content in it
* "ecel" - cell that is empty
* "lcel" - left-looking cell (to handle horizontally merged cells)
* "ucel" - up-looking cell (to handle vertically merged cells)
* "xcel" - 2D span cell; in this dataset it covers the entire area of a merged cell
* "nl" - new line token
* "ched" - cell that belongs to a column header
* "rhed" - cell that belongs to a row header
* "srow" - cell that belongs to a section row (header-like separator within the table)

### Data Splits

The dataset provides three splits:

- `train`
- `val`
- `test`

## Additional Information

### Dataset Curators

The dataset was converted by the [Deep Search team](https://ds4sd.github.io/) at IBM Research. You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Maksym Lysak, [@maxmnemonic](https://github.com/maxmnemonic)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Nikos Livathinos, [@nikos-livathinos](https://github.com/nikos-livathinos)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Citation Information

```bib
@misc{lysak2023optimized,
  title={Optimized Table Tokenization for Table Structure Recognition},
  author={Maksym Lysak and Ahmed Nassar and Nikolaos Livathinos and Christoph Auer and Peter Staar},
  year={2023},
  eprint={2305.03393},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
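
To make the OTSL vocabulary listed in this card more concrete, here is a minimal sketch of how a flat token sequence might be turned into a grid and how merged-cell spans could be derived from it. It rests on several assumptions that the card does not confirm: that the `otsl` field is a list of token strings, that the header tokens ("ched", "rhed", "srow") start a cell in the same way as "fcel", and that spans follow the convention where "lcel" extends the cell to its left and "ucel" extends the cell above it. The function names `otsl_to_grid` and `cell_spans` are hypothetical helpers, not part of any dataset tooling, and the token sequence is a made-up example rather than a real sample.

```python
from typing import Dict, List, Tuple

# Assumption: these tokens each start a new cell; the header tokens
# ("ched", "rhed", "srow") are treated like "fcel" for span purposes.
ORIGIN_TOKENS = {"fcel", "ecel", "ched", "rhed", "srow"}


def otsl_to_grid(tokens: List[str]) -> List[List[str]]:
    """Split a flat OTSL token list into rows at each "nl" token."""
    grid: List[List[str]] = []
    row: List[str] = []
    for tok in tokens:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a missing trailing "nl"
        grid.append(row)
    return grid


def cell_spans(grid: List[List[str]]) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map each origin cell (row, col) to its (rowspan, colspan).

    Assumption: a span's first row continues with "lcel" tokens and its
    first column continues with "ucel" tokens. How "xcel" is laid out for
    2D spans in this dataset should be checked against the `html` field.
    """
    spans: Dict[Tuple[int, int], Tuple[int, int]] = {}
    for r, row in enumerate(grid):
        for c, tok in enumerate(row):
            if tok not in ORIGIN_TOKENS:
                continue
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "lcel":
                colspan += 1
            rowspan = 1
            while (
                r + rowspan < len(grid)
                and c < len(grid[r + rowspan])
                and grid[r + rowspan][c] == "ucel"
            ):
                rowspan += 1
            spans[(r, c)] = (rowspan, colspan)
    return spans


# Made-up 3x3 example: a column header spanning two columns and a
# row header spanning two rows.
tokens = [
    "ched", "lcel", "ched", "nl",
    "rhed", "fcel", "ecel", "nl",
    "ucel", "fcel", "fcel", "nl",
]
grid = otsl_to_grid(tokens)
spans = cell_spans(grid)
```

On a real sample, the number of rows and the row length of `grid` could be compared with the `rows` and `cols` fields as a sanity check, assuming those fields hold the grid dimensions as the card describes.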

Provider: maas
Created: 2025-02-07