OCRFlux-pubtabnet-cross

Name: OCRFlux-pubtabnet-cross
Creator: maas
Published: 2025-12-05 16:40:07
License: no description provided

ModelScope community. Updated 2025-12-05; indexed 2025-07-05.

Download link:

https://modelscope.cn/datasets/ChatDOC/OCRFlux-pubtabnet-cross


Description:

# OCRFlux-pubtabnet-cross

PDF documents are typically paginated, so tables and paragraphs are often split across consecutive pages. Accurately detecting and merging such cross-page structures is crucial to avoid generating incomplete or fragmented content.

Merging two table fragments is especially challenging. For example, a table spanning multiple pages may repeat the header from the first page on the second page. Another difficult scenario is a table cell with long content that wraps over several lines, where the first few lines appear on one page and the remaining lines continue on the next. We also observe cases where tables with a large number of columns are split vertically and placed on two consecutive pages.

OCRFlux-pubtabnet-cross is a benchmark of 9064 samples that can be used to measure the performance of OCR systems in cross-page table merging.

Quick links:

- 🤗 [Model](https://huggingface.co/ChatDOC/OCRFlux-3B)
- 🛠️ [Code](https://github.com/chatdoc-com/OCRFlux)

## Data Mix

We generate the dataset by splitting each original table in [OCRFlux-pubtabnet-single](https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-single) using diverse splitting strategies that simulate real-world cross-page table segmentation.

## Data Format

Each row in the dataset corresponds to two table fragments and their ground-truth merged table, all in HTML format.

### Features:

```python
{
    'image_name': string,        # Name of the original table image
    'type': string,              # Type of the original table, "simple" or "complex"
    'gt_table': string,          # Ground-truth HTML of the original table
    'table_fragment_1': string,  # HTML of the first table fragment
    'table_fragment_2': string,  # HTML of the second table fragment
}
```

## License

This dataset is licensed under Apache-2.0.
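## Example: a naive merge baseline

To make the record layout and the repeated-header case described above more concrete, here is a minimal sketch of a naive baseline. It is not the OCRFlux method and not the benchmark's official metric. It assumes a record is a plain dict with the fields listed under Features; the sample record, the helper names (`extract_rows`, `naive_merge`, `exact_match`), and the row-level exact-match check are all hypothetical and for illustration only. It handles only horizontal splits with a repeated header, and does not cover cells split across pages, vertical splits, or nested tables.

```python
import re

# Matches one <tr>...</tr> block. A regex is enough for this sketch;
# nested tables would need a real HTML parser.
ROW_RE = re.compile(r"<tr\b.*?</tr>", re.IGNORECASE | re.DOTALL)


def extract_rows(table_html):
    """Return every <tr>...</tr> block with runs of whitespace collapsed."""
    return [re.sub(r"\s+", " ", row).strip() for row in ROW_RE.findall(table_html)]


def naive_merge(fragment_1, fragment_2):
    """Concatenate the rows of two fragments, dropping a repeated header.

    Leading rows of fragment 2 that are identical to the leading rows of
    fragment 1 are treated as a repeated header and skipped.
    """
    rows_1 = extract_rows(fragment_1)
    rows_2 = extract_rows(fragment_2)
    repeated = 0
    for first, second in zip(rows_1, rows_2):
        if first != second:
            break
        repeated += 1
    return "<table>" + "".join(rows_1 + rows_2[repeated:]) + "</table>"


def exact_match(predicted_html, gt_html):
    """Row-level exact match (an illustrative check, not the official metric)."""
    return extract_rows(predicted_html) == extract_rows(gt_html)


# Hypothetical record using the field names from the Features section.
sample = {
    "image_name": "example.png",
    "type": "simple",
    "gt_table": (
        "<table><tr><td>Name</td><td>Score</td></tr>"
        "<tr><td>A</td><td>1</td></tr>"
        "<tr><td>B</td><td>2</td></tr></table>"
    ),
    "table_fragment_1": (
        "<table><tr><td>Name</td><td>Score</td></tr>"
        "<tr><td>A</td><td>1</td></tr></table>"
    ),
    "table_fragment_2": (
        "<table><tr><td>Name</td><td>Score</td></tr>"
        "<tr><td>B</td><td>2</td></tr></table>"
    ),
}

merged = naive_merge(sample["table_fragment_1"], sample["table_fragment_2"])
print(exact_match(merged, sample["gt_table"]))
```

A baseline like this will fail on the harder cases the dataset is built to probe, which is why a learned merging model is evaluated against `gt_table` rather than a rule of this kind.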


Provider: maas

Created: 2025-07-01
