OCRFlux-bench-cross

Name: OCRFlux-bench-cross
Creator: maas
Published: 2025-12-05 16:40:07
License: No description provided

Source: ModelScope community. Updated 2025-12-05. Indexed 2025-07-05.

Download link:

https://modelscope.cn/datasets/ChatDOC/OCRFlux-bench-cross


Description:

# OCRFlux-bench-cross

PDF documents are typically paginated, which often results in tables or paragraphs being split across consecutive pages. Accurately detecting and merging such cross-page structures is crucial to avoid generating incomplete or fragmented content.

The detection task can be formulated as follows: given the Markdowns of two consecutive pages, each structured as a list of Markdown elements (e.g., paragraphs and tables), the goal is to identify the indexes of elements that should be merged across the pages.

OCRFlux-bench-cross is a benchmark of 1000 samples that can be used to measure the performance of OCR systems in the cross-page table/paragraph detection task.

Quick links:
- 🤗 [Model](https://huggingface.co/ChatDOC/OCRFlux-3B)
- 🛠️ [Code](https://github.com/chatdoc-com/OCRFlux)

## Data Mix

## Table 1: Samples breakdown by language

| Language | Samples |
|--------|-------------|
| English | 500 |
| Chinese | 500 |
| **Total** | **1000** |

## Data Format

Each row in the dataset corresponds to two consecutive pages, their corresponding Markdown element lists and the indexes of elements that need to be merged. If no tables or paragraphs require merging, the indexes in the annotation data are left empty.

### Features:
```python
{
    'pdf_name_1': string,  # Name of the first PDF document
    'pdf_name_2': string,  # Name of the second PDF document
    'language': string,  # Language of the PDF document, zh or en
    'md_elem_list_1': list,  # List of the first page's Markdown elements
    'md_elem_list_2': list,  # List of the second page's Markdown elements
    'merging_idx_pairs': list,  # Pairs of Markdown element indexes in the first and second page which should be merged, be [] if no merge is needed
}
```

## License

This dataset is licensed under Apache-2.0.
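To make the record layout from the Data Format section concrete, here is a minimal sketch of how one row might be read and how predicted index pairs might be compared with the annotation. The sample record is invented for illustration. The helper names `paired_elements` and `pair_scores` are hypothetical. The sketch assumes each pair is `[index_in_page_1, index_in_page_2]` with zero-based indexes, and it uses pair-level precision/recall/F1 as a stand-in metric; none of these are confirmed by the dataset card, so check them against the actual data and the OCRFlux code before relying on them.

```python
# Hypothetical record following the documented feature layout; the values
# are made up for illustration and do not come from the dataset.
record = {
    "pdf_name_1": "doc_page_3.pdf",
    "pdf_name_2": "doc_page_4.pdf",
    "language": "en",
    "md_elem_list_1": ["# Results", "The model was evaluated on"],
    "md_elem_list_2": ["three held-out sets.", "## Discussion"],
    "merging_idx_pairs": [[1, 0]],
}


def paired_elements(rec):
    """Return (first_page_element, second_page_element) for each annotated pair.

    Assumption: each pair is [idx_in_page_1, idx_in_page_2] and indexes are
    zero-based; verify both against the actual data.
    """
    return [
        (rec["md_elem_list_1"][i], rec["md_elem_list_2"][j])
        for i, j in rec["merging_idx_pairs"]
    ]


def pair_scores(predicted, annotated):
    """Pair-level precision, recall and F1 (an assumed metric, not the official one)."""
    pred = {tuple(p) for p in predicted}
    gold = {tuple(p) for p in annotated}
    if not pred and not gold:
        return 1.0, 1.0, 1.0  # both sides agree that nothing needs merging
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


for first, second in paired_elements(record):
    print(repr(first), "continues as", repr(second))
print(pair_scores([[1, 0], [0, 1]], record["merging_idx_pairs"]))
```

Treating an empty prediction against an empty annotation as a perfect score is a design choice of this sketch: it rewards a system for correctly reporting that no merge is needed, which the card describes as a valid annotation outcome.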


Provider:

maas

Created:

2025-07-01
