
chhhhhgrghdu/OmniDocBench

Source: Hugging Face. Updated 2026-04-02, indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/chhhhhgrghdu/OmniDocBench
Resource description:
# OmniDocBench

[English](./README.md) | [简体中文](./README_ZH.md)

**OmniDocBench** is an evaluation dataset for diverse document parsing in real-world scenarios, with the following characteristics:

- **Diverse Document Types**: The evaluation set contains 1355 PDF pages, covering 9 document types, 4 layout types and 3 language types. It has broad coverage, including academic papers, financial reports, newspapers, textbooks, handwritten notes, etc.
- **Rich Annotations**: Contains location information for 15 block-level (text paragraphs, titles, tables, etc., over 20k in total) and 4 span-level (text lines, inline formulas, superscripts/subscripts, etc., over 80k in total) document elements, as well as recognition results for each element region (text annotations, LaTeX formula annotations, tables with both LaTeX and HTML annotations). OmniDocBench also provides reading order annotations for document components. Additionally, it includes various attribute labels at page and block levels, with 5 page attribute labels, 3 text attribute labels and 6 table attribute labels.
- **High Annotation Quality**: The data passed through manual screening, intelligent annotation, manual annotation, full expert quality inspection and large model quality inspection, so the data quality is relatively high.
- **Evaluation Code Suite**: Designed with end-to-end evaluation and single module evaluation code to ensure fairness and accuracy of evaluation. The evaluation code suite can be found at [OmniDocBench](https://github.com/opendatalab/OmniDocBench).

## Updates

- [2025/09/25] Major Update: updated from v1.0 to v1.5:
  - The resolution of newspaper and note images has been increased to 200 DPI.
  - To balance the number of pages in Chinese and English and increase the number of pages containing formulas, 374 new pages have been added, including 25 in Chinese and 349 in English. These pages include books, PPTs, color illustrated textbooks, test papers, magazines, and newspapers. The number of inline formulas has increased from 353 to 1050.
  - Language attributes have been added to formulas, increasing the number of Chinese formulas to 68 and English formulas to 982.
  - Fixed typos in some text and table annotations in v1.0.
- [2024/12/25] Added PDF format of the evaluation set for models that require PDFs as input for evaluation. Added original PDF slices with metadata.
- [2024/12/10] Fixed height and width fields for some samples. This fix only affects page-level height and width fields and does not impact the correctness of other annotations.
- [2024/12/04] Released OmniDocBench evaluation dataset.

## Dataset Introduction

The evaluation set contains 1355 PDF pages, covering 9 document types, 4 layout types and 3 language types. OmniDocBench has rich annotations, including 15 block-level annotations (text paragraphs, titles, tables, etc.) and 4 span-level annotations (text lines, inline formulas, superscripts/subscripts, etc.). All text-related annotation boxes contain text recognition annotations, formulas contain LaTeX annotations, and tables contain both LaTeX and HTML annotations. OmniDocBench also provides reading order annotations for document components. Additionally, it includes various attribute labels at page and block levels, with 5 page attribute labels, 3 text attribute labels and 6 table attribute labels.

![](data_diversity.png)

## Usage

You can use our [evaluation method](https://github.com/opendatalab/OmniDocBench) to conduct evaluations across several dimensions:

- End-to-end evaluation: Includes both end2end and md2md evaluation methods
- Layout detection
- Table recognition
- Formula recognition
- Text OCR

The evaluation dataset files include:

- [OmniDocBench.json](OmniDocBench.json) is the annotation file for the evaluation dataset, stored in JSON format. It supports the end2end evaluation method. The structure and fields are explained below.
- [images](./images/) are the corresponding evaluation dataset images, for models that require images as input.
- [image_to_pdf.py](https://github.com/opendatalab/OmniDocBench/blob/main/tools/image_to_pdf.py) is the script to convert images to PDFs, for models that accept only PDFs as input.

<details>
<summary>Dataset Format</summary>

The dataset format is JSON, with the following structure and field explanations:

```json
[{
    "layout_dets": [    // List of page elements
        {
            "category_type": "text_block",  // Category name
            "poly": [
                136.0,  // Position information, coordinates for top-left, top-right, bottom-right, bottom-left corners (x,y)
                781.0,
                340.0,
                781.0,
                340.0,
                806.0,
                136.0,
                806.0
            ],
            "ignore": false,        // Whether to ignore during evaluation
            "order": 0,             // Reading order
            "anno_id": 0,           // Special annotation ID, unique for each layout box
            "text": "xxx",          // Optional field, Text OCR results are written here
            "latex": "$xxx$",       // Optional field, LaTeX for formulas and tables is written here
            "html": "xxx",          // Optional field, HTML for tables is written here
            "attribute": {"xxx": "xxx"},  // Classification attributes for layout, detailed below
            "line_with_spans": [    // Span level annotation boxes
                {
                    "category_type": "text_span",
                    "poly": [...],
                    "ignore": false,
                    "text": "xxx",
                    "latex": "$xxx$",
                },
                ...
            ],
            "merge_list": [    // Only present in annotation boxes with merge relationships, merge logic depends on whether single line break separated paragraphs exist, like list types
                {
                    "category_type": "text_block",
                    "poly": [...],
                    ...   // Same fields as block level annotations
                    "line_with_spans": [...]
                    ...
                },
                ...
            ]
        },
        ...
    ],
    "page_info": {
        "page_no": 0,           // Page number
        "height": 1684,         // Page height
        "width": 1200,          // Page width
        "image_path": "xx/xx/", // Annotated page filename
        "page_attribute": {"xxx": "xxx"}  // Page attribute labels
    },
    "extra": {
        "relation": [    // Related annotations
            {
                "source_anno_id": 1,
                "target_anno_id": 2,
                "relation": "parent_son"  // Relationship label between figure/table and their corresponding caption/footnote categories
            },
            {
                "source_anno_id": 5,
                "target_anno_id": 6,
                "relation_type": "truncated"  // Paragraph truncation relationship label due to layout reasons, will be concatenated and evaluated as one paragraph during evaluation
            },
        ]
    }
},
...
]
```

</details>

<details>
<summary>Evaluation Categories</summary>

Evaluation categories include:

```
# Block level annotation boxes
'title'               # Title
'text_block'          # Paragraph level plain text
'figure',             # Figure type
'figure_caption',     # Figure description/title
'figure_footnote',    # Figure notes
'table',              # Table body
'table_caption',      # Table description/title
'table_footnote',     # Table notes
'equation_isolated',  # Display formula
'equation_caption',   # Formula number
'header'              # Header
'footer'              # Footer
'page_number'         # Page number
'page_footnote'       # Page notes
'abandon',            # Other discarded content (e.g. irrelevant information in middle of page)
'code_txt',           # Code block
'code_txt_caption',   # Code block description
'reference',          # References

# Span level annotation boxes
'text_span'           # Span level plain text
'equation_ignore',    # Formula to be ignored
'equation_inline',    # Inline formula
'footnote_mark',      # Document superscripts/subscripts
```

</details>

<details>
<summary>Attribute Labels</summary>

Page classification attributes include:

```
'data_source': # PDF type classification
    academic_literature  # Academic literature
    PPT2PDF              # PPT to PDF
    book                 # Black and white books and textbooks
    colorful_textbook    # Colorful textbooks with images
    exam_paper           # Exam papers
    note                 # Handwritten notes
    magazine             # Magazines
    research_report      # Research reports and financial reports
    newspaper            # Newspapers

'language': # Language type
    en                   # English
    simplified_chinese   # Simplified Chinese
    en_ch_mixed          # English-Chinese mixed

'layout': # Page layout type
    single_column        # Single column
    double_column        # Double column
    three_column         # Three column
    1andmore_column      # One mixed with multiple columns, common in literature
    other_layout         # Other layouts

'watermark': # Whether contains watermark
    true
    false

'fuzzy_scan': # Whether blurry scanned
    true
    false

'colorful_backgroud': # Whether contains colorful background, content to be recognized has more than two background colors
    true
    false
```

Block level attribute - Table related attributes:

```
'table_layout': # Table orientation
    vertical     # Vertical table
    horizontal   # Horizontal table

'with_span': # Merged cells
    False
    True

'line': # Table borders
    full_line      # Full borders
    less_line      # Partial borders
    fewer_line     # Three-line borders
    wireless_line  # No borders

'language': # Table language
    table_en                  # English table
    table_simplified_chinese  # Simplified Chinese table
    table_en_ch_mixed         # English-Chinese mixed table

'include_equation': # Whether table contains formulas
    False
    True

'include_backgroud': # Whether table contains background color
    False
    True

'table_vertical' # Whether table is rotated 90 or 270 degrees
    False
    True
```

Block level attribute - Text paragraph related attributes:

```
'text_language': # Text language
    text_en                  # English
    text_simplified_chinese  # Simplified Chinese
    text_en_ch_mixed         # English-Chinese mixed

'text_background': # Text background color
    white           # Default value, white background
    single_colored  # Single background color other than white
    multi_colored   # Multiple background colors

'text_rotate': # Text rotation classification within paragraphs
    normal      # Default value, horizontal text, no rotation
    rotate90    # Rotation angle, 90 degrees clockwise
    rotate180   # 180 degrees clockwise
    rotate270   # 270 degrees clockwise
    horizontal  # Text is normal but layout is vertical
```

Block level attribute - Formula related attributes:

```
'formula_type': # Formula type
    print        # Print
    handwriting  # Handwriting

'equation_language' # Formula language
    equation_en  # English
    equation_ch  # Chinese
```

</details>

## Data Display

![](show_pdf_types_1.png)
![](show_pdf_types_2.png)

## Acknowledgement

- Thanks to [Abaka AI](https://abaka.ai) for supporting the dataset annotation.

## Copyright Statement

The PDFs are collected from public online channels and community user contributions. Content that is not allowed for distribution has been removed. The dataset is for research purposes only and not for commercial use. If there are any copyright concerns, please contact OpenDataLab@pjlab.org.cn.

## Citation

```bibtex
@misc{ouyang2024omnidocbenchbenchmarkingdiversepdf,
      title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations},
      author={Linke Ouyang and Yuan Qu and Hongbin Zhou and Jiawei Zhu and Rui Zhang and Qunshu Lin and Bin Wang and Zhiyuan Zhao and Man Jiang and Xiaomeng Zhao and Jin Shi and Fan Wu and Pei Chu and Minghao Liu and Zhenxiang Li and Chao Xu and Bo Zhang and Botian Shi and Zhongying Tu and Conghui He},
      year={2024},
      eprint={2412.07626},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07626},
}
```

## Links

- Paper: https://huggingface.co/papers/2412.07626
- GitHub: https://github.com/opendatalab/OmniDocBench
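
As a rough illustration of the annotation format described under Usage, the Python sketch below reads one page entry, converts an eight-value `poly` into an axis-aligned box, and joins paragraphs linked by a `truncated` relation in reading order. It is not the official evaluation code. The helper names (`poly_to_bbox`, `relation_label`, `page_paragraphs`, `pages_with_attribute`), the sample texts, and the choice to join fragments with a single space are all assumptions made for this example. Because the format listing shows the relation label under both `relation` and `relation_type`, the sketch accepts either key.

```python
import json
from typing import Any


def load_pages(path: str) -> list[dict[str, Any]]:
    """Load the annotation file (e.g. OmniDocBench.json) as a list of page entries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
    """Turn [x1, y1, x2, y2, x3, y3, x4, y4] into (x_min, y_min, x_max, y_max)."""
    xs, ys = poly[0::2], poly[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def relation_label(rel: dict[str, Any]) -> Any:
    """Assumption: the label may be stored under 'relation_type' or 'relation'."""
    return rel.get("relation_type", rel.get("relation"))


def page_paragraphs(page: dict[str, Any]) -> list[str]:
    """Return block texts in reading order, joining 'truncated' pairs into one paragraph."""
    blocks = [
        b for b in page["layout_dets"]
        if not b.get("ignore", False) and "text" in b and b.get("order") is not None
    ]
    blocks.sort(key=lambda b: b["order"])
    by_id = {b["anno_id"]: b for b in blocks}

    # source anno_id -> target anno_id for paragraphs split by the layout
    truncated = {
        r["source_anno_id"]: r["target_anno_id"]
        for r in page.get("extra", {}).get("relation", [])
        if relation_label(r) == "truncated"
    }
    # Targets are emitted as part of their source paragraph, not on their own.
    absorbed = {t for s, t in truncated.items() if s in by_id}

    paragraphs = []
    for block in blocks:
        if block["anno_id"] in absorbed:
            continue
        parts, seen = [block["text"]], {block["anno_id"]}
        next_id = truncated.get(block["anno_id"])
        while next_id in by_id and next_id not in seen:
            parts.append(by_id[next_id]["text"])
            seen.add(next_id)
            next_id = truncated.get(next_id)
        paragraphs.append(" ".join(parts))  # joining with a space is an assumption
    return paragraphs


def pages_with_attribute(pages: list[dict[str, Any]], key: str, value: Any) -> list[dict[str, Any]]:
    """Select pages whose page_attribute[key] equals value."""
    return [p for p in pages if p["page_info"].get("page_attribute", {}).get(key) == value]


# Hypothetical page entry that follows the documented structure.
box = [136.0, 781.0, 340.0, 781.0, 340.0, 806.0, 136.0, 806.0]
sample_page = {
    "layout_dets": [
        {"category_type": "title", "poly": box, "ignore": False, "order": 2,
         "anno_id": 7, "text": "Next section"},
        {"category_type": "text_block", "poly": box, "ignore": False, "order": 0,
         "anno_id": 5, "text": "A paragraph that is cut"},
        {"category_type": "text_block", "poly": box, "ignore": False, "order": 1,
         "anno_id": 6, "text": "off by a column break."},
        {"category_type": "abandon", "poly": box, "ignore": True, "order": None,
         "anno_id": 8, "text": "stray mark"},
    ],
    "page_info": {"page_no": 0, "height": 1684, "width": 1200,
                  "image_path": "xx/xx/", "page_attribute": {"language": "en"}},
    "extra": {"relation": [
        {"source_anno_id": 5, "target_anno_id": 6, "relation_type": "truncated"},
    ]},
}

print(poly_to_bbox(box))  # (136.0, 781.0, 340.0, 806.0)
print(page_paragraphs(sample_page))
print(len(pages_with_attribute([sample_page], "language", "en")))
```

With the real file, `load_pages("OmniDocBench.json")` would supply the page list. For scoring, the evaluation suite in the GitHub repository remains the reference; this sketch only shows how the fields relate to one another.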

Provider:
chhhhhgrghdu