
OmniDocBench

ModelScope community · updated 2026-05-23 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/evalscope/OmniDocBench
Resource description:
# OmniDocBench

[English](./README.md) | [简体中文](./README_ZH.md)

**OmniDocBench** is an evaluation dataset for diverse document parsing in real-world scenarios, with the following characteristics:

- **Diverse Document Types**: The evaluation set contains **1651** PDF pages, covering **10** document types, **5** layout types and **5** language types. Coverage includes academic literature, research and financial reports, newspapers, textbooks, exam papers, magazines, handwritten notes, historical documents, and more.
- **Rich Annotations**: Contains localization for **28** block-level categories (text paragraphs, titles, tables, formulas, headers/footers, etc.) and **4** span-level categories (text lines, inline formulas, superscripts/subscripts, etc.), plus recognition results for each region (text, LaTeX for formulas, LaTeX and HTML for tables). OmniDocBench also provides reading-order annotations for layout elements. Page- and block-level attribute labels include **5** page attribute categories, **3** text-related attributes and **6** table-related attributes.
- **High Annotation Quality**: Through manual screening, intelligent annotation, manual annotation, full expert quality inspection and large model quality inspection, the data quality is relatively high.
- **Evaluation Code Suite**: Designed with end-to-end evaluation and single module evaluation code to ensure fairness and accuracy of evaluation. The evaluation code suite can be found at [OmniDocBench](https://github.com/opendatalab/OmniDocBench).

## Updates

- [2026/04/09]
  1. Added a **296-page** hard subset for difficult formulas, tables, and layouts.
  2. Corrected part of the table, formula, and OCR annotations from v1.5. The full **1651-page** release is in `OmniDocBench.json`.
- [2025/09/25]
  1. Newspaper and note images were upgraded to **200 DPI**; fixed some OCR and table GT issues from v1.0.
  2. To balance Chinese and English pages and increase pages with formulas, **374** pages were added (25 Chinese, 349 English), including books, PPT-to-PDF, colorful textbooks, exam papers, magazines, and newspapers; display (`equation_isolated`) formulas increased from **353** to **1050**; formula language attributes were added (**68** Chinese display formulas, **982** English display formulas).
- [2024/12/25] Added PDF format of the evaluation set for models that require PDFs as input for evaluation. Added original PDF slices with metadata.
- [2024/12/10] Fixed height and width fields for some samples. This fix only affects page-level height and width fields and does not impact the correctness of other annotations.
- [2024/12/04] Released OmniDocBench evaluation dataset.

## Dataset Introduction

The evaluation set contains **1651** PDF pages, covering **10** document types, **5** layout types and **5** language types. OmniDocBench has rich annotations, including **28** block-level categories (text paragraphs, titles, tables, formulas, headers/footers, etc.) and **4** span-level categories (text lines, inline formulas, superscripts/subscripts, etc.). All text-related annotation boxes contain text recognition annotations, formulas contain LaTeX annotations, and tables contain both LaTeX and HTML annotations. OmniDocBench also provides reading order annotations for document components. Additionally, it includes various attribute labels at page and block levels, with 5 page attribute categories, 3 text attribute labels and 6 table attribute labels.
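
The page-level attribute labels mentioned above lend themselves to a quick check of the dataset composition. The following is a minimal sketch, not part of the official evaluation suite. It assumes the annotation file is a JSON list of pages whose `page_info.page_attribute` maps attribute names such as `data_source` to values, as documented under Dataset Format. The helper names `load_pages` and `count_page_attribute`, and the sample pages, are hypothetical.

```python
import json
from collections import Counter


def load_pages(path="OmniDocBench.json"):
    """Load the annotation file (assumed to be a JSON list of page records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_page_attribute(pages, key):
    """Count pages per value of one page-level attribute, e.g. 'data_source'."""
    counts = Counter()
    for page in pages:
        attrs = page.get("page_info", {}).get("page_attribute", {})
        # Pages lacking the attribute are tallied under a placeholder label.
        counts[attrs.get(key, "<missing>")] += 1
    return counts


# Tiny in-memory stand-in for the real file (illustrative values only).
sample_pages = [
    {"page_info": {"page_attribute": {"data_source": "book", "language": "english"}}},
    {"page_info": {"page_attribute": {"data_source": "note", "language": "english"}}},
    {"page_info": {"page_attribute": {"data_source": "book", "language": "other"}}},
]

source_counts = count_page_attribute(sample_pages, "data_source")
print(dict(source_counts))
```

With the real file, `count_page_attribute(load_pages(), "data_source")` would give the per-type page counts, assuming the file sits in the working directory.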
![](data_diversity.png)

## Usage

You can use our [evaluation method](https://github.com/opendatalab/OmniDocBench) to conduct evaluations across several dimensions:

- End-to-end evaluation: Includes both end2end and md2md evaluation methods
- Layout detection
- Table recognition
- Formula recognition
- Text OCR

The evaluation dataset files include:

- [OmniDocBench.json](OmniDocBench.json) is the full annotation file for the evaluation dataset (**1651** pages), stored in JSON format. It supports the end2end evaluation method. The structure and fields are explained below.
- [images](./images/) are the corresponding evaluation dataset images, for models that require images as input.
- [image_to_pdf.py](https://github.com/opendatalab/OmniDocBench/blob/main/tools/image_to_pdf.py) is the script to convert images to PDFs for models that take only PDFs as input.

<details>
<summary>Dataset Format</summary>

The dataset format is JSON, with the following structure and field explanations:

```json
[{
    "layout_dets": [    // List of page elements
        {
            "category_type": "text_block",  // Category name
            "poly": [   // Position information, coordinates for top-left, top-right, bottom-right, bottom-left corners (x,y)
                136.0, 781.0,
                340.0, 781.0,
                340.0, 806.0,
                136.0, 806.0
            ],
            "ignore": false,        // Whether to ignore during evaluation
            "order": 0,             // Reading order
            "anno_id": 0,           // Special annotation ID, unique for each layout box
            "text": "xxx",          // Optional field, Text OCR results are written here
            "latex": "$xxx$",       // Optional field, LaTeX for formulas and tables is written here
            "html": "xxx",          // Optional field, HTML for tables is written here
            "attribute": {"xxx": "xxx"},    // Classification attributes for layout, detailed below
            "line_with_spans": [    // Span level annotation boxes
                {
                    "category_type": "text_span",
                    "poly": [...],
                    "ignore": false,
                    "text": "xxx",
                    "latex": "$xxx$",
                },
                ...
            ],
            "merge_list": [    // Only present in annotation boxes with merge relationships, merge logic depends on whether single line break separated paragraphs exist, like list types
                {
                    "category_type": "text_block",
                    "poly": [...],
                    ...   // Same fields as block level annotations
                    "line_with_spans": [...]
                    ...
                },
                ...
            ]
        ...
    ],
    "page_info": {
        "page_no": 0,           // Page number
        "height": 1684,         // Page height
        "width": 1200,          // Page width
        "image_path": "xx/xx/", // Annotated page filename
        "page_attribute": {"xxx": "xxx"}    // Page attribute labels
    },
    "extra": {
        "relation": [ // Related annotations
            {
                "source_anno_id": 1,
                "target_anno_id": 2,
                "relation": "parent_son"  // Relationship label between figure/table and their corresponding caption/footnote categories
            },
            {
                "source_anno_id": 5,
                "target_anno_id": 6,
                "relation_type": "truncated"  // Paragraph truncation relationship label due to layout reasons, will be concatenated and evaluated as one paragraph during evaluation
            },
        ]
    }
},
...
]
```

</details>

<details>
<summary>Evaluation Categories</summary>

Evaluation categories include:

```
# Block level annotation boxes (28 category_type values in v1.6 full release)
'title'                          # Title
'text_block'                     # Paragraph level plain text
'list_group'                     # List group
'reference'                      # References
'figure'                         # Figure
'figure_caption'                 # Figure caption / title
'figure_footnote'                # Figure note
'table'                          # Table body
'table_caption'                  # Table caption / title
'table_footnote'                 # Table footnote
'equation_isolated'              # Display formula
'equation_caption'               # Formula number / tag
'equation_semantic'              # Semantic formula region
'equation_explanation'           # Formula explanation / derivation-like text
'header'                         # Header
'footer'                         # Footer
'page_number'                    # Page number
'page_footnote'                  # Page footnote
'abandon'                        # Discarded / irrelevant regions
'code_txt'                       # Code block
'code_txt_caption'               # Code caption
'chart_mask'                     # Chart region to mask
'table_mask'                     # Table region to mask
'text_mask'                      # Text region to mask
'organic_chemical_formula_mask'  # Organic chemistry structure mask
'algorithm_mask'                 # Algorithm / pseudocode mask
'unknown_mask'                   # Other mask class
'need_mask'                      # Region requiring masking / pending mask class

# Span level annotation boxes
'text_span'                      # Span level plain text
'equation_ignore'                # Formula to be ignored
'equation_inline'                # Inline formula
'footnote_mark'                  # Document superscripts/subscripts
```

</details>

<details>
<summary>Attribute Labels</summary>

Page classification attributes include:

```
'data_source': # PDF type classification
    academic_literature  # Academic literature
    PPT2PDF              # PPT to PDF
    book                 # Black and white books and textbooks
    colorful_textbook    # Colorful textbooks with images
    exam_paper           # Exam papers
    note                 # Handwritten notes
    magazine             # Magazines
    research_report      # Research reports and financial reports
    newspaper            # Newspapers
    historical_document  # Historical documents

'language': # Language type (page attribute values)
    english              # English
    simplified_chinese   # Simplified Chinese
    en_ch_mixed          # English-Chinese mixed
    traditional_chinese  # Traditional Chinese
    other                # Other

'layout': # Page layout type
    single_column    # Single column
    double_column    # Double column
    three_column     # Three column
    1andmore_column  # One mixed with multiple columns, common in literature
    other_layout     # Other layouts

'watermark': # Whether contains watermark
    true
    false

'fuzzy_scan': # Whether blurry scanned
    true
    false

'colorful_backgroud': # Whether contains colorful background, content to be recognized has more than two background colors
    true
    false
```

Block level attribute - Table related attributes:

```
'table_layout': # Table orientation
    vertical    # Vertical table
    horizontal  # Horizontal table

'with_span': # Merged cells
    False
    True

'line': # Table borders
    full_line      # Full borders
    less_line      # Partial borders
    fewer_line     # Three-line borders
    wireless_line  # No borders

'language': # Table language
    table_en                  # English table
    table_simplified_chinese  # Simplified Chinese table
    table_en_ch_mixed         # English-Chinese mixed table

'include_equation': # Whether table contains formulas
    False
    True

'include_backgroud': # Whether table contains background color
    False
    True

'table_vertical': # Whether table is rotated 90 or 270 degrees
    False
    True
```

Block level attribute - Text paragraph related attributes:

```
'text_language': # Text language
    text_en                  # English
    text_simplified_chinese  # Simplified Chinese
    text_en_ch_mixed         # English-Chinese mixed

'text_background': # Text background color
    white           # Default value, white background
    single_colored  # Single background color other than white
    multi_colored   # Multiple background colors

'text_rotate': # Text rotation classification within paragraphs
    normal      # Default value, horizontal text, no rotation
    rotate90    # Rotation angle, 90 degrees clockwise
    rotate180   # 180 degrees clockwise
    rotate270   # 270 degrees clockwise
    horizontal  # Text is normal but layout is vertical
```

Block level attribute - Formula related attributes:

```
'formula_type': # Formula type
    print        # Print
    handwriting  # Handwriting

'equation_language': # Formula language
    equation_en  # English formula
    equation_ch  # Chinese formula
```

</details>

## Data Display

![](show_pdf_types_1.png)
![](show_pdf_types_2.png)

## Acknowledgement

- Thank [Abaka AI](https://abaka.ai) for supporting the dataset annotation.

## Copyright Statement

The PDFs are collected from public online channels and community user contributions. Content that is not allowed for distribution has been removed. The dataset is for research purposes only and not for commercial use. If there are any copyright concerns, please contact OpenDataLab@pjlab.org.cn.

## Citation

```bibtex
@misc{ouyang2024omnidocbenchbenchmarkingdiversepdf,
      title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations},
      author={Linke Ouyang and Yuan Qu and Hongbin Zhou and Jiawei Zhu and Rui Zhang and Qunshu Lin and Bin Wang and Zhiyuan Zhao and Man Jiang and Xiaomeng Zhao and Jin Shi and Fan Wu and Pei Chu and Minghao Liu and Zhenxiang Li and Chao Xu and Bo Zhang and Botian Shi and Zhongying Tu and Conghui He},
      year={2024},
      eprint={2412.07626},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07626},
}
```

## Links

- Paper: https://huggingface.co/papers/2412.07626
- GitHub: https://github.com/opendatalab/OmniDocBench
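
As a small illustration of the annotation format described under Dataset Format, the sketch below converts a block's 8-value `poly` (top-left, top-right, bottom-right, bottom-left corners) into an axis-aligned bounding box and lists a page's blocks in reading order. It is a hedged example rather than the official evaluation logic. The function names and the sample page are hypothetical, and skipping blocks that have `ignore` set or no `order` value is an assumption made here for illustration.

```python
def poly_to_bbox(poly):
    """Convert [x1, y1, x2, y2, x3, y3, x4, y4] into (x_min, y_min, x_max, y_max)."""
    xs = poly[0::2]  # every even index is an x coordinate
    ys = poly[1::2]  # every odd index is a y coordinate
    return (min(xs), min(ys), max(xs), max(ys))


def blocks_in_reading_order(page):
    """Return the page's non-ignored blocks sorted by their 'order' field."""
    blocks = [
        b for b in page.get("layout_dets", [])
        # Assumption: ignored blocks and blocks without an order are skipped.
        if not b.get("ignore", False) and b.get("order") is not None
    ]
    return sorted(blocks, key=lambda b: b["order"])


# Hypothetical page record following the documented field names.
sample_page = {
    "layout_dets": [
        {"category_type": "text_block", "order": 1, "anno_id": 1, "ignore": False,
         "poly": [136.0, 781.0, 340.0, 781.0, 340.0, 806.0, 136.0, 806.0]},
        {"category_type": "title", "order": 0, "anno_id": 0, "ignore": False,
         "poly": [100.0, 50.0, 500.0, 50.0, 500.0, 90.0, 100.0, 90.0]},
        {"category_type": "abandon", "order": None, "anno_id": 2, "ignore": True,
         "poly": [0.0, 0.0, 10.0, 0.0, 10.0, 10.0, 0.0, 10.0]},
    ]
}

for block in blocks_in_reading_order(sample_page):
    print(block["category_type"], poly_to_bbox(block["poly"]))
```

Taking the min and max over the corner coordinates keeps the conversion valid even when a box is slightly rotated, at the cost of a looser box.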

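# OmniDocBench

[English documentation](./README.md) | [Simplified Chinese documentation](./README_ZH.md)

**OmniDocBench** is an evaluation dataset for diverse document parsing in real-world scenarios, with the following characteristics:

- **Diverse document types**: The evaluation set contains 1355 PDF pages, covering 9 document types, 4 layout types and 3 language types, spanning academic papers, financial reports, newspapers, textbooks, handwritten notes, and other scenarios.
- **Rich annotations**: Contains position information for 15 block-level document element types (text paragraphs, titles, tables, etc., more than 20,000 in total) and 4 span-level types (text lines, inline formulas, superscripts/subscripts, etc., more than 80,000 in total), plus recognition results for each element region (text annotations, LaTeX formula annotations, and tables annotated in both LaTeX and HTML). OmniDocBench also provides reading-order annotations for document components, along with attribute labels at the page and block levels: 5 page attribute labels, 3 text attribute labels and 6 table attribute labels.
- **High annotation quality**: Manual screening, intelligent annotation, manual annotation, full-process expert quality inspection and large-model quality inspection ensure relatively high annotation quality.
- **Evaluation code suite**: Ships with end-to-end and single-module evaluation code to ensure fair and accurate evaluation. The evaluation code suite is available at [OmniDocBench](https://github.com/opendatalab/OmniDocBench).

## Updates

- [2025/09/25] Major update, upgrading from v1.0 to v1.5:
  - Image resolution for newspapers and notes raised to 200 DPI.
  - To balance the number of Chinese and English pages and add more pages containing formulas, 374 pages were added (25 Chinese, 349 English), covering books, PPT, colorful illustrated textbooks, exam papers, magazines and newspapers. The number of inline formulas increased from 353 to 1050.
  - Added a language attribute for formulas; the number of Chinese formulas rose to 68 and English formulas to 982.
  - Fixed spelling errors in some text and table annotations from v1.0.
- [2024/12/25] Added a PDF version of the evaluation set for models that require PDFs as input. Added original PDF slices with metadata.
- [2024/12/10] Fixed the height and width fields for some samples. This fix only affects page-level height and width fields and does not affect the correctness of other annotations.
- [2024/12/04] Released the OmniDocBench evaluation dataset.

## Dataset Introduction

The evaluation set contains 1355 PDF pages, covering 9 document types, 4 layout types and 3 language types. OmniDocBench has rich annotations, covering 15 block-level annotation types (text paragraphs, titles, tables, etc.) and 4 span-level annotation types (text lines, inline formulas, superscripts/subscripts, etc.). All text-related annotation boxes contain text recognition annotations, formulas contain LaTeX annotations, and tables contain both LaTeX and HTML annotations. OmniDocBench also provides reading-order annotations for document components, along with attribute labels at the page and block levels: 5 page attribute labels, 3 text attribute labels and 6 table attribute labels.

![](data_diversity.png)

## Usage

You can use our [evaluation method](https://github.com/opendatalab/OmniDocBench) to run evaluations across several dimensions:

- End-to-end evaluation: includes both end2end and md2md evaluation methods
- Layout detection
- Table recognition
- Formula recognition
- Text OCR (Optical Character Recognition)

The evaluation dataset files include:

- [OmniDocBench.json](OmniDocBench.json) is the annotation file for the evaluation dataset, stored in JSON format. It supports the end2end evaluation method. Its structure and fields are explained below.
- [images](./images/) are the corresponding evaluation dataset images, for models that require images as input.
- [image_to_pdf.py](https://github.com/opendatalab/OmniDocBench/blob/main/tools/image_to_pdf.py) is a script that converts images to PDFs, for models that accept only PDFs as input.

<details>
<summary>Dataset Format</summary>

The dataset format is JSON, with the following structure and field explanations:

```json
[{
    "layout_dets": [    // List of page elements
        {
            "category_type": "text_block",  // Category name
            "poly": [   // Position information: (x,y) coordinates of the four corners (top-left, top-right, bottom-right, bottom-left)
                136.0, 781.0,
                340.0, 781.0,
                340.0, 806.0,
                136.0, 806.0
            ],
            "ignore": false,        // Whether to ignore this element during evaluation
            "order": 0,             // Reading order
            "anno_id": 0,           // Special annotation ID, unique for each layout box
            "text": "xxx",          // Optional field, stores text OCR results
            "latex": "$xxx$",       // Optional field, stores LaTeX for formulas and tables
            "html": "xxx",          // Optional field, stores HTML for tables
            "attribute": {"xxx": "xxx"},    // Layout classification attributes, detailed below
            "line_with_spans": [    // List of span-level annotation boxes
                {
                    "category_type": "text_span",
                    "poly": [...],
                    "ignore": false,
                    "text": "xxx",
                    "latex": "$xxx$",
                },
                ...
            ],
            "merge_list": [    // Only present in annotation boxes with merge relationships; merge logic depends on whether paragraphs separated by single line breaks exist (e.g. list types)
                {
                    "category_type": "text_block",
                    "poly": [...],
                    ...   // Same fields as block-level annotations
                    "line_with_spans": [...]
                    ...
                },
                ...
            ]
        ...
    ],
    "page_info": {
        "page_no": 0,           // Page number
        "height": 1684,         // Page height
        "width": 1200,          // Page width
        "image_path": "xx/xx/", // File path of the annotated page
        "page_attribute": {"xxx": "xxx"}    // Page attribute labels
    },
    "extra": {
        "relation": [ // Related annotation relationships
            {
                "source_anno_id": 1,
                "target_anno_id": 2,
                "relation": "parent_son"  // Relationship label between a figure/table and its caption/footnote
            },
            {
                "source_anno_id": 5,
                "target_anno_id": 6,
                "relation_type": "truncated"  // Paragraph truncation relationship caused by layout; the parts are concatenated and evaluated as one paragraph
            },
        ]
    }
},
...
]
```

</details>

<details>
<summary>Evaluation Categories</summary>

Evaluation categories include:

```
# Block-level annotation boxes
'title'              # Title
'text_block'         # Paragraph-level plain text
'figure'             # Figure
'figure_caption'     # Figure caption / title
'figure_footnote'    # Figure note
'table'              # Table body
'table_caption'      # Table caption / title
'table_footnote'     # Table note
'equation_isolated'  # Display formula
'equation_caption'   # Formula number
'header'             # Header
'footer'             # Footer
'page_number'        # Page number
'page_footnote'      # Page note
'abandon'            # Other discarded content (e.g. irrelevant information in the middle of a page)
'code_txt'           # Code block
'code_txt_caption'   # Code block caption
'reference'          # References

# Span-level annotation boxes
'text_span'          # Span-level plain text
'equation_ignore'    # Formula to be ignored
'equation_inline'    # Inline formula
'footnote_mark'      # Document superscripts/subscripts
```

</details>

<details>
<summary>Attribute Labels</summary>

Page classification attributes include:

```
'data_source': # PDF type classification
    academic_literature  # Academic literature
    PPT2PDF              # PPT to PDF
    book                 # Black and white books and textbooks
    colorful_textbook    # Colorful textbooks with images
    exam_paper           # Exam papers
    note                 # Handwritten notes
    magazine             # Magazines
    research_report      # Research reports and financial reports
    newspaper            # Newspapers

'language': # Language type
    en                  # English
    simplified_chinese  # Simplified Chinese
    en_ch_mixed         # English-Chinese mixed

'layout': # Page layout type
    single_column    # Single column
    double_column    # Double column
    three_column     # Three column
    1andmore_column  # Single column mixed with multiple columns, common in academic literature
    other_layout     # Other layouts

'watermark': # Whether the page contains a watermark
    true   # Yes
    false  # No

'fuzzy_scan': # Whether the page is a blurry scan
    true   # Yes
    false  # No

'colorful_backgroud': # Whether the page has a colorful background, i.e. the content to be recognized has more than two background colors
    true   # Yes
    false  # No
```

Block-level attributes - Table-related attributes:

```
'table_layout': # Table orientation
    vertical    # Vertical table
    horizontal  # Horizontal table

'with_span': # Whether the table contains merged cells
    False  # No
    True   # Yes

'line': # Table borders
    full_line      # Full borders
    less_line      # Partial borders
    fewer_line     # Three-line table
    wireless_line  # No borders

'language': # Table language
    table_en                  # English table
    table_simplified_chinese  # Simplified Chinese table
    table_en_ch_mixed         # English-Chinese mixed table

'include_equation': # Whether the table contains formulas
    False  # No
    True   # Yes

'include_backgroud': # Whether the table contains background color
    False  # No
    True   # Yes

'table_vertical': # Whether the table is rotated 90 or 270 degrees
    False  # No
    True   # Yes
```

Block-level attributes - Text-paragraph-related attributes:

```
'text_language': # Text language
    text_en                  # English
    text_simplified_chinese  # Simplified Chinese
    text_en_ch_mixed         # English-Chinese mixed

'text_background': # Text background color
    white           # Default value, white background
    single_colored  # Single background color other than white
    multi_colored   # Multiple background colors

'text_rotate': # Text rotation classification within paragraphs
    normal      # Default value, horizontal text, no rotation
    rotate90    # Rotated 90 degrees clockwise
    rotate180   # Rotated 180 degrees clockwise
    rotate270   # Rotated 270 degrees clockwise
    horizontal  # Text is normal but the layout is vertical
```

Block-level attributes - Formula-related attributes:

```
'formula_type': # Formula type
    print        # Printed
    handwriting  # Handwritten

'equation_language': # Formula language
    equation_en  # English
    equation_ch  # Chinese
```

</details>

## Data Display

![](show_pdf_types_1.png)
![](show_pdf_types_2.png)

## Acknowledgement

- Thanks to [Abaka AI](https://abaka.ai) for supporting the dataset annotation.

## Copyright Statement

The PDFs in this dataset come from public online channels and community user contributions. Content that is not allowed for distribution has been removed. The dataset is for research purposes only and must not be used commercially. For any copyright concerns, please contact OpenDataLab@pjlab.org.cn.

## Citation

```bibtex
@misc{ouyang2024omnidocbenchbenchmarkingdiversepdf,
      title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations},
      author={Linke Ouyang and Yuan Qu and Hongbin Zhou and Jiawei Zhu and Rui Zhang and Qunshu Lin and Bin Wang and Zhiyuan Zhao and Man Jiang and Xiaomeng Zhao and Jin Shi and Fan Wu and Pei Chu and Minghao Liu and Zhenxiang Li and Chao Xu and Bo Zhang and Botian Shi and Zhongying Tu and Conghui He},
      year={2024},
      eprint={2412.07626},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07626},
}
```

## Links

- Paper: https://huggingface.co/papers/2412.07626
- GitHub repository: https://github.com/opendatalab/OmniDocBench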
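
The format description states that paragraphs linked by a `truncated` relation are concatenated and evaluated as one paragraph. The sketch below shows one way such a join could look. It is not the official evaluation code: the function name, the separator choice, and the sample page are hypothetical. Because the format example shows the label under both `relation` and `relation_type`, the code checks both keys, which is an assumption on my part.

```python
def merge_truncated(page, sep=" "):
    """Join the text of blocks chained by 'truncated' relations.

    Returns one string per paragraph, in the order blocks appear on the page.
    """
    blocks = {b["anno_id"]: b for b in page.get("layout_dets", [])}
    next_of = {}       # anno_id -> anno_id of the continuation block
    has_prev = set()   # anno_ids that continue an earlier block
    for rel in page.get("extra", {}).get("relation", []):
        # Assumption: the label may be stored under either key.
        label = rel.get("relation_type", rel.get("relation"))
        if label == "truncated":
            next_of[rel["source_anno_id"]] = rel["target_anno_id"]
            has_prev.add(rel["target_anno_id"])

    merged = []
    for anno_id, block in blocks.items():
        if anno_id in has_prev:
            continue  # emitted together with the head of its chain
        parts = [block.get("text", "")]
        seen = {anno_id}
        cur = anno_id
        # Follow the chain, guarding against cycles and dangling targets.
        while cur in next_of and next_of[cur] in blocks and next_of[cur] not in seen:
            cur = next_of[cur]
            seen.add(cur)
            parts.append(blocks[cur].get("text", ""))
        text = sep.join(p for p in parts if p)
        if text:
            merged.append(text)
    return merged


# Hypothetical page record following the documented field names.
sample_page = {
    "layout_dets": [
        {"anno_id": 1, "category_type": "figure"},
        {"anno_id": 2, "category_type": "figure_caption", "text": "Figure 1."},
        {"anno_id": 5, "category_type": "text_block", "text": "The quick brown"},
        {"anno_id": 6, "category_type": "text_block", "text": "fox jumps."},
    ],
    "extra": {
        "relation": [
            {"source_anno_id": 1, "target_anno_id": 2, "relation": "parent_son"},
            {"source_anno_id": 5, "target_anno_id": 6, "relation_type": "truncated"},
        ]
    },
}

print(merge_truncated(sample_page))
```

For Chinese pages an empty separator (`sep=""`) is likely the better choice, since words are not space-delimited.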
Provider:
maas
Created:
2025-10-22
Collected summary
Dataset introduction
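Background and challenges
Background overview
OmniDocBench is a dataset for evaluating diverse document parsing. It contains 1651 PDF pages covering 10 document types and 5 language types, with rich, high-quality annotations. The dataset also provides an evaluation code suite that supports multiple evaluation methods.
The content above was collected and summarized by 遇见数据集.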