
allenai/olmOCR-bench-1.5-preview

Source: Hugging Face. Updated 2026-03-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/allenai/olmOCR-bench-1.5-preview
Description:
---
license: odc-by
tags:
- text
configs:
- config_name: olmocr-bench
  data_files:
  - split: arxiv_math
    path:
    - bench_data/arxiv_math.jsonl
  - split: headers_footers
    path:
    - bench_data/headers_footers.jsonl
  - split: long_tiny_text
    path:
    - bench_data/long_tiny_text.jsonl
  - split: multi_column
    path:
    - bench_data/multi_column.jsonl
  - split: old_scans
    path:
    - bench_data/old_scans.jsonl
  - split: old_scans_math
    path:
    - bench_data/old_scans_math.jsonl
  - split: table_tests
    path:
    - bench_data/table_tests.jsonl
  - split: rotated
    path:
    - bench_data/rotated.jsonl
  - split: blank_pages
    path:
    - bench_data/blank_pages.jsonl
  - split: synthetic_exact_match
    path:
    - bench_data/synthetic_exact_match.jsonl
  - split: synthetic_footnotes
    path:
    - bench_data/synthetic_footnotes.jsonl
  - split: synthetic_formatting
    path:
    - bench_data/synthetic_formatting.jsonl
  - split: synthetic_tables_hard
    path:
    - bench_data/synthetic_tables_hard.jsonl
  - split: synthetic_general
    path:
    - bench_data/synthetic_general.jsonl
  - split: synthetic_dense
    path:
    - bench_data/synthetic_dense.jsonl
language:
- en
pretty_name: olmOCR-bench
size_categories:
- 10K<n<100K
---

# olmOCR-bench-1.5-preview

olmOCR-bench-1.5-preview is a preview follow-up to the original [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) that adds several new synthetic benchmark categories.

In addition to the original 1,403 PDF files and 7,010 unit test cases that were manually created as part of [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench), this repo contains additional synthetic tests designed to cover difficult OCR scenarios. In all synthetic cases, we sample PDFs from the same distribution as in [dolma3_mix-6T](https://huggingface.co/datasets/allenai/dolma3_mix-6T), rerender them using Claude into clean semantic HTML following the [olmOCR synthetic pipeline](https://github.com/allenai/olmocr/blob/devel/olmocr/synth/mine_html_templates.py), and automatically extract test cases.
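The configuration above maps each of the 15 splits to one JSONL file under `bench_data/`. As a minimal sketch, assuming a local copy of the repository and assuming each JSONL line is one test case carrying a `type` field (the field name is an assumption, not something this card states), the tests in a split could be tallied roughly like this. The helper names `split_path`, `read_jsonl`, and `count_by_type` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

# Split names copied from the `configs` section of the dataset card.
SPLITS = [
    "arxiv_math", "headers_footers", "long_tiny_text", "multi_column",
    "old_scans", "old_scans_math", "table_tests", "rotated", "blank_pages",
    "synthetic_exact_match", "synthetic_footnotes", "synthetic_formatting",
    "synthetic_tables_hard", "synthetic_general", "synthetic_dense",
]


def split_path(root, split):
    """Return the JSONL path for a split, following the card's layout."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    return Path(root) / "bench_data" / f"{split}.jsonl"


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_type(path, field="type"):
    """Tally test cases by `field`; the default field name is an assumption."""
    return Counter(row.get(field, "unknown") for row in read_jsonl(path))
```

For example, `count_by_type(split_path("path/to/local/copy", "rotated"))` would return a `Counter` keyed by whatever values the assumed `type` field holds; rows without that field are counted under `"unknown"`.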
In total, the benchmark now contains 28,770 test cases across 3,401 unique PDFs spanning 15 categories. The new categories are:

- *Rotated*: we sample 140 PDFs from the original [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) and rotate their pages by 90, 180, or 270 degrees, so that you can easily test whether your OCR model is rotation invariant.
- *Blank Pages*: we sample 102 blank or mostly blank documents and test that the model outputs very little text for these pages. This tests for model hallucinations.
- *Synthetic exact match*: these tests consist of typos inserted into otherwise normal PDFs. We test for an exact text match, expecting a good OCR pipeline to be faithful to the text as written in the original document, even if there is an obvious typo.
- *Synthetic footnotes*: we sample documents with footnotes and check that those footnotes (subscripts and superscripts) are faithfully transcribed by an OCR tool.
- *Synthetic formatting*: we check that models apply bold, italic, and heading tags in the appropriate places in a document.
- *Synthetic tables hard*: we further synthetically augment tables to include extra rows and columns, and check that they are represented well.
- *Synthetic general*: a broad set of synthetically rendered documents testing all major OCR capabilities, including text presence, reading order, tables, math, footnotes, and formatting.
- *Synthetic dense*: the same set of documents as `synthetic_general`, except they have been augmented to have more densely layered text and features with smaller font sizes.

Quick links:

- 📃 [Paper](https://olmocr.allenai.org/papers/olmocr.pdf)
- 🛠️ [Code](https://github.com/allenai/olmocr)
- 🎮 [Demo](https://olmocr.allenai.org/)

## Table 1. Distribution of Test Classes by Document Source

| Document Source | Text Present | Text Absent | Reading Order | Table | Math | Footnote | Formatting | Total Tests | Unique PDFs |
|---|---|---|---|---|---|---|---|---|---|
| arXiv Math | - | - | - | - | 2,927 | - | - | 2,927 | 522 |
| Headers Footers | - | 753 | - | - | - | - | - | 760 | 266 |
| Long Tiny Text | 442 | - | - | - | - | - | - | 442 | 62 |
| Multi Column | - | - | 884 | - | - | - | - | 884 | 231 |
| Old Scans | 279 | 70 | 177 | - | - | - | - | 526 | 98 |
| Old Scans Math | - | - | - | - | 458 | - | - | 458 | 36 |
| Table Tests | - | - | - | 1,020 | - | - | - | 1,022 | 188 |
| Rotated - **new** | 65 | 83 | 89 | 91 | 387 | - | - | 716 | 140 |
| Blank Pages - **new** | - | - | - | - | - | - | - | 102 | 102 |
| Synthetic Exact Match - **new** | 863 | - | - | - | - | - | - | 1,354 | 491 |
| Synthetic Footnotes - **new** | - | - | - | - | - | 744 | - | 1,090 | 346 |
| Synthetic Formatting - **new** | - | - | - | - | - | - | 998 | 1,337 | 339 |
| Synthetic Tables Hard - **new** | 277 | 263 | 216 | 4,879 | - | 10 | 138 | 5,915 | 132 |
| Synthetic General - **new** | 1,360 | 317 | 1,021 | 530 | 129 | 102 | 466 | 4,152 | 227 |
| Synthetic Dense - **new** | 1,826 | 320 | 1,163 | 2,403 | 361 | 128 | 663 | 7,085 | 221 |
| **Total** | **5,112** | **1,806** | **3,550** | **8,923** | **4,262** | **984** | **2,265** | **28,770** | **3,401** |

## Evaluation Criteria

- Text Presence: Checks if a short text segment (1–3 sentences) is correctly identified in the OCR output. Supports fuzzy matching and positional constraints (e.g., must appear in the first/last N characters). Case-sensitive by default.
- Text Absence: Ensures specified text (e.g., headers, footers, page numbers) is excluded. Supports fuzzy matching and positional constraints. Not case-sensitive.
- Natural Reading Order: Verifies the relative order of two text spans (e.g., headline before paragraph). Soft matching enabled; case-sensitive by default.
- Table Accuracy: Confirms that specific cell values exist in tables with correct neighboring relationships (e.g., value above/below another). Supports Markdown and HTML, though complex structures require HTML.
- Math Formula Accuracy: Detects the presence of a target equation by matching symbol layout (e.g., $\int$ to the left of $x$). Based on rendered bounding boxes and relative positioning.
- Formatting: Verifies that specific text appears with correct formatting (heading, bold, or italic). Extracts all formatted text from the output using Markdown and HTML patterns, then checks for fuzzy matches against the expected text.
- Footnote: Verifies that footnote markers appear correctly in the output. Checks for markers in Markdown (`[^1]`), HTML (`<sup>1</sup>`), and Unicode superscript formats, with optional validation that specific text appears immediately before or after the marker.
- Baseline: Ensures basic output quality: the page is not blank (contains alphanumeric characters), has no excessive n-gram repetition, and contains no disallowed character sets (e.g., CJK, emoji). Also used to verify that blank pages produce minimal output.

### License

This dataset is licensed under ODC-BY-1.0. It is intended for research and educational use in accordance with AI2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
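To make the Text Presence, Text Absence, and Natural Reading Order criteria listed under Evaluation Criteria more concrete, here is a rough sketch of how such checks might work. It is not the benchmark's actual implementation: the fuzzy matcher is swapped in from Python's standard `difflib`, and the function names, the `threshold` value, and the `first_n`/`last_n` parameters are illustrative assumptions.

```python
from difflib import SequenceMatcher


def best_match(needle, haystack):
    """Return (score, start) for the haystack window most similar to needle."""
    n = len(needle)
    if n == 0 or len(haystack) < n:
        matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
        return matcher.ratio(), 0
    best = (0.0, 0)
    for start in range(len(haystack) - n + 1):
        window = haystack[start:start + n]
        score = SequenceMatcher(None, needle, window, autojunk=False).ratio()
        if score > best[0]:  # keep the earliest window on ties
            best = (score, start)
    return best


def text_present(text, output, threshold=0.9, first_n=None, last_n=None,
                 case_sensitive=True):
    """Fuzzy presence check with optional positional constraints."""
    if first_n is not None:
        output = output[:first_n]
    if last_n is not None:
        output = output[max(len(output) - last_n, 0):]
    if not case_sensitive:
        text, output = text.lower(), output.lower()
    score, _ = best_match(text, output)
    return score >= threshold


def text_absent(text, output, threshold=0.9, first_n=None, last_n=None,
                case_sensitive=False):
    """Absence is the negation of presence; case-insensitive by default."""
    return not text_present(text, output, threshold, first_n, last_n,
                            case_sensitive)


def in_order(before, after, output, threshold=0.9):
    """Both spans must be found, and `before` must start first."""
    score_b, pos_b = best_match(before, output)
    score_a, pos_a = best_match(after, output)
    return score_b >= threshold and score_a >= threshold and pos_b < pos_a
```

The sliding-window search is quadratic in the worst case, which is acceptable for page-sized outputs in a sketch; a production checker would more likely use an edit-distance search with an explicit error budget.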
Provider:
allenai