allenai/olmOCR-bench-1.5-preview
Hugging Face · Updated 2026-03-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/allenai/olmOCR-bench-1.5-preview
Resource description:
---
license: odc-by
tags:
- text
configs:
- config_name: olmocr-bench
  data_files:
  - split: arxiv_math
    path:
    - bench_data/arxiv_math.jsonl
  - split: headers_footers
    path:
    - bench_data/headers_footers.jsonl
  - split: long_tiny_text
    path:
    - bench_data/long_tiny_text.jsonl
  - split: multi_column
    path:
    - bench_data/multi_column.jsonl
  - split: old_scans
    path:
    - bench_data/old_scans.jsonl
  - split: old_scans_math
    path:
    - bench_data/old_scans_math.jsonl
  - split: table_tests
    path:
    - bench_data/table_tests.jsonl
  - split: rotated
    path:
    - bench_data/rotated.jsonl
  - split: blank_pages
    path:
    - bench_data/blank_pages.jsonl
  - split: synthetic_exact_match
    path:
    - bench_data/synthetic_exact_match.jsonl
  - split: synthetic_footnotes
    path:
    - bench_data/synthetic_footnotes.jsonl
  - split: synthetic_formatting
    path:
    - bench_data/synthetic_formatting.jsonl
  - split: synthetic_tables_hard
    path:
    - bench_data/synthetic_tables_hard.jsonl
  - split: synthetic_general
    path:
    - bench_data/synthetic_general.jsonl
  - split: synthetic_dense
    path:
    - bench_data/synthetic_dense.jsonl
language:
- en
pretty_name: olmOCR-bench
size_categories:
- 10K<n<100K
---
# olmOCR-bench-1.5-preview
olmOCR-bench-1.5-preview is a preview follow-up to the original [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) that adds several new synthetic benchmark categories.
In addition to the original 1,403 PDF files and 7,010 unit test cases that were manually created as part of [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench), this repo contains additional synthetic tests designed to cover difficult OCR scenarios. In all synthetic cases, we sample PDFs from the same distribution as [dolma3_mix-6T](https://huggingface.co/datasets/allenai/dolma3_mix-6T), re-render them using Claude into clean semantic HTML following the [olmOCR synthetic pipeline](https://github.com/allenai/olmocr/blob/devel/olmocr/synth/mine_html_templates.py), and automatically extract test cases. In total, the benchmark now contains 28,770 test cases across 3,401 unique PDFs spanning 15 categories. The new categories are:
- *Rotated*, we sample 140 PDFs from the original [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) and rotate their pages by 90, 180, or 270 degrees, so that you can easily test whether your OCR model is rotation-invariant.
- *Blank Pages*, we sample 102 blank or mostly blank documents and check that the model outputs very little text for these pages. This tests for model hallucinations.
- *Synthetic exact match*, these tests consist of typos inserted into otherwise normal PDFs. We test for an exact text match, expecting a good OCR pipeline to be faithful to the text as written in the original document, even if there is an obvious typo.
- *Synthetic footnotes*, we sample documents with footnotes and check that the footnote markers (subscripts and superscripts) are faithfully transcribed by an OCR tool.
- *Synthetic formatting*, we check that models apply bold, italic, and heading tags in the appropriate places in a document.
- *Synthetic tables hard*, we further synthetically augment tables to include extra rows and columns, and check that they are represented well.
- *Synthetic general*, a broad set of synthetically rendered documents testing all major OCR capabilities including text presence, reading order, tables, math, footnotes, and formatting.
- *Synthetic dense*, the same set of documents as `synthetic_general`, except they have been augmented to have more densely layered text and features with smaller font sizes.
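
Each category corresponds to one split of the `olmocr-bench` config declared in the front matter. The sketch below shows one way the splits might be loaded; it assumes the third-party Hugging Face `datasets` library is installed and that the repo id matches the download link above. The helper names `data_file` and `load_split` are hypothetical and not part of olmOCR.

```python
# Repo id taken from the download link; config and split names from the front matter.
REPO_ID = "allenai/olmOCR-bench-1.5-preview"
CONFIG = "olmocr-bench"
SPLITS = [
    "arxiv_math", "headers_footers", "long_tiny_text", "multi_column",
    "old_scans", "old_scans_math", "table_tests", "rotated", "blank_pages",
    "synthetic_exact_match", "synthetic_footnotes", "synthetic_formatting",
    "synthetic_tables_hard", "synthetic_general", "synthetic_dense",
]


def data_file(split: str) -> str:
    """Return the JSONL path the front matter lists for a split (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"bench_data/{split}.jsonl"


def load_split(split: str):
    """Load one split with the `datasets` library (needs network access)."""
    from datasets import load_dataset  # imported lazily; third-party dependency

    data_file(split)  # validates the split name before hitting the network
    return load_dataset(REPO_ID, CONFIG, split=split)


if __name__ == "__main__":
    for name in SPLITS:
        print(name, "->", data_file(name))
```

Calling `load_split("rotated")` would then fetch only the rotated-page tests, assuming the hub resolves the config as declared.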
Quick links:
- 📃 [Paper](https://olmocr.allenai.org/papers/olmocr.pdf)
- 🛠️ [Code](https://github.com/allenai/olmocr)
- 🎮 [Demo](https://olmocr.allenai.org/)
## Table 1. Distribution of Test Classes by Document Source
| Document Source | Text Present | Text Absent | Reading Order | Table | Math | Footnote | Formatting | Total Tests | Unique PDFs |
|-------------------------|--------------|-------------|---------------|-------|------|----------|------------|-------------|-------------|
| arXiv Math | - | - | - | - | 2,927| - | - | 2,927 | 522 |
| Headers Footers | - | 753 | - | - | - | - | - | 760 | 266 |
| Long Tiny Text | 442 | - | - | - | - | - | - | 442 | 62 |
| Multi Column | - | - | 884 | - | - | - | - | 884 | 231 |
| Old Scans | 279 | 70 | 177 | - | - | - | - | 526 | 98 |
| Old Scans Math | - | - | - | - | 458 | - | - | 458 | 36 |
| Table Tests | - | - | - | 1,020 | - | - | - | 1,022 | 188 |
| Rotated - **new** | 65 | 83 | 89 | 91 | 387 | - | - | 716 | 140 |
| Blank Pages - **new** | - | - | - | - | - | - | - | 102 | 102 |
| Synthetic Exact Match - **new** | 863 | - | - | - | - | - | - | 1,354 | 491 |
| Synthetic Footnotes - **new** | - | - | - | - | - | 744 | - | 1,090 | 346 |
| Synthetic Formatting - **new** | - | - | - | - | - | - | 998 | 1,337 | 339 |
| Synthetic Tables Hard - **new** | 277 | 263 | 216 | 4,879 | - | 10 | 138 | 5,915 | 132 |
| Synthetic General - **new** | 1,360 | 317 | 1,021 | 530 | 129 | 102 | 466 | 4,152 | 227 |
| Synthetic Dense - **new** | 1,826 | 320 | 1,163 | 2,403 | 361 | 128 | 663 | 7,085 | 221 |
| **Total** | **5,112** | **1,806** | **3,550** | **8,923** | **4,262** | **984** | **2,265** | **28,770** | **3,401** |
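
The totals in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the table (dashes treated as zero) and recomputes the column totals, the grand total, and the per-row remainder that is not attributed to any listed test class. The variable names are illustrative only.

```python
# Columns: Text Present, Text Absent, Reading Order, Table, Math, Footnote, Formatting
ROWS = [
    # (source, per-class counts, total tests, unique PDFs)
    ("arXiv Math",            [0, 0, 0, 0, 2927, 0, 0],                2927, 522),
    ("Headers Footers",       [0, 753, 0, 0, 0, 0, 0],                 760,  266),
    ("Long Tiny Text",        [442, 0, 0, 0, 0, 0, 0],                 442,  62),
    ("Multi Column",          [0, 0, 884, 0, 0, 0, 0],                 884,  231),
    ("Old Scans",             [279, 70, 177, 0, 0, 0, 0],              526,  98),
    ("Old Scans Math",        [0, 0, 0, 0, 458, 0, 0],                 458,  36),
    ("Table Tests",           [0, 0, 0, 1020, 0, 0, 0],                1022, 188),
    ("Rotated",               [65, 83, 89, 91, 387, 0, 0],             716,  140),
    ("Blank Pages",           [0, 0, 0, 0, 0, 0, 0],                   102,  102),
    ("Synthetic Exact Match", [863, 0, 0, 0, 0, 0, 0],                 1354, 491),
    ("Synthetic Footnotes",   [0, 0, 0, 0, 0, 744, 0],                 1090, 346),
    ("Synthetic Formatting",  [0, 0, 0, 0, 0, 0, 998],                 1337, 339),
    ("Synthetic Tables Hard", [277, 263, 216, 4879, 0, 10, 138],       5915, 132),
    ("Synthetic General",     [1360, 317, 1021, 530, 129, 102, 466],   4152, 227),
    ("Synthetic Dense",       [1826, 320, 1163, 2403, 361, 128, 663],  7085, 221),
]

# Sum each test-class column across all document sources.
column_totals = [sum(col) for col in zip(*(counts for _, counts, _, _ in ROWS))]
grand_total = sum(total for _, _, total, _ in ROWS)
total_pdfs = sum(pdfs for _, _, _, pdfs in ROWS)

# Tests in each row that are not counted under any listed class column.
residual = {name: total - sum(counts) for name, counts, total, _ in ROWS}

if __name__ == "__main__":
    print("column totals:", column_totals)
    print("grand total:", grand_total, "unique PDFs:", total_pdfs)
    print("rows with unlisted tests:", {k: v for k, v in residual.items() if v})
```

The recomputed column totals, grand total, and PDF count agree with the **Total** row. For several rows the remainder is nonzero, and for Blank Pages and the synthetic rows it equals the Unique PDFs count. One plausible reading, which the card does not state, is that these are Baseline tests, one per PDF.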
## Evaluation Criteria
- Text Presence: Checks if a short text segment (1–3 sentences) is correctly identified in the OCR output. Supports fuzzy matching and positional constraints (e.g., must appear in the first/last N characters). Case-sensitive by default.
- Text Absence: Ensures specified text (e.g., headers, footers, page numbers) is excluded. Supports fuzzy matching and positional constraints. Not case-sensitive.
- Natural Reading Order: Verifies the relative order of two text spans (e.g., headline before paragraph). Soft matching enabled; case-sensitive by default.
- Table Accuracy: Confirms that specific cell values exist in tables with correct neighboring relationships (e.g., value above/below another). Supports Markdown and HTML, though complex structures require HTML.
- Math Formula Accuracy: Detects the presence of a target equation by matching symbol layout (e.g., $\int$ to the left of $x$). Based on rendered bounding boxes and relative positioning.
- Formatting: Verifies that specific text appears with correct formatting (heading, bold, or italic). Extracts all formatted text from the output using Markdown and HTML patterns, then checks for fuzzy matches against the expected text.
- Footnote: Verifies that footnote markers appear correctly in the output. Checks for markers in Markdown (`[^1]`), HTML (`<sup>1</sup>`), and Unicode superscript formats, with optional validation that specific text appears immediately before or after the marker.
- Baseline: Ensures basic output quality: the page is not blank (contains alphanumeric characters), has no excessive n-gram repetition, and contains no disallowed character sets (e.g., CJK, emoji). Also used to verify that blank pages produce minimal output.
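
To make a few of these criteria concrete, here is a minimal sketch of text presence, text absence, reading order, footnote marker, and blank-page checks. It is not the olmOCR implementation: `difflib` stands in for whatever fuzzy matcher the benchmark uses, and the function names, the 0.9 similarity threshold, and the 20-character blank-page limit are all assumptions made for illustration. Table and math checks are omitted because they need structural parsing and rendering.

```python
import re
from difflib import SequenceMatcher

# Maps ASCII digits to Unicode superscript digits, e.g. "12" -> "¹²".
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def fuzzy_find(needle: str, haystack: str, threshold: float = 0.9) -> int:
    """Return the start of the best-matching window in haystack, or -1 if below threshold."""
    exact = haystack.find(needle)
    if exact >= 0:
        return exact
    n = len(needle)
    best_ratio, best_pos = 0.0, -1
    # Slide a needle-sized window over the haystack (quadratic; fine for a sketch).
    for i in range(max(1, len(haystack) - n + 1)):
        ratio = SequenceMatcher(None, needle, haystack[i:i + n], autojunk=False).ratio()
        if ratio > best_ratio:
            best_ratio, best_pos = ratio, i
    return best_pos if best_ratio >= threshold else -1


def text_present(text: str, output: str, threshold: float = 0.9) -> bool:
    """Text Presence: case-sensitive fuzzy match."""
    return fuzzy_find(text, output, threshold) >= 0


def text_absent(text: str, output: str, threshold: float = 0.9) -> bool:
    """Text Absence: case-insensitive fuzzy match must fail."""
    return fuzzy_find(text.lower(), output.lower(), threshold) < 0


def reading_order(before: str, after: str, output: str, threshold: float = 0.9) -> bool:
    """Natural Reading Order: both spans found, and `before` starts first."""
    pos_before = fuzzy_find(before, output, threshold)
    pos_after = fuzzy_find(after, output, threshold)
    return 0 <= pos_before < pos_after


def has_footnote_marker(output: str, marker: str) -> bool:
    """Footnote: marker appears as Markdown, HTML <sup>, or Unicode superscript."""
    patterns = [
        re.escape(f"[^{marker}]"),
        rf"<sup>\s*{re.escape(marker)}\s*</sup>",
    ]
    superscript = marker.translate(SUPERSCRIPTS)
    if superscript != marker:
        patterns.append(re.escape(superscript))
    return any(re.search(p, output) for p in patterns)


def is_mostly_blank(output: str, max_alnum: int = 20) -> bool:
    """Blank-page check: at most max_alnum alphanumeric characters (assumed limit)."""
    return sum(ch.isalnum() for ch in output) <= max_alnum


if __name__ == "__main__":
    page = "# Title\n\nThe quick brown fox jumps over the lazy dog.[^1]\n\nSecond paragraph here."
    print(text_present("quick brown fax", page))               # one-character OCR slip
    print(reading_order("quick brown fox", "Second paragraph", page))
    print(has_footnote_marker(page, "1"))
```

The case handling follows the descriptions above: presence and reading order compare text as written, while absence lowercases both sides before matching.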
### License
This dataset is licensed under ODC-BY-1.0. It is intended for research and educational use in accordance with AI2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
Provider:
allenai



