piushorn/pdf-parse-bench

Name: piushorn/pdf-parse-bench
Creator: piushorn
Published: 2026-03-27 15:29:30
License: MIT

Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/piushorn/pdf-parse-bench

Description:

---
license: mit
task_categories:
- image-to-text
- document-question-answering
language:
- en
tags:
- pdf-parsing
- ocr
- benchmark
- mathematical-formulas
- tables
- llm-as-a-judge
size_categories:
- n<1K
configs:
- config_name: 2026-q1-tables-only
  data_files:
  - split: test
    path: 2026-q1-tables-only/test.jsonl
- config_name: 2026-q1-formulas-only
  data_files:
  - split: test
    path: 2026-q1-formulas-only/test.jsonl
---

# PDF Parse Bench

[![GitHub](https://img.shields.io/badge/GitHub-phorn1%2Fpdf--parse--bench-181717?logo=github&logoColor=white)](https://github.com/phorn1/pdf-parse-bench)
[![PyPI](https://img.shields.io/pypi/v/pdf-parse-bench)](https://pypi.org/project/pdf-parse-bench/)
[![arXiv](https://img.shields.io/badge/arXiv-2512.09874-b31b1b?logo=arxiv)](https://arxiv.org/abs/2512.09874)
[![arXiv](https://img.shields.io/badge/arXiv-2603.18652-b31b1b?logo=arxiv)](https://arxiv.org/abs/2603.18652)

Benchmark for evaluating how effectively PDF parsing solutions extract **mathematical formulas** and **tables** from documents. We generate synthetic PDFs with diverse formatting scenarios, parse them with different parsers, and score the extracted content using **LLM-as-a-Judge**. This semantic evaluation approach [substantially outperforms traditional metrics](https://github.com/phorn1/pdf-parse-bench#why-llm-as-a-judge) in agreement with human judgment.

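The front matter declares two configs, each with a single `test` split stored as a JSON Lines file. As a minimal sketch, assuming one of those `test.jsonl` files has been downloaded locally, the records can be inspected with the standard library alone. The helper name `read_jsonl` and the local path are illustrative, and since this card does not describe the record schema, the sketch only prints whichever field names it finds.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str | Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical local copy of the file named in the `configs` block.
    local_file = Path("2026-q1-tables-only/test.jsonl")
    if local_file.exists():
        records = list(read_jsonl(local_file))
        print(f"{len(records)} records")
        if records:
            print("fields:", sorted(records[0].keys()))
    else:
        print(f"{local_file} not found; download it first")
```

With the Hugging Face `datasets` library installed, `load_dataset("piushorn/pdf-parse-bench", "2026-q1-tables-only", split="test")` should load the same split, assuming the library resolves the configs as declared above.
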
## Leaderboard (2026-Q1)

Results are based on two benchmark datasets, each containing 100 synthetic PDFs:

| Parser | Tables | Formulas |
|--------|--------|----------|
| [Gemini 3 Flash](https://deepmind.google/models/gemini/flash/) | 9.50 | 9.79 |
| [LightOnOCR-2-1B](https://huggingface.co/lightonai/LightOnOCR-2-1B) | 9.08 | 9.57 |
| [Mistral OCR](https://mistral.ai/) | 8.89 | 9.48 |
| [dots.ocr](https://github.com/rednote-hilab/dots.ocr) | 8.73 | 9.55 |
| [Mathpix](https://mathpix.com/) | 8.53 | 9.66 |
| [Chandra](https://huggingface.co/datalab-to/chandra) | 8.43 | 9.45 |
| [Qwen3-VL-235B](https://github.com/QwenLM/Qwen3-VL) | 8.43 | 9.84 |
| [MonkeyOCR-pro-3B](https://github.com/Yuliang-Liu/MonkeyOCR) | 8.39 | 9.50 |
| [GLM-4.5V](https://github.com/zai-org/GLM-V) | 7.98 | 9.37 |
| [GPT-5 mini](https://openai.com/) | 7.14 | 5.57 |
| [Claude Sonnet 4.6](https://docs.anthropic.com/en/docs/about-claude/models) | 7.02 | 8.50 |
| [Nanonets-OCR-s](https://huggingface.co/nanonets/Nanonets-OCR-s) | 6.92 | 9.21 |
| [PP-StructureV3](https://github.com/PaddlePaddle/PaddleOCR) | 6.86 | 9.59 |
| [Gemini 2.5 Flash](https://deepmind.google/models/gemini/flash/) | 6.85 | 6.51 |
| [MinerU2.5](https://mineru.net/) | 6.49 | 9.32 |
| [GPT-5 nano](https://openai.com/) | 6.48 | 4.78 |
| [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) | 5.75 | 8.97 |
| [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5) | 5.39 | 8.47 |
| [PyMuPDF4LLM](https://github.com/pymupdf/PyMuPDF4LLM) | 5.25 | 4.53 |
| [GOT-OCR2.0](https://github.com/Ucas-HaoranWei/GOT-OCR2.0) | 5.13 | 8.01 |
| [olmOCR-2-7B](https://github.com/allenai/olmocr) | 4.05 | 9.35 |
| [GROBID](https://github.com/kermitt2/grobid) | 2.10 | 7.01 |

All scores are **LLM-as-a-Judge** ratings on a 0–10 scale, judged by Gemini 3 Flash via OpenRouter.

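The leaderboard is ordered by the Tables score, and the Formulas ranking differs noticeably. As a small illustration, the snippet below re-sorts a handful of rows copied from the leaderboard by either column. The `LEADERBOARD` list, the `COLUMNS` mapping, and the `rank_by` helper are names introduced here for illustration; they are not part of the `pdf-parse-bench` package.

```python
from operator import itemgetter

# (parser, tables score, formulas score): a subset of rows copied from the leaderboard.
LEADERBOARD = [
    ("Gemini 3 Flash", 9.50, 9.79),
    ("LightOnOCR-2-1B", 9.08, 9.57),
    ("Mathpix", 8.53, 9.66),
    ("Qwen3-VL-235B", 8.43, 9.84),
    ("GPT-5 mini", 7.14, 5.57),
    ("olmOCR-2-7B", 4.05, 9.35),
]

# Tuple index of each score column.
COLUMNS = {"tables": 1, "formulas": 2}


def rank_by(column: str) -> list[str]:
    """Return parser names sorted by the given column, best first."""
    rows = sorted(LEADERBOARD, key=itemgetter(COLUMNS[column]), reverse=True)
    return [name for name, *_ in rows]


if __name__ == "__main__":
    print("by tables:  ", rank_by("tables"))
    print("by formulas:", rank_by("formulas"))
```

In this subset, a parser such as olmOCR-2-7B sits last on tables yet stays well ahead of GPT-5 mini on formulas, which is why the two columns are worth reading separately.
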
## Datasets

- **`2026-q1-tables-only`**: 100 PDFs with 451 tables (simple, moderate, complex)
- **`2026-q1-formulas-only`**: 100 PDFs with 1413 inline + 657 display-mode mathematical formulas

PDFs are generated synthetically using LaTeX with randomized parameters (document class, fonts, margins, column layout, line spacing). Since PDFs are generated from LaTeX source, ground truth is obtained automatically.

## How to Evaluate Your Parser

```bash
pip install pdf-parse-bench
```

See the full evaluation guide at **[github.com/phorn1/pdf-parse-bench](https://github.com/phorn1/pdf-parse-bench)**.

## Why LLM-as-a-Judge?

Rule-based metrics correlate poorly with human judgment. We validated this in two human annotation studies:

- **[formula-metric-study](https://github.com/phorn1/formula-metric-study)**: 750 human ratings; text metrics r = 0.01, CDM r = 0.31, LLM judges r = 0.74–0.82
- **[table-metric-study](https://github.com/phorn1/table-metric-study)**: 1,500+ human ratings; rule-based metrics (TEDS, GriTS) top out at r = 0.70, LLM judges r = 0.94

## Citation

```bibtex
@misc{horn2025formulabench,
  title         = {Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs},
  author        = {Horn, Pius and Keuper, Janis},
  year          = {2025},
  eprint        = {2512.09874},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2512.09874}
}

@misc{horn2026tablebench,
  title         = {Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation},
  author        = {Horn, Pius and Keuper, Janis},
  year          = {2026},
  eprint        = {2603.18652},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.18652}
}
```

## Acknowledgments

This work has been supported by the German Federal Ministry of Research, Technology and Space (BMFTR) in the program "Forschung an Fachhochschulen in Kooperation mit Unternehmen (FH-Kooperativ)" within the joint project **LLMpraxis** under grant 13FH622KX2.

<p align="center"> <img src="https://raw.githubusercontent.com/phorn1/pdf-parse-bench/main/assets/BMFTR_logo.png" alt="BMFTR" width="150" /> <img src="https://raw.githubusercontent.com/phorn1/pdf-parse-bench/main/assets/HAW_logo.png" alt="HAW" width="150" /> </p>
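
The correlation figures quoted under "Why LLM-as-a-Judge?" compare automatic scores against human ratings. The card does not say which coefficient is used; assuming it is Pearson's r, the computation can be sketched as follows. All ratings below are made-up illustrative numbers, not data from the annotation studies, and `pearson_r` is a helper defined here rather than a function of the benchmark package.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up example: human ratings vs. two hypothetical automatic metrics.
    human = [2.0, 4.0, 5.0, 7.0, 9.0]
    judge_like = [1.5, 4.5, 5.0, 6.5, 9.5]  # tracks the human ratings closely
    text_like = [6.0, 3.0, 7.0, 4.0, 5.0]   # mostly unrelated to them
    print(f"judge-like metric: r = {pearson_r(human, judge_like):.2f}")
    print(f"text-like metric:  r = {pearson_r(human, text_like):.2f}")
```

A value near 1 means the metric orders and spaces outputs much as human raters do, while a value near 0 means it carries almost no information about human judgment.
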
