
piushorn/pdf-parse-bench

Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/piushorn/pdf-parse-bench
Description:
---
license: mit
task_categories:
- image-to-text
- document-question-answering
language:
- en
tags:
- pdf-parsing
- ocr
- benchmark
- mathematical-formulas
- tables
- llm-as-a-judge
size_categories:
- n<1K
configs:
- config_name: 2026-q1-tables-only
  data_files:
  - split: test
    path: 2026-q1-tables-only/test.jsonl
- config_name: 2026-q1-formulas-only
  data_files:
  - split: test
    path: 2026-q1-formulas-only/test.jsonl
---

# PDF Parse Bench

[![GitHub](https://img.shields.io/badge/GitHub-phorn1%2Fpdf--parse--bench-181717?logo=github&logoColor=white)](https://github.com/phorn1/pdf-parse-bench)
[![PyPI](https://img.shields.io/pypi/v/pdf-parse-bench)](https://pypi.org/project/pdf-parse-bench/)
[![arXiv](https://img.shields.io/badge/arXiv-2512.09874-b31b1b?logo=arxiv)](https://arxiv.org/abs/2512.09874)
[![arXiv](https://img.shields.io/badge/arXiv-2603.18652-b31b1b?logo=arxiv)](https://arxiv.org/abs/2603.18652)

Benchmark for evaluating how effectively PDF parsing solutions extract **mathematical formulas** and **tables** from documents. We generate synthetic PDFs with diverse formatting scenarios, parse them with different parsers, and score the extracted content using **LLM-as-a-Judge**. This semantic evaluation approach [substantially outperforms traditional metrics](https://github.com/phorn1/pdf-parse-bench#why-llm-as-a-judge) in agreement with human judgment.

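The `configs` block in the card's metadata maps each config name to a `test.jsonl` file. As a rough sketch of how one might inspect those files, the snippet below reads a split line by line and reports the record count and the top-level keys it finds. It assumes the dataset repository has already been downloaded to a local directory (the `pdf-parse-bench` path is a placeholder), and it makes no assumption about field names, since the card does not document the record schema. The helper names `iter_records` and `summarize` are illustrative, not part of the `pdf-parse-bench` package.

```python
import json
from pathlib import Path
from typing import Iterator

# Config names and relative paths as declared in the dataset card's metadata.
CONFIG_FILES = {
    "2026-q1-tables-only": "2026-q1-tables-only/test.jsonl",
    "2026-q1-formulas-only": "2026-q1-formulas-only/test.jsonl",
}


def iter_records(root: Path, config: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of the config's test split."""
    path = Path(root) / CONFIG_FILES[config]
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def summarize(root: Path, config: str) -> tuple[int, list[str]]:
    """Return the record count and the sorted union of top-level keys."""
    count = 0
    keys: set[str] = set()
    for record in iter_records(root, config):
        count += 1
        keys.update(record)  # iterating a dict yields its keys
    return count, sorted(keys)


if __name__ == "__main__":
    # Assumption: local copy of the dataset repository (hypothetical path).
    root = Path("pdf-parse-bench")
    for name, rel_path in CONFIG_FILES.items():
        if (root / rel_path).exists():
            print(name, summarize(root, name))
```

Reading the raw JSONL keeps the sketch dependency-free; loading the same configs through a dataset library would work as well, but is not shown here.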
## Leaderboard (2026-Q1)

Results are based on two benchmark datasets, each containing 100 synthetic PDFs:

| Parser | Tables | Formulas |
|--------|--------|----------|
| [Gemini 3 Flash](https://deepmind.google/models/gemini/flash/) | 9.50 | 9.79 |
| [LightOnOCR-2-1B](https://huggingface.co/lightonai/LightOnOCR-2-1B) | 9.08 | 9.57 |
| [Mistral OCR](https://mistral.ai/) | 8.89 | 9.48 |
| [dots.ocr](https://github.com/rednote-hilab/dots.ocr) | 8.73 | 9.55 |
| [Mathpix](https://mathpix.com/) | 8.53 | 9.66 |
| [Chandra](https://huggingface.co/datalab-to/chandra) | 8.43 | 9.45 |
| [Qwen3-VL-235B](https://github.com/QwenLM/Qwen3-VL) | 8.43 | 9.84 |
| [MonkeyOCR-pro-3B](https://github.com/Yuliang-Liu/MonkeyOCR) | 8.39 | 9.50 |
| [GLM-4.5V](https://github.com/zai-org/GLM-V) | 7.98 | 9.37 |
| [GPT-5 mini](https://openai.com/) | 7.14 | 5.57 |
| [Claude Sonnet 4.6](https://docs.anthropic.com/en/docs/about-claude/models) | 7.02 | 8.50 |
| [Nanonets-OCR-s](https://huggingface.co/nanonets/Nanonets-OCR-s) | 6.92 | 9.21 |
| [PP-StructureV3](https://github.com/PaddlePaddle/PaddleOCR) | 6.86 | 9.59 |
| [Gemini 2.5 Flash](https://deepmind.google/models/gemini/flash/) | 6.85 | 6.51 |
| [MinerU2.5](https://mineru.net/) | 6.49 | 9.32 |
| [GPT-5 nano](https://openai.com/) | 6.48 | 4.78 |
| [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) | 5.75 | 8.97 |
| [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5) | 5.39 | 8.47 |
| [PyMuPDF4LLM](https://github.com/pymupdf/PyMuPDF4LLM) | 5.25 | 4.53 |
| [GOT-OCR2.0](https://github.com/Ucas-HaoranWei/GOT-OCR2.0) | 5.13 | 8.01 |
| [olmOCR-2-7B](https://github.com/allenai/olmocr) | 4.05 | 9.35 |
| [GROBID](https://github.com/kermitt2/grobid) | 2.10 | 7.01 |

All scores are **LLM-as-a-Judge** ratings on a 0–10 scale, judged by Gemini 3 Flash via OpenRouter.

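The leaderboard is ordered by the Tables column, so a parser that is strong on formulas but weak on tables can sit far down the list. The sketch below re-ranks a hand-copied subset of the rows above by the Formulas score and by an unweighted mean of both columns. The mean is only an illustrative aggregate chosen here, not a metric the benchmark defines, and `Row`, `mean_score`, `by_formulas`, and `by_mean` are hypothetical names.

```python
from typing import NamedTuple


class Row(NamedTuple):
    parser: str
    tables: float    # LLM-as-a-Judge score, 0-10 scale
    formulas: float  # LLM-as-a-Judge score, 0-10 scale


# Subset of the 2026-Q1 leaderboard, copied from the table above.
ROWS = [
    Row("Gemini 3 Flash", 9.50, 9.79),
    Row("LightOnOCR-2-1B", 9.08, 9.57),
    Row("Mistral OCR", 8.89, 9.48),
    Row("Mathpix", 8.53, 9.66),
    Row("Qwen3-VL-235B", 8.43, 9.84),
    Row("olmOCR-2-7B", 4.05, 9.35),
    Row("GROBID", 2.10, 7.01),
]


def mean_score(row: Row) -> float:
    """Unweighted mean of both columns (an assumption, not a benchmark metric)."""
    return (row.tables + row.formulas) / 2


# Highest score first in both rankings.
by_formulas = sorted(ROWS, key=lambda r: r.formulas, reverse=True)
by_mean = sorted(ROWS, key=mean_score, reverse=True)

if __name__ == "__main__":
    print("Ranked by Formulas:")
    for row in by_formulas:
        print(f"  {row.parser:<16} {row.formulas:.2f}")
    print("Ranked by mean of Tables and Formulas:")
    for row in by_mean:
        print(f"  {row.parser:<16} {mean_score(row):.3f}")
```

Within this subset, Qwen3-VL-235B leads on formulas while Gemini 3 Flash leads on the mean, which shows how much the ordering depends on the column chosen.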
## Datasets

- **`2026-q1-tables-only`**: 100 PDFs with 451 tables (simple, moderate, complex)
- **`2026-q1-formulas-only`**: 100 PDFs with 1413 inline + 657 display-mode mathematical formulas

PDFs are generated synthetically using LaTeX with randomized parameters (document class, fonts, margins, column layout, line spacing). Since PDFs are generated from LaTeX source, ground truth is obtained automatically.

## How to Evaluate Your Parser

```bash
pip install pdf-parse-bench
```

See the full evaluation guide at **[github.com/phorn1/pdf-parse-bench](https://github.com/phorn1/pdf-parse-bench)**.

## Why LLM-as-a-Judge?

Rule-based metrics correlate poorly with human judgment. We validated this in two human annotation studies:

- **[formula-metric-study](https://github.com/phorn1/formula-metric-study)**: 750 human ratings; text metrics r = 0.01, CDM r = 0.31, LLM judges r = 0.74–0.82
- **[table-metric-study](https://github.com/phorn1/table-metric-study)**: 1,500+ human ratings; rule-based metrics (TEDS, GriTS) top out at r = 0.70, LLM judges r = 0.94

## Citation

```bibtex
@misc{horn2025formulabench,
  title = {Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs},
  author = {Horn, Pius and Keuper, Janis},
  year = {2025},
  eprint = {2512.09874},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2512.09874}
}

@misc{horn2026tablebench,
  title = {Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation},
  author = {Horn, Pius and Keuper, Janis},
  year = {2026},
  eprint = {2603.18652},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2603.18652}
}
```

## Acknowledgments

This work has been supported by the German Federal Ministry of Research, Technology and Space (BMFTR) in the program "Forschung an Fachhochschulen in Kooperation mit Unternehmen (FH-Kooperativ)" within the joint project **LLMpraxis** under grant 13FH622KX2.

<p align="center">
  <img src="https://raw.githubusercontent.com/phorn1/pdf-parse-bench/main/assets/BMFTR_logo.png" alt="BMFTR" width="150" />
  <img src="https://raw.githubusercontent.com/phorn1/pdf-parse-bench/main/assets/HAW_logo.png" alt="HAW" width="150" />
</p>
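
The agreement figures quoted under "Why LLM-as-a-Judge?" are correlation coefficients between a metric's scores and human ratings. The sketch below shows how such a value could be computed, assuming that r denotes the Pearson coefficient (the card does not say which coefficient is used). The ratings and metric scores are made-up toy numbers, not data from either study, and `pearson_r` is a hypothetical helper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares (the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical): human ratings and two candidate metrics.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_aligned = [2.0, 4.0, 6.0, 8.0, 10.0]  # rises with the human rating
metric_inverted = [5.0, 4.0, 3.0, 2.0, 1.0]  # falls as the human rating rises

if __name__ == "__main__":
    print(f"aligned metric:  r = {pearson_r(human, metric_aligned):+.2f}")
    print(f"inverted metric: r = {pearson_r(human, metric_inverted):+.2f}")
```

A value near 1 means the metric ranks outputs much as the annotators do, and a value near 0, like the text metrics in the formula study, means it carries almost no information about human judgment.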
Provider:
piushorn