Lukaszl/pl-government-docs-mix-ocr-dataset-v1-results
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Lukaszl/pl-government-docs-mix-ocr-dataset-v1-results
Description:
---
license: mit
tags:
- ocr-bench
- leaderboard
source_datasets:
- Lukaszl/pl-government-docs-mix-ocr-dataset-v1
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
- config_name: comparisons
  data_files:
  - split: train
    path: comparisons/train-*.parquet
- config_name: leaderboard
  data_files:
  - split: train
    path: leaderboard/train-*.parquet
- config_name: metadata
  data_files:
  - split: train
    path: metadata/train-*.parquet
---
# OCR Bench Results: Polish government documents benchmark
VLM-as-judge pairwise evaluation of OCR models on a dataset of **real Polish government and public administration documents**.
This benchmark focuses on structured, text-heavy documents typical for public institutions, including official forms, templates, administrative documents, and scanned materials.
As with all OCR benchmarks, results are **document-type specific** and should not be interpreted as a universal ranking across all OCR use cases.
## Leaderboard
| Rank | Model | Params | ELO | 95% CI | Wins | Losses | Ties | Win% |
|------|-------|--------|-----|--------|------|--------|------|------|
| 1 | clearocr.com/clearocr-api | | 1702 | 1676–1733 | 403 | 103 | 1 | 79% |
| 2 | deepseek-ai/DeepSeek-OCR | 4B | 1545 | 1519–1573 | 283 | 222 | 1 | 56% |
| 3 | lightonai/LightOnOCR-2-1B | 1B | 1522 | 1496–1549 | 265 | 241 | 0 | 52% |
| 4 | zai-org/GLM-OCR | 0.9B | 1231 | 1193–1261 | 61 | 446 | 0 | 12% |
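
As a quick sanity check on the table, the Win% column and the total number of comparisons can be recomputed from the Wins/Losses/Ties columns. The sketch below assumes that Win% is wins divided by all of a model's comparisons (ties included) and that each pairwise comparison is counted once for each of the two models involved. Both are assumptions about how the table was produced, although they reproduce the published figures.

```python
# Wins, losses, ties copied from the leaderboard table above.
rows = {
    "clearocr.com/clearocr-api": (403, 103, 1),
    "deepseek-ai/DeepSeek-OCR": (283, 222, 1),
    "lightonai/LightOnOCR-2-1B": (265, 241, 0),
    "zai-org/GLM-OCR": (61, 446, 0),
}

# Assumption: Win% = wins / (wins + losses + ties), rounded to a whole percent.
win_pct = {
    model: round(100 * wins / (wins + losses + ties))
    for model, (wins, losses, ties) in rows.items()
}

# Each comparison involves two models, so it appears in two rows.
total_comparisons = sum(w + l + t for w, l, t in rows.values()) // 2

for model, pct in win_pct.items():
    print(f"{model}: {pct}%")
print("total comparisons:", total_comparisons)
```
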
---
## Interpretation
On this dataset, **clearocr.com/clearocr-api** ranks first by a **clear margin**: its 95% confidence interval (1676–1733) lies entirely above those of both DeepSeek-OCR (1519–1573) and LightOnOCR-2-1B (1496–1549).
DeepSeek-OCR and LightOnOCR-2-1B form a **close second tier**, with overlapping confidence intervals indicating no clear separation between them.
GLM-OCR ranks **substantially lower** on this type of document, winning 12% of its comparisons.
This result reflects consistent performance differences observed across **hundreds of pairwise comparisons**, rather than single-batch fluctuations.
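
The tiering described above can be reproduced by checking which 95% confidence intervals overlap. This is a minimal sketch using the intervals from the leaderboard. Treating non-overlapping intervals as "clearly separated" is a conservative heuristic, not a formal significance test, and the helper name `overlaps` is illustrative.

```python
# 95% confidence intervals (lower, upper) copied from the leaderboard.
ci = {
    "clearocr-api": (1676, 1733),
    "DeepSeek-OCR": (1519, 1573),
    "LightOnOCR-2-1B": (1496, 1549),
    "GLM-OCR": (1193, 1261),
}

def overlaps(a, b):
    """Return True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Compare each model with the one ranked directly below it.
ranked = list(ci)
separation = {
    (upper, lower): not overlaps(ci[upper], ci[lower])
    for upper, lower in zip(ranked, ranked[1:])
}

for (upper, lower), separated in separation.items():
    status = "separated" if separated else "overlapping"
    print(f"{upper} vs {lower}: {status}")
```
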
---
## Dataset characteristics
- **Language**: Polish
- **Domain**: Government / public administration documents
- **Content**:
  - official forms
  - administrative templates
  - structured text documents
  - scanned public records
- **Challenges**:
  - dense text layouts
  - tables and structured sections
  - stamps, signatures, and overlays
  - varying scan quality
---
## Details
- **Task**: OCR (Optical Character Recognition)
- **Original dataset**: [`Lukaszl/pl-government-docs-mix-ocr-dataset`](https://huggingface.co/datasets/Lukaszl/pl-government-docs-mix-ocr-dataset)
- **Benchmark dataset**: [`Lukaszl/pl-government-docs-mix-ocr-dataset-v1`](https://huggingface.co/datasets/Lukaszl/pl-government-docs-mix-ocr-dataset-v1)
- **Judge**: Qwen3.5-35B-A3B
- **Comparisons**: 1013
- **Method**: Bradley-Terry MLE with bootstrap 95% confidence intervals
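
For readers unfamiliar with the method, the following is a minimal, standard-library sketch of a Bradley-Terry fit with percentile-bootstrap confidence intervals. It is not the ocr-bench implementation. The tie handling (half a win to each side), the small smoothing prior, the Elo-style scale (base 1500, 400 points per factor of 10), and the toy comparison data are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

def fit_bradley_terry(comparisons, models, iters=200, prior=0.1):
    """Fit strengths with the classic MM update.

    comparisons: list of (a, b, winner), where winner is a, b, or None (tie).
    """
    wins = defaultdict(float)  # wins[(i, j)] = (fractional) wins of i over j
    for a, b, winner in comparisons:
        if winner is None:  # assumption: a tie counts as half a win each
            wins[(a, b)] += 0.5
            wins[(b, a)] += 0.5
        else:
            loser = b if winner == a else a
            wins[(winner, loser)] += 1.0
    # Assumption: a small symmetric prior keeps strengths finite when a
    # model happens to have zero wins in a bootstrap resample.
    for i in models:
        for j in models:
            if i != j:
                wins[(i, j)] += prior
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            w_i = sum(wins[(i, j)] for j in models if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = w_i / denom
        # Normalize so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in new_p.values()) / len(models)
        p = {m: v / math.exp(log_mean) for m, v in new_p.items()}
    return p

def to_elo(p, base=1500.0, scale=400.0):
    """Map strengths to an Elo-like scale (assumed convention)."""
    return {m: base + scale * math.log10(v) for m, v in p.items()}

def bootstrap_ci(comparisons, models, n_boot=200, seed=0):
    """Percentile bootstrap: resample comparisons with replacement and refit."""
    rng = random.Random(seed)
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        resample = rng.choices(comparisons, k=len(comparisons))
        elo_b = to_elo(fit_bradley_terry(resample, models))
        for m in models:
            samples[m].append(elo_b[m])
    out = {}
    for m in models:
        s = sorted(samples[m])
        out[m] = (s[int(0.025 * n_boot)], s[int(0.975 * n_boot) - 1])
    return out

# Hypothetical toy data: A beats B 8/10, B beats C 7/10, A beats C 9/10.
models = ["A", "B", "C"]
toy = (
    [("A", "B", "A")] * 8 + [("A", "B", "B")] * 2
    + [("B", "C", "B")] * 7 + [("B", "C", "C")] * 3
    + [("A", "C", "A")] * 9 + [("A", "C", "C")] * 1
)
elo = to_elo(fit_bradley_terry(toy, models))
ci = bootstrap_ci(toy, models)
for m in models:
    print(f"{m}: {elo[m]:.0f}  95% CI {ci[m][0]:.0f} to {ci[m][1]:.0f}")
```

Resampling whole comparisons, rather than per-model win counts, keeps the pairing between the two models in each judgment intact, which is what lets the interval reflect uncertainty in the judge's decisions.
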
---
## About clearOCR
[clearOCR](https://clearocr.com) is an OCR API designed for extracting text from PDFs, scans, and document images, with a strong focus on **Polish and English documents** and real-world document layouts.
New accounts currently receive:
- **1,000 free single-image OCR runs**
- valid for **30 days**
API access:
https://clearocr.com
---
## Configs
- `load_dataset("Lukaszl/pl-government-docs-mix-ocr-dataset-v1-results")` — leaderboard table
- `load_dataset("Lukaszl/pl-government-docs-mix-ocr-dataset-v1-results", name="comparisons")` — full pairwise comparison log
- `load_dataset("Lukaszl/pl-government-docs-mix-ocr-dataset-v1-results", name="metadata")` — evaluation run history
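
A small wrapper around the calls above, as a sketch: it assumes the Hugging Face `datasets` library is installed and that the four config names match the card's front matter (`default`, `comparisons`, `leaderboard`, `metadata`). The helper name `load_results` is hypothetical, and calling it requires network access.

```python
REPO_ID = "Lukaszl/pl-government-docs-mix-ocr-dataset-v1-results"

# Config names as declared in the dataset card's YAML front matter.
CONFIGS = ("default", "comparisons", "leaderboard", "metadata")

def load_results(config="default"):
    """Load the train split of one config from the results repository."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, name=config, split="train")
```

For example, `load_results("comparisons").to_pandas()` would return the pairwise comparison log as a pandas DataFrame.
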
*Generated by [ocr-bench](https://github.com/davanstrien/ocr-bench)*
Provider:
Lukaszl



