
Lukaszl/pl-mixed-docs-ocr-dataset-100-v1-results

Source: Hugging Face. Updated: 2026-03-31. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Lukaszl/pl-mixed-docs-ocr-dataset-100-v1-results
Description:
---
license: mit
tags:
- ocr-bench
- leaderboard
source_datasets:
- Lukaszl/pl-mixed-docs-ocr-dataset-100-v1
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
- config_name: comparisons
  data_files:
  - split: train
    path: comparisons/train-*.parquet
- config_name: leaderboard
  data_files:
  - split: train
    path: leaderboard/train-*.parquet
- config_name: metadata
  data_files:
  - split: train
    path: metadata/train-*.parquet
---

# OCR Bench Results: Polish mixed documents benchmark

VLM-as-judge pairwise evaluation of OCR models on a small heterogeneous sample of Polish document-style images. Rankings depend strongly on document type, so this should be read as a document-specific OCR benchmark rather than a universal OCR ranking.

This benchmark uses a lightweight 100-image Polish OCR sample covering mixed document categories such as official forms, templates, certificates, structured layouts, invoices, and document-style graphics.

This is **not** a hard-case OCR benchmark focused on mobile photos, severe blur, heavy rotation, handwriting or strongly degraded scans. Instead, it measures OCR performance on a broad mix of more standard Polish document-like images with varied layouts and formatting.

## Leaderboard

| Rank | Model | Params | ELO | 95% CI | Wins | Losses | Ties | Win% |
|------|-------|--------|-----|--------|------|--------|------|------|
| 1 | clearocr.com/clearocr-api | | 1735 | 1704–1773 | 326 | 53 | 21 | 82% |
| 2 | rednote-hilab/dots.ocr | 1.7B | 1491 | 1462–1521 | 185 | 190 | 25 | 46% |
| 3 | zai-org/GLM-OCR | 0.9B | 1451 | 1422–1480 | 159 | 215 | 26 | 40% |
| 4 | lightonai/LightOnOCR-2-1B | 1B | 1451 | 1424–1481 | 167 | 224 | 9 | 42% |
| 5 | FireRedTeam/FireRed-OCR | 2.1B | 1373 | 1342–1403 | 119 | 274 | 7 | 30% |

## Interpretation

On this dataset, **clearocr.com/clearocr-api** ranked first by a clear margin: its ELO of 1735 is 244 points above the second-placed model, and its 95% CI does not overlap with any other model's. The remaining four models were much closer to each other, spanning 1373 to 1491 ELO, with GLM-OCR and LightOnOCR-2-1B tied at 1451. clearOCR was therefore the strongest system on this mixed Polish document sample.

## Details

- **Task**: OCR (Optical Character Recognition)
- **Language**: Polish
- **Document type**: Mixed Polish document-style images
- **Original upstream dataset**: [`No240we1/polish_documents`](https://huggingface.co/datasets/No240we1/polish_documents)
- **Source dataset**: [`Lukaszl/pl-mixed-docs-ocr-dataset-100-v1`](https://huggingface.co/datasets/Lukaszl/pl-mixed-docs-ocr-dataset-100-v1)
- **Judge**: Qwen3.5-35B-A3B
- **Comparisons**: 1000
- **Method**: Bradley-Terry MLE with bootstrap 95% CIs

## About clearOCR

[clearOCR](https://clearocr.com) is an OCR API for extracting text from PDFs, scans and document images, with a strong focus on **Polish and English documents**.

New accounts currently receive:

- **1,000 free single-image OCR runs**
- valid for **30 days**

API access is available via the clearOCR website: https://clearocr.com

## Configs

- `load_dataset("Lukaszl/pl-mixed-docs-ocr-dataset-100-v1-results")`: leaderboard table
- `load_dataset("Lukaszl/pl-mixed-docs-ocr-dataset-100-v1-results", name="comparisons")`: full pairwise comparison log
- `load_dataset("Lukaszl/pl-mixed-docs-ocr-dataset-100-v1-results", name="metadata")`: evaluation run history

*Generated by [ocr-bench](https://github.com/davanstrien/ocr-bench)*
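The Win% column and the Comparisons figure can be cross-checked against the win, loss and tie counts in the leaderboard. The sketch below assumes, based only on the published numbers, that Win% is wins divided by all of a model's comparisons (ties included in the denominator, rounded to the nearest percent) and that each comparison is counted once for each of the two models involved. The card does not state either definition, and the function names are illustrative.

```python
# Leaderboard rows copied from the table: (model, wins, losses, ties).
ROWS = [
    ("clearocr.com/clearocr-api", 326, 53, 21),
    ("rednote-hilab/dots.ocr", 185, 190, 25),
    ("zai-org/GLM-OCR", 159, 215, 26),
    ("lightonai/LightOnOCR-2-1B", 167, 224, 9),
    ("FireRedTeam/FireRed-OCR", 119, 274, 7),
]


def win_rate(wins, losses, ties):
    """Fraction of a model's comparisons won outright (assumed definition)."""
    return wins / (wins + losses + ties)


def win_percent(wins, losses, ties):
    """Win rate as a whole percent, rounded half up with integer arithmetic."""
    games = wins + losses + ties
    return (200 * wins + games) // (2 * games)


def total_comparisons(rows):
    """Each comparison shows up in two rows, so halve the summed game counts."""
    return sum(w + l + t for _, w, l, t in rows) // 2


if __name__ == "__main__":
    for model, w, l, t in ROWS:
        print(f"{model}: {win_percent(w, l, t)}%")
    print("comparisons:", total_comparisons(ROWS))
```

Under these assumptions the percentages come out as 82, 46, 40, 42 and 30, and the total is 1000, which agrees with the table and with the Comparisons entry in Details.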
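The Details section names the method as Bradley-Terry MLE with bootstrap 95% CIs. The following is a minimal standard-library sketch of that general technique, not the ocr-bench implementation. The MM update, the treatment of a tie as half a win for each side, the 1500-centred `400 * log10` rating scale and the percentile bootstrap are all assumptions of this example, and the function names and toy data are hypothetical.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200):
    """Return ELO-like ratings from (model_a, model_b, outcome) tuples.

    outcome is "a", "b" or "tie". A tie counts as half a win for each
    side, which is an assumption of this sketch.
    """
    wins = defaultdict(float)   # total (possibly fractional) wins per model
    games = defaultdict(float)  # number of comparisons per ordered pair
    for a, b, outcome in comparisons:
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
        wins[a] += {"a": 1.0, "b": 0.0}.get(outcome, 0.5)
        wins[b] += {"a": 0.0, "b": 1.0}.get(outcome, 0.5)
    models = sorted(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games[(i, j)] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            # Floor the wins so a winless model keeps a finite rating.
            updated[i] = max(wins[i], 1e-9) / denom
        # Fix the scale: normalise so the geometric mean strength is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return {m: 1500.0 + 400.0 * math.log10(s) for m, s in strength.items()}


def bootstrap_ci(comparisons, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap: resample comparisons with replacement and refit."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = rng.choices(comparisons, k=len(comparisons))
        for model, rating in fit_bradley_terry(resample).items():
            samples[model].append(rating)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        lo = values[int((alpha / 2) * (len(values) - 1))]
        hi = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[model] = (lo, hi)
    return intervals


if __name__ == "__main__":
    # Hypothetical toy log: A beats B 8/10, B beats C 7/10, A beats C 9/10.
    toy = (
        [("A", "B", "a")] * 8 + [("A", "B", "b")] * 2
        + [("B", "C", "a")] * 7 + [("B", "C", "b")] * 3
        + [("A", "C", "a")] * 9 + [("A", "C", "b")] * 1
    )
    ratings = fit_bradley_terry(toy)
    intervals = bootstrap_ci(toy)
    for model in sorted(ratings, key=ratings.get, reverse=True):
        lo, hi = intervals[model]
        print(f"{model}: {ratings[model]:.0f} ({lo:.0f} to {hi:.0f})")
```

To try this on the real data, the rows of the `comparisons` config would first need to be mapped to `(model_a, model_b, outcome)` tuples. The column names of that config are not given in the card, so that mapping is left out here.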

Provider:
Lukaszl