shadid113/ACI-ocr-benchmark-EN
Source: Hugging Face. Updated 2026-04-06. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/shadid113/ACI-ocr-benchmark-EN
Resource description:
---
license: other
license_name: research-only
language:
- en
task_categories:
- image-to-text
tags:
- ocr
- handwriting-recognition
- document-understanding
- vlm
- benchmark
pretty_name: OCR Evaluation Benchmark
size_categories:
- 1K<n<10K
---
# OCR Evaluation Benchmark
Unified evaluation dataset for **continual learning OCR in Vision-Language Models**.
## Splits
| Split | Category | Level | Samples | Source |
|---|---|---|---|---|
| `english_handwritten_line` | English Handwritten | Line | 1,500 | IAM Handwriting Database |
| `english_handwritten_page` | English Handwritten | Page | 50 | IAM (pseudo-pages) |
| `english_printed_line` | English Printed | Line | ~1,500 | OmniDocBench v1.5 |
| `english_printed_page` | English Printed | Page | 50 | OmniDocBench v1.5 |
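For scripted runs it can help to have the table above as data. The following is a minimal sketch, not part of the dataset's own tooling: `SPLITS`, `splits_for`, and `load_all_splits` are hypothetical names, and the count for `english_printed_line` is approximate because the table lists it as "~1,500".

```python
# Split names and sample counts copied from the Splits table.
# english_printed_line is listed as "~1,500", so treat its count as approximate.
SPLITS = {
    "english_handwritten_line": 1500,
    "english_handwritten_page": 50,
    "english_printed_line": 1500,  # approximate
    "english_printed_page": 50,
}


def splits_for(level):
    """Return the split names at a given level ("line" or "page")."""
    return [name for name in SPLITS if name.endswith("_" + level)]


def load_all_splits(repo_id="shadid113/ACI-ocr-benchmark-EN"):
    """Load every split into a dict keyed by split name (needs network access)."""
    # Imported lazily so the table above stays usable without `datasets` installed.
    from datasets import load_dataset

    return {name: load_dataset(repo_id, split=name) for name in SPLITS}
```

Calling `splits_for("page")` selects the two 50-sample page-level splits, which is convenient for a quick smoke test before running the larger line-level splits.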
## Usage
```python
from datasets import load_dataset
# Load a specific split
ds = load_dataset("shadid113/ACI-ocr-benchmark-EN", split="english_handwritten_line")
# Each sample has:
# image: PIL Image
# text: ground truth transcription
for sample in ds:
    img = sample["image"]
    text = sample["text"]
```
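This card does not name an evaluation metric. Character error rate (CER) is a common choice for OCR, so the sketch below assumes it. The helpers `edit_distance`, `cer`, and `corpus_cer` are illustrative names, not part of the dataset or of the `datasets` library, and the handling of an empty reference is one convention among several.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of one prediction against its ground truth."""
    if not reference:
        # Assumed convention: an empty reference scores 0 only for an empty prediction.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def corpus_cer(pairs):
    """Micro-averaged CER over (reference, hypothesis) pairs."""
    pairs = list(pairs)
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_chars if total_chars else 0.0
```

With a loaded split, the pairs would come from `sample["text"]` and the output of whatever model is under test, for example `corpus_cer((s["text"], predict(s["image"])) for s in ds)`, where `predict` is a placeholder for your own inference function. Micro-averaging weights long lines and pages more heavily than short ones. Averaging per-sample `cer` values instead gives every sample equal weight.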
## Columns
- **image**: the document/line image (rendered in the viewer)
- **text**: ground truth transcription
- **id**: unique sample identifier
- **source_dataset**: original dataset (IAM / OmniDocBench)
- **source_id**: identifier in the original dataset
- **document_type**: document category (for English Printed)
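The metadata columns make it possible to break a split down by origin or document category. The sketch below assumes each row behaves like a plain dict keyed by the column names listed above. `count_by` and `filter_by` are hypothetical helpers, not part of the `datasets` API.

```python
from collections import Counter


def count_by(samples, column):
    """Count samples per value of a metadata column, e.g. "source_dataset"."""
    return Counter(sample.get(column) for sample in samples)


def filter_by(samples, column, value):
    """Keep only the samples whose metadata column equals the given value."""
    return [sample for sample in samples if sample.get(column) == value]
```

Iterating over a full split decodes every image, which is slow for a pure metadata pass. With the `datasets` library, passing `ds.select_columns(["source_dataset"])` to `count_by` avoids loading the `image` column.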
## Sources & Licenses
- **IAM Handwriting Database** — free for non-commercial research
- **OmniDocBench v1.5** — research purposes only
## EDA
See the `benchmark_eda/` folder for exploratory data analysis figures and report.
Provided by:
shadid113



