arobin79/bangla-ocr-validation_data_printed
Source: Hugging Face | Updated: 2026-04-19 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/arobin79/bangla-ocr-validation_data_printed
Resource description:
---
language:
- bn
task_categories:
- image-to-text
tags:
- ocr
- bangla
- document-understanding
- vision-language
---
# Bangla OCR Validation Dataset (Printed + Scanned)
## 📌 Description
This is a validation dataset for Bangla OCR. It pairs printed and scanned Bangla document images with their ground-truth text annotations, and is designed to evaluate OCR and vision-language models on both clean digital text and noisy scanned documents.
## 📊 Dataset Composition
- 1507 **line-level images** with text annotations
- 50 **full-page document images** with text
- Data includes:
  - Printed/typed Bangla text (clean)
  - Scanned document images (noisy, real-world)
## 🧾 Features
Each sample contains:
- `image`: Input image (line or page)
- `text`: Ground-truth Bangla transcription
- `type`: Indicates data type (`line` or `page`)
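Because line-level and page-level samples share one schema, it can help to separate them by the `type` field before evaluating. The sketch below is a minimal illustration, assuming each sample behaves like a dictionary with the fields listed above. The helper name `group_by_type` and the toy records are hypothetical and not part of the dataset.

```python
from collections import defaultdict


def group_by_type(samples):
    """Group samples by their `type` field ("line" or "page")."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["type"]].append(sample)
    return dict(groups)


# Hypothetical stand-in records; real samples also carry an `image` field.
toy_samples = [
    {"text": "বাংলা", "type": "line"},
    {"text": "বাংলা ভাষা", "type": "page"},
    {"text": "ভাষা", "type": "line"},
]

groups = group_by_type(toy_samples)
print({name: len(items) for name, items in groups.items()})
```

When working with a loaded `datasets.Dataset` directly, `dataset.filter(lambda s: s["type"] == "line")` should give a similar result without materializing Python lists.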
## 🎯 Use Cases
- Bangla OCR evaluation
- Document understanding
- Vision-language model validation
- Robustness testing (clean vs scanned)
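The card does not prescribe an evaluation metric. One common choice for OCR is character error rate (CER): the edit distance between prediction and ground truth, divided by the ground-truth length. The sketch below is an illustration under that assumption. The names `edit_distance`, `cer`, `mean_cer_by_type`, and the `predict` callable are hypothetical. Note that it counts Unicode code points, so a Bangla conjunct or vowel sign made of several code points counts as several characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def mean_cer_by_type(samples, predict):
    """Average CER per `type`; `predict` maps a sample to predicted text."""
    totals = {}
    for sample in samples:
        score = cer(sample["text"], predict(sample))
        total, count = totals.get(sample["type"], (0.0, 0))
        totals[sample["type"]] = (total + score, count + 1)
    return {t: total / count for t, (total, count) in totals.items()}
```

Reporting the average separately for `line` and `page` samples keeps the 50 full-page images from being swamped by the 1507 line images.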
## 🚀 Usage
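```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
dataset = load_dataset("arobin79/bangla-ocr-validation_data_printed")

# Take the first sample of the "train" split
sample = dataset["train"][0]
print(sample["text"])    # ground-truth Bangla transcription
sample["image"].show()   # open the image in the default viewer
```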
Provider:
arobin79



