Chanrith123333/khmer_english_ocr_image_line
Source: Hugging Face. Updated 2026-01-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Chanrith123333/khmer_english_ocr_image_line
Resource description:
---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- km
- en
pretty_name: Khmer English OCR image line
size_categories:
- 10M<n<100M
---
# Khmer English OCR image line Dataset 📄
A large-scale synthetic dataset for training OCR models on **Khmer and English** text. It contains about **12.1 million** synthetic images of single text lines.
## 🎯 Dataset Overview
- **Total Images**: 12,138,214
- **Languages**: Khmer, English, and mixed
- **Format**: Image-text pairs
- **Use Case**: OCR model training
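The card lists Khmer, English, and mixed lines but does not describe a language column, so a rough split can be derived from the `text` field itself. The sketch below is a hypothetical helper (`classify_script` is not part of the dataset or of Kiri OCR); it assumes that Khmer text falls in the Unicode Khmer blocks (U+1780 to U+17FF and U+19E0 to U+19FF) and that English text uses ASCII letters.

```python
def classify_script(text: str) -> str:
    """Label a text line as 'khmer', 'english', 'mixed', or 'other'.

    Hypothetical helper: the dataset card does not define these labels,
    so the rule here is an assumption based on Unicode ranges.
    """
    # Khmer block (U+1780..U+17FF) and Khmer Symbols block (U+19E0..U+19FF)
    has_khmer = any(
        "\u1780" <= ch <= "\u17ff" or "\u19e0" <= ch <= "\u19ff" for ch in text
    )
    # ASCII letters only; digits and punctuation are shared by both scripts
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)

    if has_khmer and has_latin:
        return "mixed"
    if has_khmer:
        return "khmer"
    if has_latin:
        return "english"
    return "other"  # e.g. lines made only of digits or punctuation
```

Applied to `example['text']`, this gives a quick way to check the language balance of a sample before training.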
## 📋 Data Fields
- **image**: PIL Image of the text line
- **text**: Ground truth text string
## 💾 Usage
### Load with Hugging Face
```python
from datasets import load_dataset
dataset = load_dataset("mrrtmob/km_en_image_line")
# Access an example
example = dataset['train'][0]
image = example['image'] # PIL Image
text = example['text'] # str
```
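With more than 12 million images, it may be more practical to stream the data than to download it in full. The following is a minimal sketch, assuming the repository id from the example above and a `train` split; `open_stream` and `preview` are hypothetical helper names, and `streaming=True` is the standard `datasets` option for iterating without a full download.

```python
from itertools import islice


def open_stream(repo_id="mrrtmob/km_en_image_line", split="train"):
    """Open the dataset in streaming mode (repo id and split are assumptions)."""
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`

    return load_dataset(repo_id, split=split, streaming=True)


def preview(records, n=3):
    """Return (text, image size) for the first n records.

    Works on any iterable of dicts that have the card's two fields:
    'image' (a PIL Image, which exposes .size) and 'text' (a str).
    """
    return [(rec["text"], rec["image"].size) for rec in islice(records, n)]


if __name__ == "__main__":
    # Fetches only the first few records over the network.
    for text, size in preview(open_stream(), n=3):
        print(size, text)
```

Because `preview` only relies on the `image` and `text` fields, it works the same way on a fully downloaded split.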
### Train with Kiri OCR
```bash
kiri-ocr train \
--hf-dataset mrrtmob/km_en_image_line \
--epochs 50 \
--batch-size 32
```
## 🎨 Dataset Features
- Multiple Khmer and English fonts
- Realistic augmentations (noise, blur, rotation)
- Variable text lengths (5-100 characters)
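The stated length range can be checked on a sample of labels. This is a sketch under one assumption: the card does not say how a "character" is counted, so the code below counts Unicode code points via `len()`. A single Khmer syllable can span several code points, so counts by visible glyph would be lower. Both function names are hypothetical.

```python
def length_in_range(text: str, lo: int = 5, hi: int = 100) -> bool:
    """True if the label has between lo and hi code points, inclusive."""
    # len() on a str counts code points, not visible glyph clusters (assumption
    # about what the card means by "characters").
    return lo <= len(text) <= hi


def out_of_range(texts, lo: int = 5, hi: int = 100):
    """Collect the labels that fall outside the expected length range."""
    return [t for t in texts if not length_in_range(t, lo, hi)]
```

Running `out_of_range` over a few thousand `text` values is a cheap sanity check before choosing a maximum sequence length for a model.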
## 📚 Citation
```bibtex
@dataset{khmer_english_ocr_image_line,
author = {mrrtmob},
title = {Khmer English OCR image line Dataset},
year = {2026},
publisher = {Blizzer},
howpublished = {\url{https://huggingface.co/datasets/mrrtmob/khmer_english_ocr_image_line}}
}
```
## ⚖️ License
CC BY 4.0
## 🔗 Related
- **Kiri OCR Library**: [github.com/mrrtmob/kiri-ocr](https://github.com/mrrtmob/kiri-ocr)
## ☕ Support
- [Buy Me a Coffee](https://buymeacoffee.com/tmob)
- [ABA Payway](https://link.payway.com.kh/ABAPAYfd4073965)
Provider:
Chanrith123333



