Chanrith123333/khmer_english_ocr_image_line

Name: Chanrith123333/khmer_english_ocr_image_line
Creator: Chanrith123333
Published: 2026-01-22 00:54:34
License: No description available

Source: Hugging Face. Updated 2026-01-22; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Chanrith123333/khmer_english_ocr_image_line

Resource description:

---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- km
- en
pretty_name: Khmer English OCR image line
size_categories:
- 10M<n<100M
---

# Khmer English OCR image line Dataset 📄

A large-scale synthetic dataset for training OCR models on **Khmer and English** text. This dataset contains **12 million** high-quality synthetic images of text lines.

## 🎯 Dataset Overview

- **Total Images**: 12,138,214
- **Languages**: Khmer, English, and mixed
- **Format**: Image-text pairs
- **Use Case**: OCR model training

## 📋 Data Fields

- **image**: PIL Image of the text line
- **text**: Ground truth text string

## 💾 Usage

### Load with Hugging Face

```python
from datasets import load_dataset

dataset = load_dataset("mrrtmob/km_en_image_line")

# Access an example
example = dataset['train'][0]
image = example['image']  # PIL Image
text = example['text']    # str
```

### Train with Kiri OCR

```bash
kiri-ocr train \
  --hf-dataset mrrtmob/km_en_image_line \
  --epochs 50 \
  --batch-size 32
```

## 🎨 Dataset Features

- Multiple Khmer and English fonts
- Realistic augmentations (noise, blur, rotation)
- Variable text lengths (5-100 characters)

## 📚 Citation

```bibtex
@dataset{khmer_english_ocr_image_line,
  author = {mrrtmob},
  title = {Khmer English OCR image line Dataset},
  year = {2026},
  publisher = {Blizzer},
  howpublished = {\url{https://huggingface.co/datasets/mrrtmob/khmer_english_ocr_image_line}}
}
```

## ⚖️ License

CC BY 4.0

## 🔗 Related

- **Kiri OCR Library**: [github.com/mrrtmob/kiri-ocr](https://github.com/mrrtmob/kiri-ocr)

## ☕ Support

- [Buy Me a Coffee](https://buymeacoffee.com/tmob)
- [ABA Payway](https://link.payway.com.kh/ABAPAYfd4073965)
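
The dataset card lists Khmer, English, and mixed lines, but the documented fields (`image`, `text`) include no language label. If a per-language split is needed, one option is to infer it from the `text` string. The following is a minimal sketch, not part of the dataset or of Kiri OCR: `classify_script` is a hypothetical helper, and it assumes that any character in the Unicode Khmer blocks (U+1780 to U+17FF, U+19E0 to U+19FF) marks Khmer text and any ASCII letter marks English text.

```python
from collections import Counter


def classify_script(text: str) -> str:
    """Label a text line as 'khmer', 'english', 'mixed', or 'other'.

    Hypothetical helper: the rule below is an assumption, not a dataset field.
    """
    # Khmer block (U+1780..U+17FF) and Khmer Symbols block (U+19E0..U+19FF)
    has_khmer = any(
        "\u1780" <= ch <= "\u17ff" or "\u19e0" <= ch <= "\u19ff" for ch in text
    )
    # ASCII letters only; digits and punctuation are treated as script-neutral
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)

    if has_khmer and has_latin:
        return "mixed"
    if has_khmer:
        return "khmer"
    if has_latin:
        return "english"
    return "other"


def count_scripts(texts) -> Counter:
    """Tally script labels over an iterable of text strings."""
    return Counter(classify_script(t) for t in texts)
```

With a loaded example, `classify_script(example['text'])` would give the label for that line. Lines made up only of digits or punctuation fall into `other`, so they are not silently counted as either language.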

Provider:

Chanrith123333
