
Chanrith123333/khmer_english_ocr_image_line

Hugging Face | Updated 2026-01-22 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Chanrith123333/khmer_english_ocr_image_line
Description:
---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- km
- en
pretty_name: Khmer English OCR image line
size_categories:
- 10M<n<100M
---

# Khmer English OCR image line Dataset 📄

A large-scale synthetic dataset for training OCR models on **Khmer and English** text. This dataset contains **12 million** high-quality synthetic images of text lines.

## 🎯 Dataset Overview

- **Total Images**: 12,138,214
- **Languages**: Khmer, English, and mixed
- **Format**: Image-text pairs
- **Use Case**: OCR model training

## 📋 Data Fields

- **image**: PIL Image of the text line
- **text**: Ground truth text string

## 💾 Usage

### Load with Hugging Face

```python
from datasets import load_dataset

dataset = load_dataset("mrrtmob/km_en_image_line")

# Access an example
example = dataset['train'][0]
image = example['image']  # PIL Image
text = example['text']    # str
```

### Train with Kiri OCR

```bash
kiri-ocr train \
  --hf-dataset mrrtmob/km_en_image_line \
  --epochs 50 \
  --batch-size 32
```

## 🎨 Dataset Features

- Multiple Khmer and English fonts
- Realistic augmentations (noise, blur, rotation)
- Variable text lengths (5-100 characters)

## 📚 Citation

```bibtex
@dataset{khmer_english_ocr_image_line,
  author = {mrrtmob},
  title = {Khmer English OCR image line Dataset},
  year = {2026},
  publisher = {Blizzer},
  howpublished = {\url{https://huggingface.co/datasets/mrrtmob/khmer_english_ocr_image_line}}
}
```

## ⚖️ License

CC BY 4.0

## 🔗 Related

- **Kiri OCR Library**: [github.com/mrrtmob/kiri-ocr](https://github.com/mrrtmob/kiri-ocr)

## ☕ Support

- [Buy Me a Coffee](https://buymeacoffee.com/tmob)
- [ABA Payway](https://link.payway.com.kh/ABAPAYfd4073965)
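
## 🧪 Example: Splitting Examples by Language

The card lists Khmer, English, and mixed text but does not describe a language field, so one plausible way to separate the three is to inspect the `text` string directly. The sketch below is an assumption, not part of the dataset or of Kiri OCR: the helper names `classify_script` and `count_scripts` are hypothetical, and the classification relies only on the Unicode Khmer blocks (U+1780 to U+17FF and U+19E0 to U+19FF) and ASCII letters.

```python
from collections import Counter

# Unicode ranges for the Khmer block and the Khmer Symbols block.
KHMER_RANGES = ((0x1780, 0x17FF), (0x19E0, 0x19FF))


def is_khmer(ch: str) -> bool:
    """Return True if the character falls in a Khmer Unicode block."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in KHMER_RANGES)


def is_latin_letter(ch: str) -> bool:
    """Return True for ASCII letters (a rough proxy for English text)."""
    return ch.isascii() and ch.isalpha()


def classify_script(text: str) -> str:
    """Hypothetical helper: label a line as khmer, english, mixed, or other."""
    has_khmer = any(is_khmer(ch) for ch in text)
    has_latin = any(is_latin_letter(ch) for ch in text)
    if has_khmer and has_latin:
        return "mixed"
    if has_khmer:
        return "khmer"
    if has_latin:
        return "english"
    return "other"  # digits, punctuation, or empty strings


def count_scripts(examples) -> Counter:
    """Count labels over an iterable of dicts that carry a 'text' key."""
    return Counter(classify_script(ex["text"]) for ex in examples)


if __name__ == "__main__":
    # Stand-in rows shaped like the dataset's examples (image omitted).
    sample = [
        {"text": "Hello world"},
        {"text": "សួស្តី"},
        {"text": "សួស្តី Hello"},
        {"text": "12345"},
    ]
    print(count_scripts(sample))
```

Because `count_scripts` accepts any iterable of dicts with a `text` key, it could in principle be pointed at rows from the loaded dataset, although scanning all 12 million examples this way would be slow.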
Provided by:
Chanrith123333