
LazyGreed/khmer_img2txt_200K

Source: Hugging Face. Updated 2026-03-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LazyGreed/khmer_img2txt_200K
Description:
---
license: mit
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
task_categories:
- image-to-text
language:
- km
- en
tags:
- Khmer
- ocr
- english
pretty_name: Khmer Image to Text dataset 200K images
size_categories:
- 100K<n<1M
---

# Dataset Card for Khmer English OCR 200K

## Dataset Summary

Khmer-English OCR 200K is a line-level OCR dataset stored as parquet files with two columns:

- `image`: a struct containing raw image bytes and a `path` field
- `text`: the transcription string for the image

The dataset currently contains:

- `train`: 179,988 examples
- `val`: 19,996 examples

This dataset appears to target OCR training for Khmer and mixed Khmer-English text.

## Dataset Structure

### Data Instances

Each row has the following structure:

```python
{
    "image": {
        "bytes": b"...",
        "path": "train_000256.png",
    },
    "text": "ហ្គីរ៉ូដ និង ពេជ អង្គ ដែលលើកឡើង",
}
```

### Data Fields

- `image.bytes`: Raw bytes of the image file.
- `image.path`: Original image filename from `labels.txt` (for example `train_000256.png`).
- `text`: UTF-8 transcription text. During parquet creation, carriage returns and newlines are removed.

## Intended Uses

- Training OCR models for mixed Khmer-English line recognition
- Validation and benchmarking for sequence recognition models

## Limitations

- The dataset may contain noisy labels, mixed scripts, punctuation variation, or OCR-unfriendly crops.
- The transcription preprocessing removes newline and carriage return characters.

## Citation

If you publish or share this dataset, add a citation here:

```bibtex
@dataset{kh_en_ocr_200k,
  title = {Khmer Image to Text dataset 200K images},
  author = {LazyGreed},
  year = {2026},
  note = {Line-level OCR dataset}
}
```
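As an illustration of the row structure under Data Instances and Data Fields, the sketch below inspects a single row using only the standard library. The helper names (`clean_transcription`, `looks_like_png`, `summarize_row`) and the example row are hypothetical and not part of the dataset. It also assumes carriage returns and newlines are deleted outright, since the card does not say whether they are replaced with a space.

```python
from typing import Any, Mapping

# Standard 8-byte PNG file signature, used here only for a cheap sanity check.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def clean_transcription(text: str) -> str:
    """Delete carriage returns and newlines (assumed to mirror the card's preprocessing)."""
    return text.replace("\r", "").replace("\n", "")


def looks_like_png(data: bytes) -> bool:
    """Return True if the byte string starts with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def summarize_row(row: Mapping[str, Any]) -> dict:
    """Summarize one row shaped like {"image": {"bytes", "path"}, "text"}."""
    image = row["image"]
    return {
        "path": image["path"],
        "num_bytes": len(image["bytes"]),
        "is_png": looks_like_png(image["bytes"]),
        "text": clean_transcription(row["text"]),
    }


# Hypothetical row: placeholder bytes stand in for a real image file.
example_row = {
    "image": {"bytes": PNG_SIGNATURE + b"placeholder", "path": "train_000256.png"},
    "text": "ហ្គីរ៉ូដ និង ពេជ អង្គ\r\n ដែលលើកឡើង",
}
summary = summarize_row(example_row)
print(summary["path"], summary["num_bytes"], summary["is_png"])
```

Rows read directly from the parquet files should have this shape. Loaders that decode the `image` column into an image object would need the summary adapted accordingly.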
Provider:
LazyGreed