LazyGreed/khmer_img2txt_200K

Name: LazyGreed/khmer_img2txt_200K
Creator: LazyGreed
Published: 2026-03-09 07:52:14
License: MIT (from the dataset card metadata)

Source: Hugging Face. Updated 2026-03-09; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/LazyGreed/khmer_img2txt_200K

Description:

---
license: mit
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
task_categories:
- image-to-text
language:
- km
- en
tags:
- Khmer
- ocr
- english
pretty_name: Khmer Image to Text dataset 200K images
size_categories:
- 100K<n<1M
---

# Dataset Card for Khmer English OCR 200K

## Dataset Summary

Khmer-English OCR 200K is a line-level OCR dataset stored as parquet files with two columns:

- `image`: a struct containing raw image bytes and a `path` field
- `text`: the transcription string for the image

The dataset currently contains:

- `train`: 179,988 examples
- `val`: 19,996 examples

This dataset appears to target OCR training for Khmer and mixed Khmer-English text.

## Dataset Structure

### Data Instances

Each row has the following structure:

```python
{
    "image": {
        "bytes": b"...",
        "path": "train_000256.png",
    },
    "text": "ហ្គីរ៉ូដ និង ពេជ អង្គ ដែលលើកឡើង",
}
```

### Data Fields

- `image.bytes`: Raw bytes of the image file.
- `image.path`: Original image filename from `labels.txt` (for example `train_000256.png`).
- `text`: UTF-8 transcription text. During parquet creation, carriage returns and newlines are removed.

## Intended Uses

- Training OCR models for mixed Khmer-English line recognition
- Validation and benchmarking for sequence recognition models

## Limitations

- The dataset may contain noisy labels, mixed scripts, punctuation variation, or OCR-unfriendly crops.
- The transcription preprocessing removes newline and carriage return characters.

## Citation

If you publish or share this dataset, add a citation here:

```bibtex
@dataset{kh_en_ocr_200k,
  title = {Khmer Image to Text dataset 200K images},
  author = {LazyGreed},
  year = {2026},
  note = {Line-level OCR dataset}
}
```
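## Example: Working with a Row

As a rough illustration of the row structure and text preprocessing described in this card, the sketch below handles one raw row using only the Python standard library. The helper names (`clean_transcription`, `save_row_image`, `looks_like_png`) are hypothetical and not part of the dataset. Two assumptions are made: that "removed" means the carriage return and newline characters are deleted without a replacement space, and that the row is read in its raw struct form, with `image` as a dict holding `bytes` and `path`. The PNG check is only a sanity test suggested by the `.png` filename in the example row; the card does not state that every image is a PNG.

```python
from pathlib import Path

# Standard 8-byte PNG file signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def clean_transcription(text: str) -> str:
    """Delete carriage returns and newlines from a transcription.

    Assumption: the characters are dropped outright, not replaced by spaces.
    """
    return text.replace("\r", "").replace("\n", "")


def save_row_image(row: dict, out_dir: Path) -> Path:
    """Write the raw image bytes of one row to out_dir, named after image.path."""
    image = row["image"]
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the filename component so the file always lands inside out_dir.
    target = out_dir / Path(image["path"]).name
    target.write_bytes(image["bytes"])
    return target


def looks_like_png(data: bytes) -> bool:
    """Return True if the bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)
```

A typical use would be to call `save_row_image(row, Path("images"))` for each row and pair the returned path with `row["text"]` when building a training manifest.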

Provider:

LazyGreed
