LazyGreed/khmer_img2txt_200K
Source: Hugging Face. Updated 2026-03-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LazyGreed/khmer_img2txt_200K
Resource overview:
---
license: mit
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
task_categories:
- image-to-text
language:
- km
- en
tags:
- Khmer
- ocr
- english
pretty_name: Khmer Image to Text dataset 200K images
size_categories:
- 100K<n<1M
---
# Dataset Card for Khmer-English OCR 200K
## Dataset Summary
Khmer-English OCR 200K is a line-level OCR dataset stored as parquet files with two columns:
- `image`: a struct containing raw image bytes and a `path` field
- `text`: the transcription string for the image
The dataset currently contains:
- `train`: 179,988 examples
- `val`: 19,996 examples
The dataset targets OCR training on Khmer and mixed Khmer-English text lines.
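As a quick sanity check on the split sizes above, the following sketch totals them and computes the validation share (roughly 10%). It also shows one way the splits might be loaded. The repository ID comes from the page title; the use of the Hugging Face `datasets` library, the helper name `load_splits`, and the `decode=False` cast are assumptions, not something this card specifies.

```python
# Split sizes exactly as stated in the card.
SPLIT_SIZES = {"train": 179_988, "val": 19_996}

total = sum(SPLIT_SIZES.values())
val_fraction = SPLIT_SIZES["val"] / total


def load_splits(repo_id="LazyGreed/khmer_img2txt_200K"):
    """Hypothetical helper: load the dataset with the `datasets` library.

    Assumes `datasets` is installed and the Hub repository is reachable.
    """
    from datasets import Image, load_dataset  # imported lazily on purpose

    ds = load_dataset(repo_id)
    # Keep the image column as {"bytes", "path"} instead of decoding to PIL.
    return ds.cast_column("image", Image(decode=False))


print(f"total examples: {total}")
print(f"validation share: {val_fraction:.2%}")
```

Calling `load_splits()` downloads the parquet files, so it is left out of the printed summary above.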
## Dataset Structure
### Data Instances
Each row has the following structure:
```python
{
    "image": {
        "bytes": b"...",
        "path": "train_000256.png",
    },
    "text": "ហ្គីរ៉ូដ និង ពេជ អង្គ ដែលលើកឡើង",
}
```
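A row of this shape can be checked without an imaging library. The sketch below reads the width and height from the PNG header of `image.bytes`. It assumes the images are PNG files, which the `.png` filename suggests but the card does not state; `png_size` and `check_row` are hypothetical helper names, and the demo bytes are a synthetic header, not a real image from the dataset.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG byte string."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG byte string")
    # Width and height are big-endian unsigned 32-bit ints at offsets 16 and 20.
    return struct.unpack(">II", data[16:24])


def check_row(row: dict) -> tuple[int, int]:
    """Validate one row of the documented shape and return the image size."""
    image = row["image"]
    if not isinstance(row["text"], str) or not isinstance(image["path"], str):
        raise TypeError("text and image.path must be strings")
    return png_size(image["bytes"])


# Synthetic 320x32 header for illustration only (no pixel data, no CRC).
fake_png = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)
    + b"IHDR"
    + struct.pack(">IIBBBBB", 320, 32, 8, 0, 0, 0, 0)
)
row = {"image": {"bytes": fake_png, "path": "train_000256.png"}, "text": "abc"}
size = check_row(row)
print(size)
```

For actual decoding, an imaging library such as Pillow would be the usual choice; the header check is only a cheap way to scan line widths and heights.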
### Data Fields
- `image.bytes`: Raw bytes of the image file.
- `image.path`: Original image filename from `labels.txt` (for example `train_000256.png`).
- `text`: UTF-8 transcription text. During parquet creation, carriage returns and newlines are removed.
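The text cleanup described above can be mirrored when preparing additional labels, so that new transcriptions match the existing ones. This is a minimal sketch that assumes the characters are deleted outright rather than replaced by spaces, as the wording suggests; the author's actual preprocessing code is not shown in this card, and both function names are hypothetical.

```python
def strip_line_breaks(text: str) -> str:
    """Delete carriage returns and newlines from a transcription."""
    return text.replace("\r", "").replace("\n", "")


def has_line_breaks(text: str) -> bool:
    """Report whether a transcription still contains CR or LF characters."""
    return "\r" in text or "\n" in text


raw = "first part\r\nsecond part\n"
clean = strip_line_breaks(raw)
print(repr(clean), has_line_breaks(clean))
```

Because the characters are deleted, words on either side of a line break are joined without a space, which is worth keeping in mind when comparing against externally produced labels.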
## Intended Uses
- Training OCR models for mixed Khmer-English line recognition
- Validation and benchmarking for sequence recognition models
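For benchmarking on the `val` split, character error rate (CER) is a common metric for line-level OCR. The card does not prescribe a metric, so the sketch below is one possible choice. It computes Levenshtein distance over Unicode code points; for Khmer, where one visible cluster can span several code points, a grapheme-based variant may be preferable. The empty-reference convention is also an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over Unicode code points (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a reference character
                    curr[j - 1] + 1,     # insert a hypothesis character
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        # Assumed convention: an empty reference scores 0.0 only if
        # the hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abed"))
```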
## Limitations
- The dataset may contain noisy labels, mixed scripts, punctuation variation, or OCR-unfriendly crops.
- The transcription preprocessing removes newline and carriage return characters.
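To get a feel for the script mix in a label before training, characters can be bucketed by Unicode range. This sketch counts Khmer code points (the Khmer block U+1780 to U+17FF and Khmer Symbols U+19E0 to U+19FF), ASCII Latin letters, and everything else, skipping whitespace. The function name and the three buckets are illustrative choices, not part of the dataset.

```python
def script_profile(text: str) -> dict:
    """Count Khmer, ASCII Latin, and other non-space characters in a label."""
    counts = {"khmer": 0, "latin": 0, "other": 0}
    for ch in text:
        if ch.isspace():
            continue
        cp = ord(ch)
        if 0x1780 <= cp <= 0x17FF or 0x19E0 <= cp <= 0x19FF:
            counts["khmer"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        else:
            # Digits, punctuation, and any other script land here.
            counts["other"] += 1
    return counts


profile = script_profile("abc \u1780\u1781 12")
print(profile)
```

Labels with a high `other` count are reasonable candidates for manual review, since they may indicate unusual punctuation or noisy transcriptions.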
## Citation
If you use this dataset, please cite it as:
```bibtex
@dataset{kh_en_ocr_200k,
  title = {Khmer Image to Text dataset 200K images},
  author = {LazyGreed},
  year = {2026},
  note = {Line-level OCR dataset}
}
```
Provider:
LazyGreed



