SoyVitou/KhmerSynthetic1M
Source: Hugging Face. Updated 2026-02-01; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/SoyVitou/KhmerSynthetic1M
Resource description:
---
license: apache-2.0
pretty_name: KhmerSynthetic1MZip (images embedded in Parquet)
tags:
- khmer
- ocr
- synthetic
dataset_info:
  features:
  - name: id
    dtype: int32
  - name: image
    dtype: image
  - name: label
    dtype: string
  - name: file_name
    dtype: string
---
# KhmerSynthetic1M (Compressed)
Synthetic Khmer OCR dataset (1,000,000 images) with labels. Images are renamed sequentially (`img_00000001.jpg`, …) and indexed by `metadata.parquet` for fast browsing in the Hugging Face data viewer.
## Contents
- `compressed_1m_dataset/`: JPEG images
- `compressed_1m_dataset/metadata.parquet`: manifest with columns:
  - `id`: integer row id
  - `image`: relative image filename
  - `img_path`: same as `image` (explicit for viewers)
  - `label`: ground-truth text
- `compressed_1m_dataset.db`: SQLite (`generated_meta`) mirroring the manifest
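The SQLite copy of the manifest can be queried directly when only a few labels are needed. The sketch below is a minimal illustration, not the author's tooling: it assumes the `generated_meta` table carries the same columns as the Parquet manifest (`id`, `image`, `img_path`, `label`), and it builds a small in-memory stand-in with made-up rows so that it runs without downloading the real `compressed_1m_dataset.db`. The helper name `lookup_label` is hypothetical.

```python
import sqlite3


def lookup_label(conn, image_name):
    """Return the label stored for one image filename, or None if absent."""
    cur = conn.execute(
        "SELECT label FROM generated_meta WHERE image = ?", (image_name,)
    )
    hit = cur.fetchone()
    return hit[0] if hit else None


# Stand-in database. The column layout is an assumption that mirrors the
# manifest columns listed above; the two rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE generated_meta ("
    "id INTEGER PRIMARY KEY, image TEXT, img_path TEXT, label TEXT)"
)
conn.executemany(
    "INSERT INTO generated_meta VALUES (?, ?, ?, ?)",
    [
        (1, "img_00000001.jpg", "img_00000001.jpg", "សួស្តី"),
        (2, "img_00000002.jpg", "img_00000002.jpg", "អរគុណ"),
    ],
)

print(lookup_label(conn, "img_00000001.jpg"))
```

To try this against the real file, replace `":memory:"` with the local path to `compressed_1m_dataset.db` and drop the `CREATE TABLE` and `INSERT` statements; adjust the column names if the actual schema differs.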
## Download / Use
```python
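from datasets import load_dataset

# Stream the dataset so examples are fetched lazily rather than downloaded up front.
ds = load_dataset("SoyVitou/KhmerSynthetic1M", streaming=True)
# Take the first example from the train split and show its image and label.
row = next(iter(ds["train"]))
print(row["image"], row["label"])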
```
## Generation notes
- Rendered with multiple Khmer fonts (plus limited Latin text), using curved-text augmentation and noise, lighting, brush, and smudge effects.
- Images compressed to reduce size (JPEG quality ~32, optional resize).
- Filenames flattened/sequential for easier indexing.
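
The flattened, sequential naming can be pictured as a mapping from the original nested render paths to names such as `img_00000001.jpg`. The sketch below is only an illustration of that idea: the function name `flatten_names`, the sorted ordering, and the example input paths are assumptions, not details of the actual generation pipeline. Only the `img_` prefix, the eight-digit zero padding, and the `.jpg` extension come from the filenames described above.

```python
def flatten_names(paths, start=1, width=8):
    """Map original paths to flat sequential names like img_00000001.jpg.

    Paths are sorted first so the numbering is deterministic; that ordering
    is an assumption made for this sketch.
    """
    mapping = {}
    for offset, original in enumerate(sorted(paths)):
        mapping[original] = f"img_{start + offset:0{width}d}.jpg"
    return mapping


# Hypothetical nested output paths from a renderer.
mapping = flatten_names(["fonts/b/0002.png", "fonts/a/0001.png"])
for original, flat in mapping.items():
    print(original, "->", flat)
```

Keeping such a mapping alongside the manifest makes it possible to trace a sequential filename back to the font or augmentation folder it came from.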
## License
Research and academic use only. Commercial use is not permitted. By using this dataset you agree to comply with these terms.
## Citation
If you use this dataset in a paper, please cite:
```
@inproceedings{YourName2024KhmerSynthetic1M,
title = {KhmerSynthetic1M: Large-Scale Synthetic Khmer OCR Dataset},
author = {Your Name and Coauthors},
booktitle = {Proceedings of ...},
year = {2024}
}
```
## Contact
Issues / feedback: open a discussion on the Hugging Face dataset page.
Provided by:
SoyVitou



