SoyVitou/KhmerSynthetic1M

Name: SoyVitou/KhmerSynthetic1M
Creator: SoyVitou
Published: 2026-02-01 10:52:31
License: No description available

Source: Hugging Face. Updated 2026-02-01. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/SoyVitou/KhmerSynthetic1M

Description:

---
license: apache-2.0
pretty_name: KhmerSynthetic1MZip (images embedded in Parquet)
tags:
- khmer
- ocr
- synthetic
dataset_info:
  features:
  - name: id
    dtype: int32
  - name: image
    dtype: image
  - name: label
    dtype: string
  - name: file_name
    dtype: string
---

# KhmerSynthetic1M (Compressed)

Synthetic Khmer OCR dataset (1,000,000 images) with labels. Images are renamed sequentially (`img_00000001.jpg`, …) and indexed by `metadata.parquet` for fast browsing in the Hugging Face data viewer.

## Contents

- `compressed_1m_dataset/`: JPEG images
- `compressed_1m_dataset/metadata.parquet`: manifest with columns:
  - `id`: integer row id
  - `image`: relative image filename
  - `img_path`: same as `image` (explicit for viewers)
  - `label`: ground-truth text
- `compressed_1m_dataset.db`: SQLite (`generated_meta`) mirroring the manifest

## Download / Use

```python
from datasets import load_dataset

ds = load_dataset("SoyVitou/KhmerSynthetic1M", streaming=True)
row = next(iter(ds["train"]))
print(row["image"], row["label"])
```

## Generation notes

- Rendered with multiple Khmer fonts (plus limited Latin), curved text augmentation, noise/lighting/brush/smudge effects.
- Images compressed to reduce size (JPEG quality ~32, optional resize).
- Filenames flattened/sequential for easier indexing.

## License

Research and academic use only. Commercial use is not permitted. By using this dataset you agree to comply with these terms.

## Citation

If you use this dataset in a paper, please cite:

```
@inproceedings{YourName2024KhmerSynthetic1M,
  title = {KhmerSynthetic1M: Large-Scale Synthetic Khmer OCR Dataset},
  author = {Your Name and Coauthors},
  booktitle = {Proceedings of ...},
  year = {2024}
}
```

## Contact

Issues / feedback: open a discussion on the Hugging Face dataset page.
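## Example: querying the SQLite mirror (sketch)

The Contents section mentions a SQLite file (`generated_meta`) that mirrors the manifest. The following is a minimal sketch of how such a mirror could be queried for a label by image filename. It makes several assumptions that the dataset card does not confirm: that `generated_meta` is a table name, that its columns match the manifest columns (`id`, `image`, `img_path`, `label`), and that filenames use an 8-digit zero-padded counter as in `img_00000001.jpg`. The helper names `sequential_name` and `lookup_label` are hypothetical, and the demo uses an in-memory database with made-up rows rather than the real `compressed_1m_dataset.db`.

```python
import sqlite3


def sequential_name(index: int) -> str:
    """Build a filename in the img_00000001.jpg pattern (8-digit padding assumed)."""
    return f"img_{index:08d}.jpg"


def lookup_label(conn: sqlite3.Connection, image_name: str):
    """Return the label for image_name, or None if no row matches.

    Assumes a table named generated_meta with an `image` and a `label` column.
    """
    row = conn.execute(
        "SELECT label FROM generated_meta WHERE image = ?",
        (image_name,),
    ).fetchone()
    return row[0] if row else None


# Demo on an in-memory database with invented sample rows.
# The real schema may differ; this layout is an assumption based on the manifest columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE generated_meta ("
    "id INTEGER PRIMARY KEY, image TEXT, img_path TEXT, label TEXT)"
)
sample_labels = ["សួស្តី", "Khmer OCR"]
rows = [
    (i, sequential_name(i), sequential_name(i), text)
    for i, text in enumerate(sample_labels, start=1)
]
conn.executemany("INSERT INTO generated_meta VALUES (?, ?, ?, ?)", rows)

print(lookup_label(conn, "img_00000002.jpg"))
```

To try this against the real file, replace `":memory:"` with the local path to `compressed_1m_dataset.db` and skip the `CREATE TABLE` and `INSERT` steps. Inspecting the schema first (for example with `SELECT sql FROM sqlite_master`) would be prudent, since the column layout above is only assumed.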

Provider:

SoyVitou
