Darayut/khmer-textline-detection

Name: Darayut/khmer-textline-detection
Creator: Darayut
Published: 2026-03-22 07:04:25
License: MIT (per the dataset card metadata)

Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Darayut/khmer-textline-detection

Resource description:

---
license: mit
task_categories:
- object-detection
language:
- km
tags:
- khmer
- cambodia
- document
- text-detection
- text-line
- synthetic
- YOLO
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*.parquet
---

# Synthetic Khmer Document Text-Line Detection Dataset

Synthetic dataset for **single-class text-line detection** on Cambodian (Khmer) official documents: press releases, ministry letters, formal memos.

Generated with a procedural Pillow-based pipeline featuring:

- **8 layout templates** (standard, letter, announcement, report, sparse, two-column, memo, plain)
- **12 page sizes** from A5 to A4-landscape
- **Variable margins, font sizes, line spacing, and indentation**
- **Photometric augmentations** (brightness, blur, noise, JPEG, shadow, vignette, fold/crease)

## Class

| ID | Name | Description |
|----|------|-------------|
| 0 | `text_line` | Any horizontal line of text |

## Dataset statistics

| Split | Images |
|-------|--------|
| train | 1732 |
| val | 300 |

## Schema

```python
{
    "image": Image(),
    "image_id": Value("string"),   # e.g. "kh_doc_000042"
    "split": Value("string"),      # "train" | "val"
    "width": Value("int32"),
    "height": Value("int32"),
    "annotations": Sequence({
        "bbox": Sequence(Value("float32"), length=4),  # [cx,cy,w,h] normalised
        "cls_id": Value("int32"),                      # always 0
    }),
}
```

## Load with HF Datasets

```python
from datasets import load_dataset

ds = load_dataset("Darayut/khmer-textline-detection")
sample = ds["train"][0]
print(sample["image"])        # PIL Image
print(sample["annotations"])  # dict of lists
```

## Raw YOLO files

`data/yolo_raw.zip` contains the native YOLO directory layout (`images/`, `labels/`, `dataset.yaml`) for direct Ultralytics training:

```python
from huggingface_hub import hf_hub_download
import zipfile, pathlib

zip_path = hf_hub_download(
    repo_id   = "Darayut/khmer-textline-detection",
    filename  = "data/yolo_raw.zip",
    repo_type = "dataset",
)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("./khmer_doc_yolo")
# yolo train data=khmer_doc_yolo/dataset.yaml model=yolo11n.pt
```
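
## Converting boxes to pixel coordinates

The schema stores each box as normalised `[cx,cy,w,h]`, so drawing or cropping text lines usually needs pixel corner coordinates first. The following is a minimal sketch of that conversion, assuming `annotations` arrives as a dict of lists with a `bbox` key, as the loading example's comment suggests. The helper name `yolo_to_pixel_boxes` and the sample values are hypothetical and not part of the dataset.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


def yolo_to_pixel_boxes(annotations: Dict[str, List], width: int, height: int) -> List[Box]:
    """Convert normalised [cx, cy, w, h] boxes to pixel (x_min, y_min, x_max, y_max).

    Assumes `annotations` is a dict of lists with a "bbox" key (hypothetical
    helper; the layout is inferred from the dataset card, not verified here).
    """
    boxes: List[Box] = []
    for cx, cy, w, h in annotations["bbox"]:
        # Scale the centre and size to pixels, then shift by half the size
        # to obtain the top-left and bottom-right corners.
        x_min = (cx - w / 2) * width
        y_min = (cy - h / 2) * height
        x_max = (cx + w / 2) * width
        y_max = (cy + h / 2) * height
        boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Made-up sample: one text line centred horizontally on a 1000x2000 page.
example = {"bbox": [[0.5, 0.25, 0.5, 0.125]], "cls_id": [0]}
print(yolo_to_pixel_boxes(example, width=1000, height=2000))
```

With a real record, the same call would take `sample["annotations"]`, `sample["width"]`, and `sample["height"]` from the loaded split.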

Provider:

Darayut
