
Darayut/khmer-textline-dataset_v2

Source: Hugging Face. Updated: 2026-04-04. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Darayut/khmer-textline-dataset_v2
Description:
---
license: mit
task_categories:
- object-detection
language:
- km
tags:
- khmer
- cambodia
- document
- text-detection
- logo-detection
- synthetic
- YOLO
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*.parquet
---

# Synthetic Khmer Document Text & Logo Detection Dataset

Synthetic dataset for **dual-class text-line and graphical element detection** on Cambodian (Khmer) official documents: press releases, ministry letters, formal memos, and ID cards.

Generated with a procedural Pillow-based pipeline featuring:

- **8 layout templates** (standard, letter, announcement, report, sparse, two-column, memo, plain)
- **Procedural Asset Injection** (stamps, seals, and logos with perfect bounding boxes)
- **12 page sizes** from A5 to A4-landscape
- **Variable margins, font sizes, line spacing, and indentation**
- **Photometric augmentations** (brightness, blur, noise, JPEG, shadow, vignette, fold/crease)

## Classes

| ID | Name | Description |
|----|------|-------------|
| 0 | `text_line` | Any horizontal line of text |
| 1 | `image` | Graphical elements including stamps, seals, logos, and photos |

## Dataset statistics

| Split | Images |
|-------|--------|
| train | 3997 |
| val | 246 |

## Schema

```python
{
    "image": Image(),
    "image_id": Value("string"),   # e.g. "kh_doc_000042"
    "split": Value("string"),      # "train" | "val"
    "width": Value("int32"),
    "height": Value("int32"),
    "annotations": Sequence({
        "bbox": Sequence(Value("float32"), length=4),  # [cx, cy, w, h] normalised
        "cls_id": Value("int32"),                      # 0: textline, 1: image
    }),
}
```
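
Because the schema stores each box as normalised `[cx, cy, w, h]`, most drawing or cropping code first needs pixel corner coordinates. The following is a minimal sketch of that conversion, not part of the dataset's own tooling. The helper names `yolo_to_pixel_xyxy` and `annotations_to_pixel_boxes` are hypothetical, and the sketch assumes a row's `annotations` field arrives as a dictionary of parallel lists (`bbox`, `cls_id`), which is how the `datasets` library usually exposes a `Sequence` of dictionaries.

```python
from typing import Dict, List, Sequence, Tuple

# Class names taken from the Classes table of the dataset card.
CLASS_NAMES: Dict[int, str] = {0: "text_line", 1: "image"}

Box = Tuple[float, float, float, float]


def yolo_to_pixel_xyxy(bbox: Sequence[float], width: int, height: int) -> Box:
    """Convert a normalised [cx, cy, w, h] box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = bbox
    x_min = (cx - w / 2.0) * width
    y_min = (cy - h / 2.0) * height
    x_max = (cx + w / 2.0) * width
    y_max = (cy + h / 2.0) * height
    return (x_min, y_min, x_max, y_max)


def annotations_to_pixel_boxes(
    annotations: Dict[str, list], width: int, height: int
) -> List[Tuple[str, Box]]:
    """Pair each class name with its pixel-space box.

    Assumes `annotations` is a dict of parallel lists: {"bbox": [...], "cls_id": [...]}.
    Unknown class IDs fall back to their string form instead of raising.
    """
    boxes = []
    for bbox, cls_id in zip(annotations["bbox"], annotations["cls_id"]):
        name = CLASS_NAMES.get(cls_id, str(cls_id))
        boxes.append((name, yolo_to_pixel_xyxy(bbox, width, height)))
    return boxes


# Hand-written example row fragment (illustrative values, not taken from the dataset).
example_annotations = {"bbox": [[0.5, 0.5, 0.5, 0.25]], "cls_id": [1]}
print(annotations_to_pixel_boxes(example_annotations, width=1000, height=800))
```

For a real row, `width`, `height`, and `annotations` would come from the corresponding schema fields. If the rows turn out to hold a list of per-box dictionaries instead, the loop in `annotations_to_pixel_boxes` would need to iterate over that list directly.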
Provider:
Darayut