
Darayut/Textline-Detection-Dataset

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Darayut/Textline-Detection-Dataset
Resource description:
---
license: cc-by-4.0
language:
- km
- en
task_categories:
- object-detection
tags:
- khmer
- text-detection
- ocr
- document-analysis
- yolo
- synthetic
- scene-text
pretty_name: Khmer Text Detection (Ultimate)
size_categories:
- 10K<n<100K
---

# Textline Detection: Ultimate Dataset

A large-scale, multi-source dataset for **textline detection** using YOLO-format bounding box annotations. It combines real scene text, document layout images, and synthetically generated Khmer document images.

---

## Dataset Summary

| Split | Images |
|-------|--------|
| Train | 30,658 |
| Val | 2,764 |
| **Total** | **33,422** |

### Classes

| ID | Name | Description |
|----|------|-------------|
| 0 | `text_line` | A line of Khmer or mixed-script text |
| 1 | `image` | An embedded image/figure region within a document |

---

## Data Sources

### 1. Real Scene Text

Images captured in natural environments: street signs, storefronts, billboards, and handwritten Khmer documents. Derived from the **DonkeySmall** base dataset (~26K images).

### 2. Document Layout

Images from the **DocLayNet** multi-class document layout corpus, re-labelled for the two-class schema. Covers official reports, newspapers, and books.

### 3. Synthetic Khmer Documents

Programmatically generated document images using custom Khmer text rendering pipelines. Fonts, sizes, backgrounds, and layouts are randomised. Labels are auto-generated, so they carry zero annotation cost.
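The reason synthetic labels cost nothing is that the generator already knows where it placed every line. The sketch below illustrates that idea only; it is not the dataset author's pipeline. The function name `synth_layout`, the page size, the margin, and all random ranges are assumptions chosen for illustration, and the actual text rendering step is left out.

```python
import random

def synth_layout(page_w=1240, page_h=1754, n_lines=12, seed=0):
    """Place text-line boxes on a blank page and return YOLO-style rows.

    Hypothetical helper: page size, margin, and ranges are illustrative.
    """
    rng = random.Random(seed)
    margin = 80
    y = margin
    rows = []
    for _ in range(n_lines):
        line_h = rng.randint(28, 56)                    # stands in for a random font size
        line_w = rng.randint(300, page_w - 2 * margin)  # random line length
        if y + line_h > page_h - margin:
            break                                       # page is full
        x = margin                                      # left-aligned text
        # A renderer would draw the text at (x, y) here; the box is already known.
        rows.append({
            "class_id": 0,                              # text_line
            "cx": (x + line_w / 2) / page_w,
            "cy": (y + line_h / 2) / page_h,
            "w": line_w / page_w,
            "h": line_h / page_h,
        })
        y += line_h + rng.randint(8, 24)                # random line spacing
    return rows

# Print three generated boxes as YOLO label lines.
for r in synth_layout(n_lines=3):
    print(f"{r['class_id']} {r['cx']:.6f} {r['cy']:.6f} {r['w']:.6f} {r['h']:.6f}")
```

Seeding the generator keeps each layout reproducible, which helps when a synthetic image and its label file need to be regenerated together.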
---

## Annotation Format

Labels follow the **YOLOv8** format, where `(cx, cy)` is the box centre and all coordinates are normalised to `[0, 1]`:

```
<class_id> <cx> <cy> <width> <height>
```

The `annotations` field is a JSON-serialised list:

```json
[
  {"class_id": 0, "cx": 0.512, "cy": 0.234, "w": 0.310, "h": 0.045},
  {"class_id": 1, "cx": 0.720, "cy": 0.600, "w": 0.200, "h": 0.250}
]
```

---

## Dataset Fields

| Field | Type | Description |
|-------|------|-------------|
| `image` | `Image` | Decoded PIL image |
| `image_path` | `string` | Original file path at collection time |
| `source` | `string` | `real_scene_text` / `doclaynet_khmer` / `synthetic_khmer_doc` |
| `split` | `string` | `train` or `val` |
| `annotations` | `string` | JSON list of YOLO bounding boxes |
| `num_objects` | `int32` | Number of annotated objects |

---

## Usage

```python
from datasets import load_dataset

# Repo id taken from the download link for this listing.
ds = load_dataset("Darayut/Textline-Detection-Dataset")
sample = ds["train"][0]
print(sample["source"], sample["num_objects"])
sample["image"].show()
```

### Convert back to YOLO label files

```python
import json
from pathlib import Path

def save_labels(split, out_dir):
    """Write one YOLO .txt label file per image in the given split."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for row in ds[split]:  # `ds` is the DatasetDict loaded in the usage snippet
        stem = Path(row["image_path"]).stem
        anns = json.loads(row["annotations"])
        with open(out_dir / f"{stem}.txt", "w") as f:
            for a in anns:
                f.write(f"{a['class_id']} {a['cx']:.6f} {a['cy']:.6f} {a['w']:.6f} {a['h']:.6f}\n")

save_labels("train", "labels/train")
save_labels("val", "labels/val")
```

### YAML config for YOLOv8 training

```yaml
train: /path/to/images/train
val: /path/to/images/val
nc: 2
names:
  0: text_line
  1: image
```

---

## Limitations

- Synthetic images may not capture all real-world degradation (blur, skew, lighting variation).
- Scene-text labels are semi-automatic and may have occasional missed detections.
- Dataset is primarily Khmer script; other scripts appear only incidentally.
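For drawing or cropping, the normalised centre-based boxes described under Annotation Format usually need to become pixel corner coordinates. The following is a minimal sketch of that conversion, assuming the JSON keys shown in the annotation example (`class_id`, `cx`, `cy`, `w`, `h`); the helper name `yolo_to_pixel_boxes` is hypothetical and not part of the dataset or of any library.

```python
import json

def yolo_to_pixel_boxes(annotations_json, img_w, img_h):
    """Convert a JSON list of normalised YOLO boxes to pixel corner boxes.

    Returns tuples of (class_id, x_min, y_min, x_max, y_max).
    """
    boxes = []
    for a in json.loads(annotations_json):
        w, h = a["w"] * img_w, a["h"] * img_h   # box size in pixels
        x_min = a["cx"] * img_w - w / 2         # centre -> left edge
        y_min = a["cy"] * img_h - h / 2         # centre -> top edge
        boxes.append((a["class_id"], x_min, y_min, x_min + w, y_min + h))
    return boxes

example = '[{"class_id": 0, "cx": 0.5, "cy": 0.25, "w": 0.5, "h": 0.125}]'
print(yolo_to_pixel_boxes(example, img_w=800, img_h=400))
# -> [(0, 200.0, 75.0, 600.0, 125.0)]
```

With a dataset row, the image size can come from the decoded PIL image, for example `yolo_to_pixel_boxes(row["annotations"], *row["image"].size)`, since PIL reports size as `(width, height)`.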
---

## License

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/): free to use, share, and adapt with attribution.

---
Provider: Darayut