Darayut/khmer-textline-detection
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Darayut/khmer-textline-detection
Resource description:
---
license: mit
task_categories:
- object-detection
language:
- km
tags:
- khmer
- cambodia
- document
- text-detection
- text-line
- synthetic
- YOLO
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*.parquet
---
# Synthetic Khmer Document Text-Line Detection Dataset
A synthetic dataset for **single-class text-line detection** on Cambodian
(Khmer) official documents such as press releases, ministry letters, and formal memos.
It is generated with a procedural Pillow-based pipeline featuring:
- **8 layout templates** (standard, letter, announcement, report, sparse,
two-column, memo, plain)
- **12 page sizes** from A5 to A4-landscape
- **Variable margins, font sizes, line spacing, and indentation**
- **Photometric augmentations** (brightness, blur, noise, JPEG, shadow, vignette, fold/crease)
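The generator itself is not included in this card, so the following is only a rough, pure-Python sketch of how randomised margins, font size, line spacing, and indentation could translate into normalised text-line boxes. The function name `layout_text_lines`, every numeric range, and the 1240x1754 page (roughly A4 at 150 DPI) are assumptions made for illustration, not values taken from the actual pipeline; rendering and photometric augmentation are left out.

```python
import random


def layout_text_lines(page_w, page_h, n_lines, rng):
    """Place up to n_lines text lines on a page.

    Returns a list of (cx, cy, w, h) boxes normalised to [0, 1].
    All ranges below are hypothetical, chosen only for illustration.
    """
    margin_x = rng.randint(int(0.06 * page_w), int(0.12 * page_w))
    margin_y = rng.randint(int(0.05 * page_h), int(0.10 * page_h))
    font_px = rng.randint(18, 32)      # assumed font-size range in pixels
    spacing = rng.uniform(1.3, 1.9)    # line pitch as a multiple of font size

    boxes = []
    y = margin_y
    for _ in range(n_lines):
        if y + font_px > page_h - margin_y:
            break                      # no room left above the bottom margin
        indent = rng.choice([0, 0, 2 * font_px])  # occasional paragraph indent
        x0 = margin_x + indent
        max_w = page_w - margin_x - x0            # space up to the right margin
        w = rng.uniform(0.4, 1.0) * max_w         # lines vary in length
        h = font_px
        boxes.append((
            (x0 + w / 2) / page_w,
            (y + h / 2) / page_h,
            w / page_w,
            h / page_h,
        ))
        y += int(font_px * spacing)
    return boxes


boxes = layout_text_lines(1240, 1754, n_lines=30, rng=random.Random(0))
print(len(boxes), boxes[0])
```

Passing a seeded `random.Random` keeps each synthetic page reproducible, which helps when regenerating a specific sample.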
## Class
| ID | Name | Description |
|----|------|-------------|
| 0 | `text_line` | Any horizontal line of text |
## Dataset statistics
| Split | Images |
|-------|--------|
| train | 1732 |
| val | 300 |
## Schema
```python
from datasets import Features, Image, Sequence, Value

features = Features({
    "image": Image(),
    "image_id": Value("string"),   # e.g. "kh_doc_000042"
    "split": Value("string"),      # "train" | "val"
    "width": Value("int32"),
    "height": Value("int32"),
    "annotations": Sequence({
        "bbox": Sequence(Value("float32"), length=4),  # [cx, cy, w, h], normalised
        "cls_id": Value("int32"),                      # always 0
    }),
})
```
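Because boxes are stored as normalised `[cx, cy, w, h]`, drawing or cropping usually needs pixel corner coordinates first. A minimal sketch of that conversion follows; the helper name `yolo_to_pixel_xyxy` is hypothetical, and clamping to the image bounds is an added precaution rather than something the dataset requires.

```python
def yolo_to_pixel_xyxy(bbox, width, height):
    """Convert a normalised [cx, cy, w, h] box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = bbox
    x0 = (cx - w / 2) * width
    y0 = (cy - h / 2) * height
    x1 = (cx + w / 2) * width
    y1 = (cy + h / 2) * height
    # Clamp so rounding noise never pushes a box outside the image.
    return (
        max(0.0, x0),
        max(0.0, y0),
        min(float(width), x1),
        min(float(height), y1),
    )


print(yolo_to_pixel_xyxy([0.5, 0.25, 0.5, 0.125], 1000, 2000))
# → (250.0, 375.0, 750.0, 625.0)
```

With a loaded sample, the `width` and `height` fields of the same record supply the last two arguments.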
## Load with HF Datasets
```python
from datasets import load_dataset
ds = load_dataset("Darayut/khmer-textline-detection")
sample = ds["train"][0]
print(sample["image"]) # PIL Image
print(sample["annotations"]) # dict of lists
```
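Since `sample["annotations"]` comes back as a dict of lists, the boxes and class IDs have to be zipped together before use. The sketch below turns one such dict into YOLO label lines; the helper name `annotations_to_yolo_lines` and the six-decimal formatting are assumptions, and the example dict is made up to mirror the documented layout.

```python
def annotations_to_yolo_lines(annotations):
    """Turn {"bbox": [[cx, cy, w, h], ...], "cls_id": [...]} into YOLO label lines."""
    lines = []
    for cls_id, (cx, cy, w, h) in zip(annotations["cls_id"], annotations["bbox"]):
        lines.append(f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


# Hand-written stand-in for sample["annotations"]; not real dataset values.
example = {
    "bbox": [[0.5, 0.25, 0.5, 0.125], [0.5, 0.5, 0.75, 0.0625]],
    "cls_id": [0, 0],
}
for line in annotations_to_yolo_lines(example):
    print(line)
```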
## Raw YOLO files
`data/yolo_raw.zip` contains the native YOLO directory layout
(`images/`, `labels/`, `dataset.yaml`) for direct Ultralytics training:
```python
from huggingface_hub import hf_hub_download
import zipfile

zip_path = hf_hub_download(
    repo_id="Darayut/khmer-textline-detection",
    filename="data/yolo_raw.zip",
    repo_type="dataset",
)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("./khmer_doc_yolo")
# Then train from the shell:
# yolo train data=khmer_doc_yolo/dataset.yaml model=yolo11n.pt
```
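Before training, it can be worth checking that the extracted label files parse cleanly. The sketch below assumes the standard YOLO text format (one `cls cx cy w h` row per box) and that the `.txt` files sit somewhere under `labels/`; the exact sub-folder layout inside the zip is not documented here, so the search is recursive. Both helper names are hypothetical.

```python
from pathlib import Path


def read_yolo_labels(label_path):
    """Parse one YOLO label file into a list of (cls_id, cx, cy, w, h) tuples."""
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            raise ValueError(f"{label_path}: expected 5 fields, got {len(parts)}")
        cx, cy, w, h = map(float, parts[1:])
        boxes.append((int(parts[0]), cx, cy, w, h))
    return boxes


def summarise_labels(root):
    """Count label files, boxes, and distinct class IDs under <root>/labels."""
    n_files, n_boxes, classes = 0, 0, set()
    for path in sorted((Path(root) / "labels").rglob("*.txt")):
        boxes = read_yolo_labels(path)
        n_files += 1
        n_boxes += len(boxes)
        classes.update(box[0] for box in boxes)
    return {"files": n_files, "boxes": n_boxes, "classes": sorted(classes)}


if Path("khmer_doc_yolo").exists():
    print(summarise_labels("khmer_doc_yolo"))
```

For this single-class dataset, the `classes` entry would be expected to contain only `0`.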
Provider:
Darayut



