Darayut/khmer-textline-dataset_v2
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Darayut/khmer-textline-dataset_v2
Description:
---
license: mit
task_categories:
- object-detection
language:
- km
tags:
- khmer
- cambodia
- document
- text-detection
- logo-detection
- synthetic
- YOLO
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*.parquet
---
# Synthetic Khmer Document Text & Logo Detection Dataset
A synthetic dataset for **two-class detection of text lines and graphical elements** in Cambodian
(Khmer) official documents: press releases, ministry letters, formal memos, and ID cards.
The images are generated with a procedural, Pillow-based pipeline that provides:
- **8 layout templates** (standard, letter, announcement, report, sparse, two-column, memo, plain)
- **Procedural asset injection** (stamps, seals, and logos whose bounding boxes are known exactly from their placement)
- **12 page sizes**, from A5 to A4-landscape
- **Variable margins, font sizes, line spacing, and indentation**
- **Photometric augmentations** (brightness, blur, noise, JPEG compression, shadow, vignette, fold/crease)
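The variation listed above can be pictured as sampling one page specification per image. The following is only a minimal sketch of that idea, not the author's pipeline: the template names come from the list above, but `PageSpec`, `sample_page_spec`, and every numeric range are hypothetical, and only the two page sizes the card names are included because the other ten are not listed.

```python
import random
from dataclasses import dataclass

# Template names as listed on the card.
TEMPLATES = (
    "standard", "letter", "announcement", "report",
    "sparse", "two-column", "memo", "plain",
)

# ISO 216 dimensions in millimetres (width, height). The card mentions 12
# page sizes but names only these two endpoints, so the rest are omitted.
PAGE_SIZES_MM = {"A5": (148, 210), "A4-landscape": (297, 210)}


@dataclass(frozen=True)
class PageSpec:
    """Hypothetical container for one sampled page configuration."""
    template: str
    page_size: str
    margin_mm: float
    font_size_pt: int
    line_spacing: float


def sample_page_spec(rng: random.Random) -> PageSpec:
    """Draw one page configuration; all numeric ranges are assumptions."""
    return PageSpec(
        template=rng.choice(TEMPLATES),
        page_size=rng.choice(sorted(PAGE_SIZES_MM)),
        margin_mm=round(rng.uniform(10.0, 30.0), 1),   # assumed range
        font_size_pt=rng.randint(10, 16),              # assumed range
        line_spacing=round(rng.uniform(1.0, 1.8), 2),  # assumed range
    )
```

Passing a seeded `random.Random` keeps the sampling reproducible, which is useful when a synthetic set has to be regenerated.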
## Classes
| ID | Name | Description |
|----|------|-------------|
| 0 | `text_line` | Any horizontal line of text |
| 1 | `image` | Graphical elements including stamps, seals, logos, and photos |
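As a small illustration of the class table above, the sketch below maps class IDs to names and counts instances per class. `CLASS_NAMES` mirrors the table; `count_by_class` is a hypothetical helper, and it assumes the IDs arrive as a flat iterable of integers (for example, the `cls_id` values of one image).

```python
from collections import Counter

# ID -> name mapping, copied from the class table.
CLASS_NAMES = {0: "text_line", 1: "image"}


def count_by_class(cls_ids):
    """Count annotations per class name; unknown IDs raise KeyError."""
    counts = Counter({name: 0 for name in CLASS_NAMES.values()})
    for cls_id in cls_ids:
        counts[CLASS_NAMES[int(cls_id)]] += 1
    return dict(counts)
```

Raising on an unknown ID, rather than ignoring it, makes label corruption visible early.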
## Dataset statistics
| Split | Images |
|-------|--------|
| train | 3997 |
| val | 246 |
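One way to load both splits and compare them against the table above is sketched below. It assumes the Hugging Face `datasets` library is installed and that the repository ID matches the page title; `load_splits` and `check_counts` are hypothetical helper names.

```python
REPO_ID = "Darayut/khmer-textline-dataset_v2"

# Row counts as stated in the statistics table.
EXPECTED_ROWS = {"train": 3997, "val": 246}


def load_splits(repo_id: str = REPO_ID):
    """Download the dataset and return a DatasetDict keyed by split name."""
    from datasets import load_dataset  # imported lazily; needs network access
    return load_dataset(repo_id)


def check_counts(actual, expected=EXPECTED_ROWS):
    """Return {split: (actual, expected)} for every split that disagrees."""
    return {
        name: (actual.get(name), n)
        for name, n in expected.items()
        if actual.get(name) != n
    }


# Example usage (requires network access):
#   ds = load_splits()
#   actual = {name: split.num_rows for name, split in ds.items()}
#   print(check_counts(actual) or "row counts match the card")
```

The two splits together hold 4,243 images, which is consistent with the `1K<n<10K` size category in the metadata.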
## Schema
```python
from datasets import Features, Image, Sequence, Value

features = Features({
    "image": Image(),
    "image_id": Value("string"),  # e.g. "kh_doc_000042"
    "split": Value("string"),     # "train" | "val"
    "width": Value("int32"),
    "height": Value("int32"),
    "annotations": Sequence({
        "bbox": Sequence(Value("float32"), length=4),  # [cx, cy, w, h], normalised to [0, 1]
        "cls_id": Value("int32"),                      # 0: text_line, 1: image
    }),
})
```
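Because the boxes are stored as normalised `[cx, cy, w, h]`, they usually need converting before drawing or cropping. The sketch below converts them to pixel corner coordinates and to YOLO-style label lines. All three function names are hypothetical. It also assumes that `annotations` may arrive either as a dict of lists (how `datasets` typically returns a `Sequence` of dicts) or as a list of dicts, so it accepts both.

```python
def iter_annotations(annotations):
    """Yield (bbox, cls_id) pairs from a dict of lists or a list of dicts."""
    if isinstance(annotations, dict):
        yield from zip(annotations["bbox"], annotations["cls_id"])
    else:
        for ann in annotations:
            yield ann["bbox"], ann["cls_id"]


def to_pixel_xyxy(bbox, width, height):
    """Convert normalised [cx, cy, w, h] to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = bbox
    x0 = (cx - w / 2) * width
    y0 = (cy - h / 2) * height
    x1 = (cx + w / 2) * width
    y1 = (cy + h / 2) * height
    return (x0, y0, x1, y1)


def to_yolo_lines(annotations):
    """Format annotations as 'cls cx cy w h' lines, one per box."""
    return [
        f"{int(cls_id)} " + " ".join(f"{v:.6f}" for v in bbox)
        for bbox, cls_id in iter_annotations(annotations)
    ]
```

For a row `ex`, `to_pixel_xyxy(bbox, ex["width"], ex["height"])` gives coordinates suitable for `PIL.ImageDraw.Draw.rectangle`. No clamping is applied, so a box that touches the page edge may come out slightly outside the image after floating-point rounding.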
Provider:
Darayut



