Darayut/khmer-textline-dataset
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/Darayut/khmer-textline-dataset
Resource description:
---
license: mit
task_categories:
- object-detection
language:
- km
tags:
- khmer
- cambodia
- document
- text-detection
- text-line
- synthetic
- YOLO
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*.parquet
---
# Synthetic Khmer Document Text-Line Detection Dataset
Synthetic dataset for **single-class text-line detection** on Cambodian
(Khmer) official documents — press releases, ministry letters, formal memos.
Generated with a procedural Pillow-based pipeline featuring:
- **8 layout templates** (standard, letter, announcement, report, sparse,
two-column, memo, plain)
- **12 page sizes** from A5 to A4-landscape
- **Variable margins, font sizes, line spacing, and indentation**
- **Photometric augmentations** (brightness, blur, noise, JPEG, shadow, vignette, fold/crease)
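The photometric step can be illustrated with a minimal Pillow sketch. This is not the author's pipeline: the function name `photometric_augment`, the parameter ranges, and the choice of only three effects (brightness, blur, JPEG recompression) are assumptions made for illustration. The point it shows is that such augmentations change pixel values but not geometry, so the text-line boxes stay valid without adjustment.

```python
import io
import random

from PIL import Image, ImageEnhance, ImageFilter


def photometric_augment(page: Image.Image, rng: random.Random) -> Image.Image:
    """Apply brightness jitter, a light blur, and JPEG recompression.

    Hypothetical helper: the ranges below are illustrative guesses,
    not values taken from the dataset's generator.
    """
    # Brightness: a factor of 1.0 leaves the image unchanged.
    page = ImageEnhance.Brightness(page).enhance(rng.uniform(0.8, 1.2))
    # Blur: a small Gaussian radius imitates a slightly soft scan.
    page = page.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.0, 1.0)))
    # JPEG: round-trip through an in-memory buffer to add compression artifacts.
    buffer = io.BytesIO()
    page.convert("RGB").save(buffer, format="JPEG", quality=rng.randint(50, 95))
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")


# Usage: the output keeps the input size, so normalised boxes are unaffected.
blank_page = Image.new("RGB", (64, 48), "white")
augmented = photometric_augment(blank_page, random.Random(0))
```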
## Class
| ID | Name | Description |
|----|------|-------------|
| 0 | `text_line` | Any horizontal line of text |
## Dataset statistics
| Split | Images |
|-------|--------|
| train | 1955 |
| val | 345 |
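As a quick check on the table above, the split proportions follow directly from the image counts. The short calculation below uses only the two numbers listed; the variable names are illustrative.

```python
# Image counts copied from the statistics table.
split_counts = {"train": 1955, "val": 345}

total = sum(split_counts.values())
split_fractions = {name: count / total for name, count in split_counts.items()}

for name, fraction in split_fractions.items():
    print(f"{name}: {split_counts[name]} images ({fraction:.0%})")
```

The total of 2300 images falls inside the declared `1K<n<10K` size category, and the split works out to 85% train and 15% val.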
## Schema
```python
from datasets import Image, Sequence, Value

{
    "image": Image(),
    "image_id": Value("string"),   # e.g. "kh_doc_000042"
    "split": Value("string"),      # "train" | "val"
    "width": Value("int32"),
    "height": Value("int32"),
    "annotations": Sequence({
        "bbox": Sequence(Value("float32"), length=4),  # [cx,cy,w,h] normalised
        "cls_id": Value("int32"),                      # always 0
    }),
}
```
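Many drawing and evaluation tools expect pixel corner coordinates rather than normalised centre boxes. The sketch below converts one `bbox` using the row's `width` and `height`. It assumes the usual YOLO convention, in which `cx` and `w` are fractions of the image width and `cy` and `h` are fractions of the image height. The function name `yolo_to_pixel_xyxy` is hypothetical.

```python
from typing import List, Tuple


def yolo_to_pixel_xyxy(
    bbox: List[float], width: int, height: int
) -> Tuple[float, float, float, float]:
    """Convert a normalised [cx, cy, w, h] box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = bbox
    x_min = (cx - w / 2) * width
    y_min = (cy - h / 2) * height
    x_max = (cx + w / 2) * width
    y_max = (cy + h / 2) * height
    # Clamp to the image bounds in case a box slightly overshoots the page edge.
    return (
        max(0.0, x_min),
        max(0.0, y_min),
        min(float(width), x_max),
        min(float(height), y_max),
    )


# A box centred on a 1000x800 page, half the page wide and a quarter of it tall.
print(yolo_to_pixel_xyxy([0.5, 0.5, 0.5, 0.25], width=1000, height=800))
# → (250.0, 300.0, 750.0, 500.0)
```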
## Load with HF Datasets
```python
from datasets import load_dataset
ds = load_dataset("Darayut/khmer-textline-dataset")
sample = ds["train"][0]
print(sample["image"]) # PIL Image
print(sample["annotations"]) # dict of lists
```
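Because `annotations` comes back as a dict of lists (one list per field), per-box processing is often easier after transposing it into a list of records. A minimal sketch of that step, with a hypothetical helper name and made-up sample values:

```python
from typing import Any, Dict, List


def annotations_to_records(annotations: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn {"bbox": [...], "cls_id": [...]} into [{"bbox": ..., "cls_id": ...}, ...]."""
    keys = list(annotations)
    columns = (annotations[key] for key in keys)
    # zip(*columns) walks the parallel lists together, one box at a time.
    return [dict(zip(keys, values)) for values in zip(*columns)]


# Illustrative values only, shaped like sample["annotations"].
example = {
    "bbox": [[0.5, 0.25, 0.8, 0.04], [0.5, 0.31, 0.8, 0.04]],
    "cls_id": [0, 0],
}
records = annotations_to_records(example)
for record in records:
    print(record["cls_id"], record["bbox"])
```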
## Raw YOLO files
`data/yolo_raw.zip` contains the native YOLO directory layout
(`images/`, `labels/`, `dataset.yaml`) for direct Ultralytics training:
```python
import zipfile

from huggingface_hub import hf_hub_download

# Download the archive from the dataset repo (cached locally by huggingface_hub).
zip_path = hf_hub_download(
    repo_id="Darayut/khmer-textline-dataset",
    filename="data/yolo_raw.zip",
    repo_type="dataset",
)

# Unpack images/, labels/, and dataset.yaml into ./khmer_doc_yolo.
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("./khmer_doc_yolo")

# Then train from the shell:
# yolo train data=khmer_doc_yolo/dataset.yaml model=yolo11n.pt
```
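To inspect the extracted labels without Ultralytics, a label file can be parsed by hand. The sketch below assumes the standard YOLO text format, one `class cx cy w h` row per box with normalised values. The card does not spell out the file contents, so treat that as an assumption. The function name and the sample text are illustrative.

```python
from typing import List, Tuple

YoloBox = Tuple[int, float, float, float, float]


def parse_yolo_labels(text: str) -> List[YoloBox]:
    """Parse the contents of one YOLO label file into (cls_id, cx, cy, w, h) tuples."""
    boxes: List[YoloBox] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            raise ValueError(f"line {line_number}: expected 5 fields, got {len(parts)}")
        cls_id = int(parts[0])
        cx, cy, w, h = (float(value) for value in parts[1:])
        boxes.append((cls_id, cx, cy, w, h))
    return boxes


# Made-up file contents with two text lines.
sample_text = "0 0.5 0.25 0.8 0.04\n\n0 0.5 0.31 0.8 0.04\n"
parsed = parse_yolo_labels(sample_text)
print(len(parsed), "boxes")
```

In practice the text would come from a file under `khmer_doc_yolo/labels/`, read with `pathlib.Path(...).read_text()`.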
Provider: Darayut



