phonsobon/khmer-word-segmentation
Source: Hugging Face | Updated: 2026-04-28 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/phonsobon/khmer-word-segmentation
Resource description:
---
language:
- km
tags:
- khmer
- nlp
- text
- synthetic-data
dataset_info:
  features:
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 196364308
    num_examples: 357681
  - name: validation
    num_bytes: 24547814
    num_examples: 44710
  - name: test
    num_bytes: 24554074
    num_examples: 44710
  download_size: 47923892
  dataset_size: 245466196
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
pretty_name: w
size_categories:
- 100K<n<1M
---
# Khmer Administrative Text Dataset 🇰🇭
## Overview
This dataset contains Khmer-language sentences that reflect formal administrative and government-style writing.
The dataset was **synthetically generated using the Gemini large language model**, developed by Google, to simulate official Khmer document language such as reports, letters, and institutional communication.
---
## Data Source Transparency
- Source: Synthetic data generated by Gemini (Google)
- Type: Artificially generated text (not collected from real government documents)
- Purpose: Research and experimentation in Khmer NLP
This dataset does **not represent real official documents**, but is designed to approximate their linguistic style.
---
## Dataset Structure
The dataset is split into three subsets:
- `train.txt`
- `validation.txt`
- `test.txt`
Each file contains **one sentence per line**.
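
As a minimal sketch of consuming the one-sentence-per-line layout, the helper below yields non-empty sentences from any iterable of lines. The names `iter_sentences` and `load_file` are illustrative and not part of the dataset, and the sketch assumes a split is available locally as a plain UTF-8 text file.

```python
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield one sentence per non-empty line, with surrounding whitespace removed."""
    for line in lines:
        sentence = line.strip()
        if sentence:  # skip blank lines
            yield sentence


def load_file(path: str) -> List[str]:
    """Read a local one-sentence-per-line file (hypothetical path) as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_sentences(handle))
```

For example, `load_file("train.txt")` would return the training sentences as a list, assuming that file exists in the working directory.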
---
## Data Format
Each line in the dataset is a single Khmer sentence, with words separated by spaces. For example:
សូម លោក ប្រធាន នាយកដ្ឋាន បច្ចេកវិទ្យា គមនាគមន៍ និង ព័ត៌មាន មេត្តា ពិនិត្យ និង អនុម័ត តាម នីតិវិធី រដ្ឋបាល លើ របាយការណ៍ វឌ្ឍនភាព ការងារ ប្រចាំ ខែ ។
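
Because the words in the example above are separated by spaces, a line can be turned into a training pair for word segmentation. The sketch below is one possible encoding, not something the dataset prescribes. It removes the spaces and emits one label per Unicode code point, where 1 marks the first code point of a word. The function name `to_boundary_labels` is hypothetical, and labelling per code point rather than per grapheme cluster is an assumption made for simplicity.

```python
from typing import List, Tuple


def to_boundary_labels(segmented: str) -> Tuple[str, List[int]]:
    """Return the unsegmented text and a 0/1 label per code point (1 = word start)."""
    words = segmented.split()  # split on whitespace, dropping empty pieces
    text = "".join(words)      # the sentence as it would appear without spaces
    labels: List[int] = []
    for word in words:
        labels.extend([1] + [0] * (len(word) - 1))
    return text, labels


# Fragment taken from the example sentence above
sample = "សូម លោក ប្រធាន"
text, labels = to_boundary_labels(sample)
```

By construction, `labels` has the same length as `text`, and the number of 1s equals the number of words in the input line.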
---
## Language
- Language: Khmer (km)
- Script: Khmer Unicode (UTF-8 encoding)
---
## Use Cases
This dataset can be used for:
- Language Modeling (LLM training)
- Text Generation
- Khmer NLP experimentation
- Prompt engineering and synthetic data research
- Pretraining or fine-tuning language models
---
## Loading the Dataset
```python
from datasets import load_dataset

# Download (or load from cache) the train, validation, and test splits
dataset = load_dataset("phonsobon/khmer-word-segmentation")
```
Provider:
phonsobon



