phonsobon/khmer-word-segmentation

Name: phonsobon/khmer-word-segmentation
Creator: phonsobon
Published: 2026-04-28 02:12:47
License: No description available

Hugging Face | Updated: 2026-04-28 | Indexed: 2026-05-03

Download link:

https://hf-mirror.com/datasets/phonsobon/khmer-word-segmentation

Resource description:

---
language:
- km
tags:
- khmer
- nlp
- text
- synthetic-data
dataset_info:
  features:
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 196364308
    num_examples: 357681
  - name: validation
    num_bytes: 24547814
    num_examples: 44710
  - name: test
    num_bytes: 24554074
    num_examples: 44710
  download_size: 47923892
  dataset_size: 245466196
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
pretty_name: w
size_categories:
- 100K<n<1M
---

# Khmer Administrative Text Dataset 🇰🇭

## Overview

This dataset contains Khmer-language sentences that reflect formal administrative and government-style writing.

The dataset was **synthetically generated using the Gemini large language model**, developed by Google, to simulate official Khmer document language such as reports, letters, and institutional communication.

---

## Data Source Transparency

- Source: Synthetic data generated by Gemini (Google)
- Type: Artificially generated text (not collected from real government documents)
- Purpose: Research and experimentation in Khmer NLP

This dataset does **not represent real official documents**, but is designed to approximate their linguistic style.

---

## Dataset Structure

The dataset is split into three subsets:

- `train.txt`
- `validation.txt`
- `test.txt`

Each file contains **one sentence per line**.

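The split layout above can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: the example counts are copied from the card metadata, while `split_fractions` and `read_sentences` are hypothetical helper names, and the reader assumes a plain-text stream with one sentence per line as the card describes. The counts work out to roughly an 80/10/10 split, and the reader skips blank lines so that a trailing newline does not produce an empty sentence.

```python
import io

# Example counts copied from the dataset card metadata.
SPLIT_SIZES = {"train": 357681, "validation": 44710, "test": 44710}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def read_sentences(stream):
    """Yield non-empty lines from a text stream, stripped of surrounding whitespace."""
    for line in stream:
        sentence = line.strip()
        if sentence:
            yield sentence


fractions = split_fractions(SPLIT_SIZES)

# In-memory stand-in for a split file (assumption: one sentence per line).
sample = io.StringIO("first sentence\n\nsecond sentence\n")
sentences = list(read_sentences(sample))

for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
print(sentences)
```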

---

## Data Format

Each line in the dataset is a single Khmer sentence.

Example:

សូម លោក ប្រធាន នាយកដ្ឋាន បច្ចេកវិទ្យា គមនាគមន៍ និង ព័ត៌មាន មេត្តា ពិនិត្យ និង អនុម័ត តាម នីតិវិធី រដ្ឋបាល លើ របាយការណ៍ វឌ្ឍនភាព ការងារ ប្រចាំ ខែ ។

---

## Language

- Language: Khmer (km)
- Script: Khmer Unicode (UTF-8 encoding)

---

## Use Cases

This dataset can be used for:

- Language Modeling (LLM training)
- Text Generation
- Khmer NLP experimentation
- Prompt engineering and synthetic data research
- Pretraining or fine-tuning language models

---

## Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("phonsobon/khmer-word-segmentation")
```
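The sample sentence in the Data Format section is space-delimited, and the repository name suggests word segmentation, although the card does not describe a label scheme. As a minimal sketch, assuming that spaces mark word boundaries, a sentence can be turned into an unsegmented string plus per-character boundary labels. The function name `to_segmentation_example` and the 1/0 labeling are illustrative assumptions, not something the dataset defines.

```python
def to_segmentation_example(sentence):
    """Convert a space-delimited sentence into (text, labels).

    text   : the sentence with all whitespace removed
    labels : one int per Unicode code point in text;
             1 if that code point starts a word, otherwise 0
    Assumption: whitespace in the input marks word boundaries.
    """
    words = sentence.split()
    text = "".join(words)
    labels = []
    for word in words:
        # First code point of each word opens a new word.
        labels.extend([1] + [0] * (len(word) - 1))
    return text, labels


# First three words of the sample sentence from the card.
example = "សូម លោក ប្រធាន"
text, labels = to_segmentation_example(example)

print(text)
print(sum(labels), "words,", len(labels), "code points")
```

Labels here are per Unicode code point, so Khmer vowel signs and subscript consonants each get their own label. A model that operates on character clusters would need a different alignment.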

Provided by:

phonsobon
