
phonsobon/khmer-word-segmentation

Source: Hugging Face. Updated 2026-04-28. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/phonsobon/khmer-word-segmentation
Resource description:
---
language:
- km
tags:
- khmer
- nlp
- text
- synthetic-data
dataset_info:
  features:
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 196364308
    num_examples: 357681
  - name: validation
    num_bytes: 24547814
    num_examples: 44710
  - name: test
    num_bytes: 24554074
    num_examples: 44710
  download_size: 47923892
  dataset_size: 245466196
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
pretty_name: w
size_categories:
- 100K<n<1M
---

# Khmer Administrative Text Dataset 🇰🇭

## Overview

This dataset contains Khmer-language sentences that reflect formal administrative and government-style writing.

The dataset was **synthetically generated using the Gemini large language model**, developed by Google, to simulate official Khmer document language such as reports, letters, and institutional communication.

---

## Data Source Transparency

- Source: Synthetic data generated by Gemini (Google)
- Type: Artificially generated text (not collected from real government documents)
- Purpose: Research and experimentation in Khmer NLP

This dataset does **not represent real official documents**, but is designed to approximate their linguistic style.

---

## Dataset Structure

The dataset is split into three subsets:

- `train.txt`
- `validation.txt`
- `test.txt`

Each file contains **one sentence per line**.

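
As a quick sanity check on the three subsets, the sketch below recomputes each split's share of the data from the `num_examples` and `num_bytes` values in the card's metadata. The numbers are copied from the card; the dictionary layout and variable names are illustrative only.

```python
# Split sizes copied from the dataset card's metadata block.
splits = {
    "train": {"num_examples": 357681, "num_bytes": 196364308},
    "validation": {"num_examples": 44710, "num_bytes": 24547814},
    "test": {"num_examples": 44710, "num_bytes": 24554074},
}

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# Fraction of all sentences that falls into each split.
shares = {name: s["num_examples"] / total_examples for name, s in splits.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# The byte total should agree with the card's dataset_size field (245466196).
print(total_examples, total_bytes)
```

The split works out to roughly 80/10/10, and the per-split byte counts add up to the `dataset_size` reported in the metadata.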
---

## Data Format

Each line in the dataset is a single Khmer sentence.

Example:

សូម លោក ប្រធាន នាយកដ្ឋាន បច្ចេកវិទ្យា គមនាគមន៍ និង ព័ត៌មាន មេត្តា ពិនិត្យ និង អនុម័ត តាម នីតិវិធី រដ្ឋបាល លើ របាយការណ៍ វឌ្ឍនភាព ការងារ ប្រចាំ ខែ ។

---

## Language

- Language: Khmer (km)
- Script: Khmer Unicode (UTF-8 encoding)

---

## Use Cases

This dataset can be used for:

- Language Modeling (LLM training)
- Text Generation
- Khmer NLP experimentation
- Prompt engineering and synthetic data research
- Pretraining or fine-tuning language models

---

## Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("phonsobon/khmer-word-segmentation")
```
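
The example sentence in the Data Format section is written with spaces between units, which is unusual for ordinary Khmer text. Together with the repository name, this suggests the spaces mark word boundaries, but the card does not state that explicitly. Under that assumption, a minimal offline sketch for turning one line into tokens could look like this (the function name `split_words` is hypothetical):

```python
def split_words(sentence: str) -> list[str]:
    """Split a sentence on whitespace.

    Assumption: spaces in the dataset mark word boundaries.
    The dataset card does not confirm this.
    """
    return sentence.split()


# Example sentence copied from the dataset card.
sentence = (
    "សូម លោក ប្រធាន នាយកដ្ឋាន បច្ចេកវិទ្យា គមនាគមន៍ និង ព័ត៌មាន មេត្តា "
    "ពិនិត្យ និង អនុម័ត តាម នីតិវិធី រដ្ឋបាល លើ របាយការណ៍ វឌ្ឍនភាព "
    "ការងារ ប្រចាំ ខែ ។"
)

tokens = split_words(sentence)

# Removing the spaces gives the unsegmented form that a
# word-segmentation model would typically take as input.
unsegmented = "".join(tokens)

print(len(tokens))
print(tokens[:3])
```

Under the same assumption, pairing `unsegmented` with `tokens` gives an input and target for training or evaluating a segmenter. The final token is the Khmer full stop `។`, so punctuation is counted as its own token here.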
Provider:
phonsobon