nphearum/khmer-raw-text-3M-v2

Name: nphearum/khmer-raw-text-3M-v2
Creator: nphearum
Published: 2026-04-26 04:50:14
License: No description available

Source: Hugging Face. Updated: 2026-04-26. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/nphearum/khmer-raw-text-3M-v2

Overview:

nphearum/khmer-raw-text-3M-v2 is a large-scale raw text corpus containing approximately 200,000 complete records and 3 million text segments. The primary language is Khmer, with some English content. It is designed for large language model (LLM) pre-training, continued pre-training, and domain adaptation, and aims to address the scarcity of Khmer, a historically underrepresented low-resource language, in existing corpora. The dataset spans multiple domains, including general knowledge, educational material, public information, and mixed bilingual content; it emphasizes natural Khmer usage and domain diversity while retaining bilingual context to support cross-lingual learning. The data is collected from publicly available sources with basic cleaning applied (such as deduplication and Unicode normalization). It carries no explicit labels, so it is suited to unsupervised or weakly supervised language model training rather than supervised learning.
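
The description mentions basic cleaning such as deduplication and Unicode normalization but does not say how it was done. The following is only a minimal sketch of what such a step might look like, using the Python standard library. The NFC normalization form, the exact-match deduplication, the whitespace trimming, and the helper name `clean_corpus` are all assumptions made for illustration, not the dataset author's actual pipeline.

```python
import unicodedata


def clean_corpus(texts, form="NFC"):
    """Hypothetical helper: normalize each text, then drop exact duplicates.

    The normalization form (NFC) and exact-match deduplication are
    assumptions; the dataset page does not specify either.
    """
    seen = set()
    cleaned = []
    for text in texts:
        # Bring the text to a single Unicode form and trim outer whitespace.
        normalized = unicodedata.normalize(form, text).strip()
        # Skip empty strings and anything already seen after normalization.
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned


samples = [
    "ភាសាខ្មែរ",      # Khmer text
    "ភាសាខ្មែរ ",     # same text with a trailing space
    "Cafe\u0301",     # "e" followed by a combining acute accent
    "Caf\u00e9",      # precomposed "é"
    "",               # empty record
]
cleaned = clean_corpus(samples)
print(len(cleaned))
```

Normalizing before deduplicating matters here: the two spellings of "Café" only compare equal once both are in the same Unicode form, so deduplicating raw strings would keep both.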

Provider:

nphearum
