Name: actableai/data-khmer
Creator: actableai
Published: 2026-04-21 09:31:28
License: No description available

Download link:

https://hf-mirror.com/datasets/actableai/data-khmer

Resource description:

---
language:
- km
license: other
task_categories:
- automatic-speech-recognition
pretty_name: Khmer ASR v4 (actableai)
size_categories:
- 100K<n<1M
tags:
- khmer
- asr
- speech
- audio
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: duration
    dtype: float32
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 56199107865
    num_examples: 486046
  - name: validation
    num_bytes: 1181228871
    num_examples: 9910
  download_size: 59169192403
  dataset_size: 57380336736
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

# Khmer ASR Dataset (v4)

A consolidated Khmer (ខ្មែរ) speech corpus used to train the actableai Khmer ASR model (NeMo, v4 checkpoint). Combines 17 sub-corpora spanning call-center recordings, broadcast/cultural content, public ASR corpora, and TTS-synthesised speech.

## Summary

| Split      | Utterances  | Hours      |
|------------|------------:|-----------:|
| train      | 486,046     | 502.77     |
| validation | 9,910       | 10.20      |
| **total**  | **495,956** | **512.97** |

- Sampling rate: 16 kHz, mono, PCM_16.
- Text: Khmer script (light normalization from the training pipeline only).

## Schema

| Field     | Type              | Description                        |
|-----------|-------------------|------------------------------------|
| audio     | `Audio(16000 Hz)` | Mono waveform embedded in parquet. |
| text      | `string`          | Khmer transcript.                  |
| duration  | `float32`         | Seconds.                           |
| source    | `string`          | Sub-corpus tag (see below).        |

## Composition by source

| Source               | Train     | Validation |
|----------------------|----------:|-----------:|
| metfone_v1_v3        | 173,836   | 3,547      |
| synth_full           | 107,488   | 2,193      |
| ddd_cultural         | 55,570    | 1,134      |
| sethisak_en_kh       | 26,698    | 544        |
| sethisak_kh_en_v2    | 26,698    | 544        |
| rinabuoy_train       | 24,833    | 506        |
| sethisak             | 21,798    | 444        |
| djsamseng_large      | 17,628    | 359        |
| km_corpus            | 14,645    | 298        |
| sethisak_asr         | 5,545     | 113        |
| openslr42            | 2,848     | 58         |
| kheng                | 2,771     | 56         |
| mpwt                 | 2,017     | 41         |
| shunya               | 1,630     | 33         |
| moses                | 949       | 19         |
| rinabuoy_test        | 750       | 15         |
| grkpp                | 342       | 6          |

`synth_full` is TTS-synthesised speech produced by the actableai OmniVoice pipeline, added in v4 to widen phonetic and prosodic coverage.

## Usage

```python
from datasets import load_dataset

# Stream the train split instead of downloading the full ~59 GB archive.
ds = load_dataset("actableai/data-khmer", split="train", streaming=True)

# Pull one example and inspect its fields.
example = next(iter(ds))
print(example["text"], example["duration"], example["source"])
print(example["audio"]["sampling_rate"], example["audio"]["array"].shape)
```

For ASR evaluation, note that Khmer is a syllabic script without reliable word boundaries, so prefer **CER** over WER.

## Provenance & licensing

This is a research aggregation of multiple upstream sources. The licensing of each sub-corpus follows its upstream project; downstream users are responsible for verifying compliance with each source's terms before commercial use. Synthesised audio (`synth_full`) is released under the same terms as the upstream TTS voices it was produced from.
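The recommendation to prefer CER over WER can be illustrated with a minimal sketch. The helper names `normalize`, `edit_distance`, and `cer` are hypothetical and are not part of this dataset or of the actableai pipeline. The sketch assumes that a "character" is a Unicode code point after NFC normalization (not a grapheme cluster) and that all whitespace is dropped before scoring; an actual evaluation setup may define either differently.

```python
import unicodedata


def normalize(text):
    # Assumption: NFC-normalize and remove all whitespace before scoring.
    return "".join(unicodedata.normalize("NFC", text).split())


def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits / total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref_chars = normalize(ref)
        hyp_chars = normalize(hyp)
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("references contain no characters to score")
    return total_edits / total_chars
```

Here `references` would be the `text` field of validation examples and `hypotheses` the model outputs in the same order. Summing edits over the whole corpus before dividing keeps long utterances from being under-weighted relative to a per-utterance average.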

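Because licensing follows each upstream sub-corpus, a downstream user may want to restrict the data to sources whose terms they have checked. The following is a rough sketch, assuming only that each example is a dict-like record with the `source` and `duration` fields from the schema. The function names are hypothetical, and the source list passed in is purely illustrative; it is not a statement about which sub-corpora permit which uses.

```python
from collections import defaultdict


def filter_by_source(examples, allowed_sources):
    """Yield only the examples whose `source` tag is in `allowed_sources`."""
    allowed = set(allowed_sources)
    for ex in examples:
        if ex["source"] in allowed:
            yield ex


def hours_by_source(examples):
    """Sum the `duration` field (seconds) per source and convert to hours."""
    totals = defaultdict(float)
    for ex in examples:
        totals[ex["source"]] += ex["duration"] / 3600.0
    return dict(totals)
```

Both helpers accept any iterable, so they can wrap a streaming split directly, for example `filter_by_source(ds, ["openslr42"])`, without loading the full corpus into memory.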

Application scenarios: