actableai/data-khmer
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/actableai/data-khmer
Resource description:
---
language:
- km
license: other
task_categories:
- automatic-speech-recognition
pretty_name: Khmer ASR v4 (actableai)
size_categories:
- 100K<n<1M
tags:
- khmer
- asr
- speech
- audio
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: duration
    dtype: float32
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 56199107865
    num_examples: 486046
  - name: validation
    num_bytes: 1181228871
    num_examples: 9910
  download_size: 59169192403
  dataset_size: 57380336736
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---
# Khmer ASR Dataset (v4)
A consolidated Khmer (ខ្មែរ) speech corpus used to train the actableai Khmer ASR
model (NeMo, v4 checkpoint). It combines 17 sub-corpora spanning call-center
recordings, broadcast/cultural content, public ASR corpora, and TTS-synthesised
speech.
## Summary
| Split | Utterances | Hours |
|------------|-----------:|--------:|
| train | 486,046 | 502.77 |
| validation | 9,910 | 10.20 |
| **total** | **495,956** | **512.97** |
- Audio: 16 kHz sampling rate, mono, PCM_16.
- Text: Khmer script, with only the light normalization applied by the training pipeline.
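
As a quick sanity check on the summary figures, the average utterance length per split can be derived from the table. This is only a sketch: the numbers are copied from the table above, and the helper name `mean_duration_seconds` is made up for illustration.

```python
# Figures copied from the summary table; nothing here touches the dataset itself.
SPLITS = {
    "train": {"utterances": 486_046, "hours": 502.77},
    "validation": {"utterances": 9_910, "hours": 10.20},
}


def mean_duration_seconds(utterances: int, hours: float) -> float:
    """Average utterance length in seconds for one split."""
    return hours * 3600.0 / utterances


for name, stats in SPLITS.items():
    print(f"{name}: {mean_duration_seconds(**stats):.2f} s per utterance")
```

Both splits come out at roughly 3.7 seconds per utterance, so the corpus consists mostly of short clips.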
## Schema
| Field | Type | Description |
|-----------|-------------------|--------------------------------------------|
| audio | `Audio(16000 Hz)` | Mono waveform embedded in parquet. |
| text | `string` | Khmer transcript. |
| duration | `float32` | Seconds. |
| source | `string` | Sub-corpus tag (see below). |
## Composition by source
| Source | Train | Validation |
|----------------------|----------:|-----------:|
| metfone_v1_v3 | 173,836 | 3,547 |
| synth_full | 107,488 | 2,193 |
| ddd_cultural | 55,570 | 1,134 |
| sethisak_en_kh | 26,698 | 544 |
| sethisak_kh_en_v2 | 26,698 | 544 |
| rinabuoy_train | 24,833 | 506 |
| sethisak | 21,798 | 444 |
| djsamseng_large | 17,628 | 359 |
| km_corpus | 14,645 | 298 |
| sethisak_asr | 5,545 | 113 |
| openslr42 | 2,848 | 58 |
| kheng | 2,771 | 56 |
| mpwt | 2,017 | 41 |
| shunya | 1,630 | 33 |
| moses | 949 | 19 |
| rinabuoy_test | 750 | 15 |
| grkpp | 342 | 6 |

`synth_full` is TTS-synthesised speech produced by the actableai OmniVoice
pipeline. It was added in v4 to widen phonetic and prosodic coverage.
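
`synth_full` accounts for 107,488 of the 486,046 training utterances (about 22%), so some users may want to train or evaluate on recorded speech only. A minimal sketch of filtering on the `source` field follows. The helper `drop_sources` is hypothetical, and it runs here on a small in-memory stand-in rather than the real dataset.

```python
from typing import Iterable, Iterator


def drop_sources(examples: Iterable[dict], excluded: set[str]) -> Iterator[dict]:
    """Yield only the examples whose `source` tag is not in `excluded`."""
    for example in examples:
        if example["source"] not in excluded:
            yield example


# Tiny stand-in rows with the same `source` field as the dataset schema.
rows = [
    {"text": "a", "source": "metfone_v1_v3"},
    {"text": "b", "source": "synth_full"},
    {"text": "c", "source": "openslr42"},
]

recorded_only = list(drop_sources(rows, {"synth_full"}))
print([row["source"] for row in recorded_only])
```

Because the helper accepts any iterable of dicts, it should also work on a streaming split. The `datasets` library offers an equivalent through `ds.filter(lambda ex: ex["source"] != "synth_full")`.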
## Usage
```python
from datasets import load_dataset
ds = load_dataset("actableai/data-khmer", split="train", streaming=True)
example = next(iter(ds))
print(example["text"], example["duration"], example["source"])
print(example["audio"]["sampling_rate"], example["audio"]["array"].shape)
```
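
When inspecting examples, it can be useful to check that the stored `duration` agrees with the decoded waveform. The sketch below assumes that `duration` is simply the sample count divided by the sampling rate, which this card does not state explicitly. The helper names and the 0.05 s tolerance are illustrative choices. It uses a synthetic example shaped like the dict access pattern above, so it runs without downloading anything.

```python
def duration_from_audio(example: dict) -> float:
    """Duration in seconds implied by the mono waveform length."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def duration_is_consistent(example: dict, tolerance: float = 0.05) -> bool:
    """True if the stored `duration` matches the waveform within `tolerance` seconds."""
    return abs(duration_from_audio(example) - example["duration"]) <= tolerance


# Synthetic stand-in: 2 seconds of silence at 16 kHz.
fake_example = {
    "audio": {"array": [0.0] * 32_000, "sampling_rate": 16_000},
    "text": "",
    "duration": 2.0,
    "source": "example",
}

print(duration_from_audio(fake_example), duration_is_consistent(fake_example))
```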
For ASR evaluation, note that Khmer is written without spaces between words,
so word boundaries are not reliable. Prefer **CER** over WER.
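
For reference, CER is the character-level edit distance between hypothesis and reference, divided by the reference length. The following is a minimal pure-Python sketch, not the scoring code used for the actableai model. It counts Unicode code points, so Khmer subscript signs and dependent vowels each count as one character, and it applies no text normalization. Both choices are assumptions to adapt as needed.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over Unicode code points (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against a non-empty reference `ref`."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "ខ្មែរ"
hypothesis = reference[:-1]  # drop the final code point to simulate one deletion
print(f"CER = {cer(reference, hypothesis):.3f}")
```

For corpus-level scores, it is common to sum the edit distances and the reference lengths over all utterances before dividing, rather than averaging per-utterance CER values.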
## Provenance & licensing
This is a research aggregation of multiple upstream sources. The licensing of
each sub-corpus follows its upstream project; downstream users are responsible
for verifying compliance with each source's terms before commercial use.
Synthesised audio (`synth_full`) is released under the same terms as the
upstream TTS voices it was produced from.
Provider: actableai



