djsamseng/openslr-khmer-tts-asr
Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/djsamseng/openslr-khmer-tts-asr
Description:
---
license: cc-by-sa-4.0
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- km
tags:
- audio
- tts
- khmer
- asr
pretty_name: OpenSLR Khmer TTS ASR
size_categories:
- 1K<n<10K
---
# OpenSLR - Khmer TTS / ASR Dataset
```python
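import datasets

# Download the dataset from the Hugging Face Hub (cached after the first call)
ds = datasets.load_dataset("djsamseng/openslr-khmer-tts-asr")
# Inspect a single record from the train split
ds["train"][1018]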
```
```
{
'audio': {
'array': array([3.05175781e-05, 3.05175781e-05, 3.05175781e-05, ...,
6.10351562e-05, 0.00000000e+00, 0.00000000e+00]),
'sampling_rate': 48000
},
'english': 'Current news in the country',
'khmer': 'ព័ត៌មាន ទាន់ ហេតុការណ៍ ក្នុង ប្រទេស',
'transliteration': 'poatemean toan hetokar knong protes',
'speaker': '3154',
'filename': 'khm_3154_0774534051.flac'
}
```
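The length of a clip follows directly from the `audio` fields in the record above: the number of samples divided by `sampling_rate`. The following is a minimal sketch, assuming the record decodes to the dict layout printed above; `clip_duration_seconds` is a hypothetical helper, not part of the dataset or the `datasets` library, and the sample record is a stand-in rather than real dataset audio.

```python
def clip_duration_seconds(audio: dict) -> float:
    """Return the clip length in seconds from a decoded audio record.

    Assumes `audio` has the 'array' and 'sampling_rate' keys shown in the
    example record above.
    """
    return len(audio["array"]) / audio["sampling_rate"]


# Stand-in record: one second of silence at 48 kHz (not real dataset audio).
fake_audio = {"array": [0.0] * 48000, "sampling_rate": 48000}
print(clip_duration_seconds(fake_audio))  # 1.0
```

With a loaded dataset, the same helper could be applied as `clip_duration_seconds(ds["train"][1018]["audio"])`.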
## Data
| Field | Description |
| ----------------- | ---------------------------------------------------- |
| `audio` | Audio data (FLAC, 48 kHz) |
| `khmer` | Original Khmer text |
| `english` | English translation generated by an LLM (early 2026) |
| `transliteration` | Generated romanization of the Khmer text |
| `speaker` | Speaker identifier prefix |
| `filename` | Audio filename |
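The `speaker` value also appears inside `filename` (speaker `3154` in `khm_3154_0774534051.flac`). The sketch below splits a filename into its parts. The pattern `<language>_<speaker>_<utterance id>.flac` is an assumption inferred from that single example, and `parse_filename` is a hypothetical helper, so it may need adjusting if other files are named differently.

```python
import re

# Assumed pattern, inferred from the single example filename
# 'khm_3154_0774534051.flac': <language>_<speaker>_<utterance id>.flac
FILENAME_RE = re.compile(
    r"^(?P<lang>[a-z]+)_(?P<speaker>\d+)_(?P<utterance>\d+)\.flac$"
)


def parse_filename(filename: str) -> dict:
    """Split a filename into its assumed parts; raise ValueError on mismatch."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename format: {filename!r}")
    return match.groupdict()


parts = parse_filename("khm_3154_0774534051.flac")
print(parts["speaker"])  # 3154
```

Comparing `parts["speaker"]` with a record's `speaker` field is one way to sanity-check that the two stay consistent.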
- 16 speakers (all female), 2906 audio files, 3 h 58 min total duration
- Audio collection process
  - Volunteers (20-35 years old) read short sentences.
  - Each sentence contains 5-20 words.
  - Sentences were extracted from Wikipedia or general websites, or were declarative sentences created by native speakers.
  - The recordings were conducted in quiet environments: either a sound studio or a quiet room with a soundproof booth.
  - All audio files have passed through a QC process to ensure good audio quality, absence of background noise, and a match between the recorded audio and the text transcript.
  - [source](https://www.isca-archive.org/sltu_2018/sodimana18_sltu.pdf)
- Audio and Khmer text are from [OpenSLR](https://www.openslr.org/42/) under a CC BY-SA 4.0 license.
- English translations were generated by an LLM (early 2026).
- Transliterations were generated via [khnlp](https://github.com/IDRI-LAB/Khmer-NLP-Tools).
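As a rough consistency check on the summary figures listed above, the total duration and file count imply an average clip length of about five seconds. This is only arithmetic on the stated numbers, assuming the total is exactly 3:58:00; `hms_to_seconds` is a hypothetical helper and nothing here is measured from the audio itself.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = hms_to_seconds("3:58:00")  # 14280 seconds
num_files = 2906
num_speakers = 16

# Averages implied by the stated totals
mean_clip_seconds = total_seconds / num_files
mean_clips_per_speaker = num_files / num_speakers

print(f"{mean_clip_seconds:.2f} s per clip")  # 4.91 s per clip
print(f"{mean_clips_per_speaker:.1f} clips per speaker")  # 181.6 clips per speaker
```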
```
@inproceedings{kjartansson-etal-tts-sltu2018,
title = {{A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese}},
author = {Keshan Sodimana and Knot Pipatsrisawat and Linne Ha and Martin Jansche and Oddur Kjartansson and Pasindu De Silva and Supheakmungkol Sarin},
booktitle = {Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU)},
year = {2018},
address = {Gurugram, India},
month = aug,
pages = {66--70},
URL = {http://dx.doi.org/10.21437/SLTU.2018-14}
}
```
Provided by:
djsamseng



