musamagwaza23/NCHLT_ISIZULU_SPEECH
Source: Hugging Face (updated 2026-04-12, indexed 2026-04-26)
Download link:
https://hf-mirror.com/datasets/musamagwaza23/NCHLT_ISIZULU_SPEECH
---
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: age
    dtype: int64
  - name: gender
    dtype: string
  - name: location
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: duration_seconds
    dtype: float64
  - name: pdp_score
    dtype: float64
  - name: transcript
    dtype: string
  splits:
  - name: train
    num_bytes: 3805425801
    num_examples: 104354
  download_size: 15130710006
  dataset_size: 3805425801
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-3.0
tags:
- ASR
- Speech
- IsiZulu
- SouthAfrica
---
# NCHLT isiZulu Auxiliary Speech Corpus | Msawenkosi Magwaza
## Overview
This dataset is the isiZulu portion of the NCHLT (National Centre for Human Language Technology) Auxiliary Speech Corpus, originally developed by the **CSIR Meraka Institute** and **North-West University (NWU)** in South Africa.
This Hugging Face upload was prepared and contributed by **[Msawenkosi Magwaza](https://www.linkedin.com/in/musamagwaza23)**, making the dataset directly accessible for machine learning workflows without manual download and preprocessing.
## Why This Dataset Matters
isiZulu is the most widely spoken home language in South Africa, with over 12 million speakers. Despite this, it remains severely underrepresented in speech technology research. Most mainstream Automatic Speech Recognition (ASR) systems do not support isiZulu at all.
This dataset exists to change that. It provides the foundation for training and evaluating ASR systems for isiZulu, contributing to language accessibility and digital inclusion for Zulu-speaking communities.
## Additional Files
The repository also includes `nchlt_zul_aux.dict`, a pronunciation dictionary for isiZulu. Each entry maps a word to its phoneme sequence.
Example:
```
ababa a b_< a b_< a
ababali a b_< a b_< a l i
```
This file is useful for building language models, training grapheme-to-phoneme (G2P) systems, or any pipeline that requires phonetic representations of isiZulu words.
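The entry format shown above can be read with a few lines of Python. This is a minimal sketch, not an official loader: it assumes that each line holds a word followed by whitespace-separated phoneme symbols, and that a word may appear on more than one line with alternative pronunciations. The function name `parse_pronunciation_dict` is hypothetical. To read the real file, pass an open file handle instead of the sample list, for example `open("nchlt_zul_aux.dict", encoding="utf-8")`, where the encoding is also an assumption.

```python
from collections import defaultdict


def parse_pronunciation_dict(lines):
    """Map each word to a list of pronunciations (each a list of phoneme symbols)."""
    lexicon = defaultdict(list)
    for line in lines:
        # Assumption: the word and its phonemes are separated by whitespace.
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and entries without phonemes
        word, phonemes = fields[0], fields[1:]
        lexicon[word].append(phonemes)
    return dict(lexicon)


# The two example entries from this README.
sample = [
    "ababa a b_< a b_< a",
    "ababali a b_< a b_< a l i",
]
lexicon = parse_pronunciation_dict(sample)
print(lexicon["ababali"])  # [['a', 'b_<', 'a', 'b_<', 'a', 'l', 'i']]
```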
---
## How to Load the Dataset
```python
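from datasets import load_dataset

# Repository ID as listed on the Hugging Face Hub; the download is roughly 15 GB.
ds = load_dataset("musamagwaza23/NCHLT_ISIZULU_SPEECH", split="train")
print(ds[0])  # first example: metadata fields, audio, and transcript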
```
Each row's `audio` field contains:
- `array`: the decoded audio signal as a NumPy float32 array
- `sampling_rate`: 16000
- `path`: the original filename
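As a quick sanity check, the `duration_seconds` column can be compared against the duration implied by the decoded audio (number of samples divided by the sampling rate). The sketch below assumes that a row's `audio` field behaves like the dictionary described above and that the signal is mono; depending on the installed `datasets` version, the decoded object may differ. The helper names and the `tolerance` parameter are hypothetical, and the row used here is synthetic rather than taken from the dataset.

```python
def audio_duration_seconds(audio):
    """Duration implied by the decoded samples: number of samples / sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


def duration_matches(row, tolerance=0.01):
    """True if `duration_seconds` agrees with the decoded audio within `tolerance` seconds."""
    implied = audio_duration_seconds(row["audio"])
    return abs(implied - row["duration_seconds"]) <= tolerance


# Synthetic stand-in for a dataset row: 2 seconds of silence at 16 kHz.
fake_row = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000, "path": "example.wav"},
    "duration_seconds": 2.0,
}
print(audio_duration_seconds(fake_row["audio"]))  # 2.0
print(duration_matches(fake_row))  # True
```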
---
## Original Dataset Source
This dataset was originally published and distributed by SADiLaR (South African Centre for Digital Language Resources) through their resource repository.
Original download: [https://repo.sadilar.org/handle/20.500.12185/275](https://repo.sadilar.org/handle/20.500.12185/275)
---
## Attribution and Credits
Davel, M., Barnard, E., Badenhorst, J., van Heerden, C., de Waal, A.
NCHLT isiZulu Speech Corpus.
CSIR / North-West University, 2014.
**Reference paper:**
> N.J. de Vries, M.H. Davel, J. Badenhorst, W.D. Basson, F. de Wet, E. Barnard and A. de Waal, "A smartphone-based ASR data collection tool for under-resourced languages", *Speech Communication*, Volume 56, January 2014, pp. 119-131.
---
## License
This dataset is distributed under the **Creative Commons Attribution 3.0 Unported (CC BY 3.0)** license.
You are free to use, share, and adapt this dataset for any purpose, including commercial use, as long as you give appropriate credit to the original creators listed above.
Full license text: [https://creativecommons.org/licenses/by/3.0/](https://creativecommons.org/licenses/by/3.0/)
---
Provider:
musamagwaza23



