musamagwaza23/NCHLT_ISIZULU_SPEECH

Name: musamagwaza23/NCHLT_ISIZULU_SPEECH
Creator: musamagwaza23
Published: 2026-04-12 10:02:10
License: CC BY 3.0

Source: Hugging Face | Updated: 2026-04-12 | Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/musamagwaza23/NCHLT_ISIZULU_SPEECH

Description:

---
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: age
    dtype: int64
  - name: gender
    dtype: string
  - name: location
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: duration_seconds
    dtype: float64
  - name: pdp_score
    dtype: float64
  - name: transcript
    dtype: string
  splits:
  - name: train
    num_bytes: 3805425801
    num_examples: 104354
  download_size: 15130710006
  dataset_size: 3805425801
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-3.0
tags:
- ASR
- Speech
- IsiZulu
- SouthAfrica
---

# NCHLT isiZulu Auxiliary Speech Corpus | Msawenkosi Magwaza

## Overview

This dataset is the isiZulu portion of the NCHLT (National Centre for Human Language Technology) Auxiliary Speech Corpus, originally developed by the **CSIR Meraka Institute** and **North-West University (NWU)** in South Africa.

This Hugging Face upload was prepared and contributed by **[Msawenkosi Magwaza](https://www.linkedin.com/in/musamagwaza23)**, making the dataset directly accessible for machine learning workflows without manual download and preprocessing.

## Why This Dataset Matters

isiZulu is the most widely spoken home language in South Africa, with over 12 million speakers. Despite this, it remains severely underrepresented in speech technology research. Most mainstream Automatic Speech Recognition (ASR) systems do not support isiZulu at all.

This dataset exists to change that. It provides the foundation for training and evaluating ASR systems for isiZulu, contributing to language accessibility and digital inclusion for Zulu-speaking communities.

## Additional Files

The repository also includes `nchlt_zul_aux.dict`, a pronunciation dictionary for isiZulu. Each entry maps a word to its phoneme sequence. Example:

```
ababa a b_< a b_< a
ababali a b_< a b_< a l i
```

This file is useful for building language models, training grapheme-to-phoneme (G2P) systems, or any pipeline that requires phonetic representations of isiZulu words.
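To make the dictionary format concrete, here is a minimal parsing sketch. It assumes that each line holds one entry, that tokens are separated by whitespace, and that the first token is the word; these assumptions are inferred from the two example entries above and have not been checked against the full file. The function name `parse_pronunciation_dict` is hypothetical.

```python
from typing import Dict, Iterable, List


def parse_pronunciation_dict(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each word to its phoneme list.

    Assumes one whitespace-separated entry per line: word, then phonemes.
    """
    lexicon: Dict[str, List[str]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            # Skip blank lines and entries with no phonemes.
            continue
        word, phonemes = parts[0], parts[1:]
        # Keep the first pronunciation if a word appears more than once.
        lexicon.setdefault(word, phonemes)
    return lexicon


# The two example entries shown in the dataset card.
sample_lines = [
    "ababa a b_< a b_< a",
    "ababali a b_< a b_< a l i",
]
lexicon = parse_pronunciation_dict(sample_lines)
print(lexicon["ababa"])
```

For the real file, the same function can be fed an open file handle, for example `parse_pronunciation_dict(open("nchlt_zul_aux.dict", encoding="utf-8"))`, assuming the file is UTF-8 encoded.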
---

## How to Load the Dataset

```python
from datasets import load_dataset

# Load the single "train" split from the Hugging Face Hub.
ds = load_dataset("musamagwaza23/NCHLT_ISIZULU_SPEECH", split="train")
print(ds[0])
```

Each row's `audio` field contains:

- `array`: the decoded audio signal as a NumPy float32 array
- `sampling_rate`: 16000
- `path`: the original filename

---

## Original Dataset Source

This dataset was originally published and distributed by SADiLaR (South African Centre for Digital Language Resources) through their resource repository.

Original download: [https://repo.sadilar.org/handle/20.500.12185/275](https://repo.sadilar.org/handle/20.500.12185/275)

---

## Attribution and Credits

Davel, M., Barnard, E., Badenhorst, J., van Heerden, C., de Waal, A. NCHLT isiZulu Speech Corpus. CSIR / North-West University, 2014.

**Reference paper:**

> N.J. de Vries, M.H. Davel, J. Badenhorst, W.D. Basson, F. de Wet, E. Barnard and A. de Waal, "A smartphone-based ASR data collection tool for under-resourced languages", *Speech Communication*, Volume 56, January 2014, pp 119-131.

---

## License

This dataset is distributed under the **Creative Commons Attribution 3.0 Unported (CC BY 3.0)** license. You are free to use, share, and adapt this dataset for any purpose, including commercial use, as long as you give appropriate credit to the original creators listed above.

Full license text: [https://creativecommons.org/licenses/by/3.0/](https://creativecommons.org/licenses/by/3.0/)

---
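Once the dataset is loaded as described under "How to Load the Dataset", the metadata columns can be summarised without touching the audio. The sketch below totals `duration_seconds` per `speaker_id`. The helper name `hours_per_speaker` and the sample rows are hypothetical; the rows only mimic the feature names in the card's schema, and their values are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def hours_per_speaker(rows: Iterable[Mapping]) -> Dict[str, float]:
    """Sum `duration_seconds` per `speaker_id` and convert the totals to hours."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["speaker_id"]] += row["duration_seconds"]
    return {speaker: seconds / 3600.0 for speaker, seconds in totals.items()}


# Hypothetical rows shaped like the dataset's features (values are invented).
sample_rows = [
    {"speaker_id": "spk_001", "duration_seconds": 1800.0},
    {"speaker_id": "spk_001", "duration_seconds": 1800.0},
    {"speaker_id": "spk_002", "duration_seconds": 900.0},
]
speaker_hours = hours_per_speaker(sample_rows)
print(speaker_hours)
```

On the real dataset, passing something like `ds.select_columns(["speaker_id", "duration_seconds"])` (available in recent `datasets` releases) should let the loop run without decoding any audio; this has not been run against the full corpus.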

Provider:

musamagwaza23
