dsfsi-anv/multilingual-nchlt-dataset

Name: dsfsi-anv/multilingual-nchlt-dataset
Creator: dsfsi-anv
Published: 2026-01-06 05:25:21
License: No description available

Source: Hugging Face. Updated 2026-01-06. Indexed 2026-02-07.

Download link:

https://hf-mirror.com/datasets/dsfsi-anv/multilingual-nchlt-dataset

Resource overview:

This is a combined multilingual version of the NCHLT Auxiliary Speech Corpus, compiled by the Data Science for Social Impact (DSFSI) research group at the University of Pretoria to simplify benchmarking and multilingual speech recognition research. The original auxiliary data was collected during the National Centre for Human Language Technology (NCHLT) project for the 11 official languages of South Africa and was made publicly available by SADiLaR in 2019. This combined dataset brings all 11 per-language datasets together in a unified format. It provides between 20 and 170 hours of speech per language with orthographic transcriptions, totaling over 1,420 hours across all languages. The data was originally collected with a smartphone application called Woefzela and includes recordings from approximately 3,400 speakers. Supported languages are Afrikaans, South African English, isiNdebele, isiXhosa, isiZulu, Sepedi, Sesotho, Setswana, Siswati, Tshivenda, and Xitsonga.
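As a rough illustration of how the combined corpus might be opened, the sketch below uses the Hugging Face `datasets` library with the repository ID from this listing. It is a minimal sketch under assumptions: the listing does not state the dataset's configuration names, split names, or column names, so the helper `load_nchlt` and its `config_name` parameter are hypothetical, and the returned object should be inspected before relying on any field. Streaming is used so that the full corpus of over 1,420 hours is not downloaded up front.

```python
# Repository ID as given in this listing.
REPO_ID = "dsfsi-anv/multilingual-nchlt-dataset"

# The 11 languages named in the description above.
LANGUAGES = [
    "Afrikaans", "South African English", "isiNdebele", "isiXhosa",
    "isiZulu", "Sepedi", "Sesotho", "Setswana", "Siswati",
    "Tshivenda", "Xitsonga",
]


def load_nchlt(config_name=None, streaming=True):
    """Open the combined corpus from the Hugging Face Hub.

    Hypothetical helper: `config_name` is an assumption, since the
    listing does not say whether the dataset defines per-language
    configurations. Leave it as None to request the default.
    """
    # Imported lazily so the constants above are usable without the
    # `datasets` package installed.
    from datasets import load_dataset

    # streaming=True iterates over the data instead of downloading it all.
    return load_dataset(REPO_ID, name=config_name, streaming=streaming)


# Example usage (requires network access and `pip install datasets`):
#   corpus = load_nchlt()
#   print(corpus)  # inspect the available splits and features first
```

Printing the returned object before iterating is the safer first step, because the split and feature names are not documented in this listing.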

Provided by:

dsfsi-anv
