
dsfsi-anv/multilingual-nchlt-dataset

Source: Hugging Face. Updated 2026-01-06. Indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/dsfsi-anv/multilingual-nchlt-dataset
Description:
This is a combined multilingual version of the NCHLT Auxiliary Speech Corpus, compiled by the Data Science for Social Impact (DSFSI) research group at the University of Pretoria to simplify benchmarking and multilingual speech recognition research. The original auxiliary data was collected during the National Centre for Human Language Technology (NCHLT) project for the 11 official languages of South Africa and was made publicly available by SADiLaR in 2019. This combined dataset brings all 11 language datasets together in a unified format. It provides between 20 and 170 hours of speech per language with orthographic transcriptions, totaling over 1,420 hours across all languages. The data was originally collected using a smartphone application called Woefzela and includes recordings from approximately 3,400 speakers. Supported languages are Afrikaans, South African English, isiNdebele, isiXhosa, isiZulu, Sepedi, Sesotho, Setswana, Siswati, Tshivenda, and Xitsonga.
Provider:
dsfsi-anv