Huqariq
Source: arXiv. Updated 2022-07-12. Indexed 2024-08-06.
Download link:
http://arxiv.org/abs/2207.05498v1
Resource description:
Huqariq is a multilingual speech dataset of Peruvian indigenous languages, built to help preserve endangered languages through technology. More than 500 volunteers contributed recordings, and the dataset contains 220 hours of transcribed audio, making it the largest speech dataset for Peruvian indigenous languages to date. It is intended primarily for developing automatic speech recognition, language identification, and text-to-speech tools. The data was collected through crowdsourcing, and the dataset was projected to cover 20 of Peru's 48 indigenous languages by the end of 2022. Its main applications are language technology research and language preservation, particularly the technical and preservation challenges that low-resource languages face.
Provided by:
Pontifical Catholic University
Created:
2022-07-12



