MASC: Massive Arabic Speech Corpus
Source: ieee-dataport.org. Indexed: 2025-01-22
Download link:
https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus
Resource description:
This paper describes the creation of the Massive Arabic Speech Corpus (MASC), a dataset of 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect, and is intended to advance research and development in Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model are also developed and made available to interested researchers. To enhance the language model, a new and inclusive Arabic speech corpus is required; thus, a dataset of 12 million unique Arabic words, originally crawled from Twitter, is also created and released.
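For readers unfamiliar with the term, a 3-gram (trigram) language model estimates the probability of a word given the two words before it. The following is a minimal sketch of that idea using raw counts and maximum-likelihood estimates. It is not the released MASC model or its training procedure; the function names, the toy English corpus, and the `<s>` / `</s>` boundary tokens are all illustrative assumptions, and a practical model would also apply smoothing.

```python
from collections import Counter, defaultdict


def train_trigram_counts(sentences):
    """Count trigrams over whitespace-tokenized sentences.

    Each sentence is padded with two start tokens and one end token
    (hypothetical markers chosen for this sketch).
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def trigram_prob(counts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2).

    Returns 0.0 when the context (w1, w2) was never observed;
    no smoothing is applied in this sketch.
    """
    context = counts.get((w1, w2))
    if not context:
        return 0.0
    return context[w3] / sum(context.values())


# Toy corpus, purely illustrative (not drawn from MASC or the Twitter data).
toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_trigram_counts(toy_corpus)
```

With this toy corpus, `trigram_prob(model, "the", "cat", "sat")` divides the one occurrence of "the cat sat" by the two occurrences of the context "the cat". A larger text corpus, such as the released word dataset, gives such estimates broader vocabulary coverage.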
Provided by:
IEEE Dataport



