hirundo-io/MASC
Source: Hugging Face. Updated: 2025-06-17. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/hirundo-io/MASC
Description:
MASC is a corpus of 1,000 hours of Arabic speech sampled at 16 kHz, crawled from more than 700 YouTube channels. It is multi-regional, multi-genre, and multi-dialect, and it is intended to advance research and development in Arabic speech technology, with a focus on Arabic speech recognition. Alongside MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model were developed and released for interested researchers. To strengthen the language model, a dataset of 12 million unique Arabic words, originally crawled from Twitter, was also created and released.
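To give a sense of scale, the figures in the description (1,000 hours at 16 kHz) can be turned into a rough sample count and an uncompressed size. This is a back-of-the-envelope sketch only: the 16-bit mono PCM encoding is an assumption and is not stated in the listing, and the function names are hypothetical. The files as actually distributed may be compressed or stored differently.

```python
# Figures taken from the dataset description.
TOTAL_HOURS = 1_000
SAMPLE_RATE_HZ = 16_000


def total_samples(hours=TOTAL_HOURS, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples in `hours` of audio at `sample_rate_hz`."""
    return hours * 3600 * sample_rate_hz


def raw_pcm_bytes(hours=TOTAL_HOURS, sample_rate_hz=SAMPLE_RATE_HZ,
                  bytes_per_sample=2, channels=1):
    """Uncompressed size in bytes.

    ASSUMPTION: 16-bit (2-byte) mono PCM. The listing does not state
    the bit depth or channel count.
    """
    return total_samples(hours, sample_rate_hz) * bytes_per_sample * channels


samples = total_samples()
size_gb = raw_pcm_bytes() / 1e9  # decimal gigabytes
print(f"{samples:,} samples, about {size_gb:.1f} GB as 16-bit mono PCM")
```

Under these assumptions the corpus comes to roughly 57.6 billion samples, or about 115 GB of raw audio, which is worth checking against available disk space before downloading.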
Provider:
hirundo-io



