Massive Arabic Speech Corpus (MASC)
Source: DataCite Commons. Updated: 2021-08-18. Indexed: 2025-04-16.
Download link:
https://ieee-dataport.org/open-access/massive-arabic-speech-corpus-masc
Description:
This paper releases and describes the creation of the Massive Arabic Speech Corpus (MASC). The corpus contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. MASC is a multi-regional, multi-genre, and multi-dialect dataset intended to advance the research and development of Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model are also developed and made available to interested researchers. For a better language model, a new and unified Arabic speech corpus is required, and thus a dataset of 12 M unique Arabic words is created and released. To make MASC practical and convenient to use, the whole dataset is stratified based on dialect into clean and noisy portions. Each of the two portions is then stratified and divided into three subsets: development, test, and training sets. The best word error rate achieved by the speech recognition model is 19.8% on the clean development set and 21.8% on the clean test set.
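The word error rates quoted in the description follow the usual speech recognition convention: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that metric, assuming plain whitespace tokenization and no text normalization. The function name `word_error_rate` is illustrative, and this is not the evaluation script used by the MASC authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared verbatim
    (no diacritic stripping or other normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

For corpus-level figures such as those reported above, the total edit count is normally divided by the total number of reference words across all utterances, rather than averaging per-utterance rates.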
Provider:
IEEE DataPort
Created:
2021-08-18
Collected summary
Dataset introduction

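Background and challenges
Background overview
MASC is a large-scale Arabic speech corpus containing 1,000 hours of speech sampled at 16 kHz and crawled from more than 700 YouTube channels. It is multi-regional, multi-genre, and multi-dialect, and is designed specifically to advance research on Arabic speech recognition. The dataset also provides a pre-trained 3-gram language model and a pre-trained automatic speech recognition model, along with a dataset of 12 M unique Arabic words collected from Twitter to strengthen language modeling. The dataset as a whole includes audio, subtitles, and training and evaluation subsets, and is suitable for training and evaluating speech recognition models.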
The summary above was collected and generated by 遇见数据集.



