Massive Arabic Speech Corpus (MASC)

Name: Massive Arabic Speech Corpus (MASC)
Creator: Dawas, Maha; Al-Barham, Muhammad; Alsharkawi, Adham; Abandah, Gheith; Al-Fetyani, Mohammad
Published: 2021-08-18 00:00:00
License: No description available

IEEE; updated 2021-08-18; indexed 2026-04-17

Download link:

https://ieee-dataport.org/open-access/massive-arabic-speech-corpus-masc

Description:

This paper releases and describes the creation of the Massive Arabic Speech Corpus (MASC). The corpus contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. MASC is a multi-regional, multi-genre, and multi-dialect dataset intended to advance the research and development of Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model are also developed and made available to interested researchers. For a better language model, a new and unified Arabic speech corpus is required, and thus a dataset of 12 million unique Arabic words is created and released. To make MASC practical and convenient to use, the whole dataset is stratified by dialect into clean and noisy portions. Each of the two portions is then stratified and divided into three subsets: development, test, and training sets. The best word error rate achieved by the speech recognition model is 19.8% on the clean development set and 21.8% on the clean test set.
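The word error rate (WER) figures quoted above are conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition only; the function name `word_error_rate` is illustrative, and the text normalization and scoring tool actually used by the MASC authors are not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and no further
    normalization (diacritics, punctuation) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))
```

A corpus-level score is usually obtained by summing the edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates.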
