Massive Arabic Speech Corpus (MASC)
Source: OpenDataLab | Updated 2026-05-24 | Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/Massive_Arabic_Speech_Corpus_MASC
Resource overview:
This paper introduces the creation of the Massive Arabic Speech Corpus (MASC). MASC is a dataset of 1,000 hours of speech sampled at 16 kHz and scraped from more than 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialectal, and is designed to advance research and development of Arabic speech technologies, with particular emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model have been developed and made available to interested researchers. Because enhancing the language model requires a new and inclusive Arabic speech corpus, a dataset of 12 million unique Arabic words, originally scraped from Twitter, has also been created and released.
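For readers unfamiliar with the term, a 3-gram language model estimates the probability of a word given the two words before it. The following is a minimal sketch of that idea using plain counts and add-alpha smoothing. It is not the pre-trained model released with MASC, whose toolkit and smoothing method are not described here; the function names, the toy corpus, and the placeholder tokens are all assumptions made for illustration.

```python
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    trigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        # Pad so the first real word has a two-token context,
        # and so the model can predict the end of a sentence.
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts


def trigram_prob(word, context, trigram_counts, context_counts, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | context)."""
    context = tuple(context)
    numerator = trigram_counts[context + (word,)] + alpha
    denominator = context_counts[context] + alpha * vocab_size
    return numerator / denominator


# Hypothetical toy corpus; the tokens stand in for real Arabic words.
corpus = [["w1", "w2", "w3"], ["w1", "w2", "w4"], ["w1", "w2", "w3"]]
vocab = {w for sentence in corpus for w in sentence} | {"</s>"}

tri, ctx = train_trigram_counts(corpus)
p_w3 = trigram_prob("w3", ("w1", "w2"), tri, ctx, len(vocab))
p_w4 = trigram_prob("w4", ("w1", "w2"), tri, ctx, len(vocab))
total = sum(trigram_prob(w, ("w1", "w2"), tri, ctx, len(vocab)) for w in vocab)
print(p_w3, p_w4, total)
```

In this toy corpus "w3" follows the context ("w1", "w2") twice and "w4" once, so the smoothed estimate for "w3" is higher. Smoothing keeps unseen continuations from receiving zero probability, which matters for a large and dialectally varied vocabulary.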
Provider:
OpenDataLab
Created:
2023-10-20
Collected summary
Dataset introduction

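Background and challenges
Background overview
The Massive Arabic Speech Corpus (MASC) is a large-scale Arabic speech dataset containing 1,000 hours of speech sampled at 16 kHz and scraped from more than 700 YouTube channels. It is multi-regional, multi-genre, and multi-dialectal, and is intended to support research and development of speech technologies such as Arabic speech recognition. The dataset is accompanied by a pre-trained 3-gram language model and an automatic speech recognition model. A separate text dataset of 12 million unique Arabic words scraped from Twitter is also provided to improve language model training.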
The content above was collected and summarized by 遇见数据集.



