Mozilla Common Voice (MCV)

Name: Mozilla Common Voice (MCV)
Creator: American University of Armenia
Published: 2024-06-03 23:38:40
License: No description available

Source: arXiv | Updated: 2024-06-03 | Indexed: 2024-08-06

Download link:

http://arxiv.org/abs/2406.01446v1

Resource overview:

This study focuses on building automatic speech recognition (ASR) training datasets for low-resource languages, using Armenian audiobooks as a case study. The Mozilla Common Voice (MCV) dataset contains more than 23 hours of validated audio samples spanning multiple languages and dialects, which makes it well suited to ASR model training. Because the audio is collected through crowdsourcing, it is diverse and representative. The researchers applied their own data processing and segmentation methods to fit the requirements of ASR training. The dataset is intended to improve the performance of ASR systems for low-resource languages by supplying training data that strengthens a model's recognition capability and adaptability.

Provider:

American University of Armenia

Created:

2024-06-03

Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

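Mozilla Common Voice (MCV) is a dataset for training automatic speech recognition (ASR) systems on low-resource languages, with Armenian as the example. It contains more than 23 hours of validated audio samples covering multiple languages and dialects. The audio is collected through crowdsourcing, which keeps it diverse and representative, and it is processed and segmented with methods tailored to ASR training. The goal is to address the weak performance of ASR systems for low-resource languages by providing richer data that improves a model's recognition capability and adaptability.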

The content above was collected and summarized by 遇见数据集.
