Bengali Common Voice Speech Dataset

Name: Bengali Common Voice Speech Dataset
Creator: Bengali.AI
Published: 2022-06-29 23:34:23
License: No description available

Source: arXiv (updated 2022-06-29; indexed 2024-08-06)

Download link:

http://arxiv.org/abs/2206.14053v2


Resource overview:

The Bengali Common Voice Speech Dataset is a large-scale automatic speech recognition (ASR) dataset created by Bengali.AI. It contains over 400 hours of transcribed audio recorded by communities in Bangladesh and India. The data were crowdsourced to address the shortage of diverse open-source datasets, which has held back the development of Bengali speech recognition systems. The dataset comprises 231,120 samples, each accompanied by a sentence annotation and additional metadata such as age, gender, and accent. Recording prompts were sentences crawled at random from Wikipedia, and the recordings were collected and validated through the Mozilla Common Voice platform. The dataset is intended primarily for ASR, with the goal of improving the performance and diversity of Bengali speech recognition systems.
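As a rough illustration of how the per-sample metadata described above might be inspected, the following minimal sketch tallies the age, gender, and accent fields of a tab-separated metadata file. The column names (path, sentence, age, gender, accent) and the TSV layout are assumptions modelled on typical Common Voice exports and are not confirmed by this page; the sample rows are made up. The helper average_clip_seconds is likewise hypothetical and only turns the stated totals into a per-sample average.

```python
import csv
import io
from collections import Counter


def summarize_metadata(tsv_file):
    """Count rows and tally the age, gender, and accent columns of a TSV file."""
    # QUOTE_NONE keeps quotation marks inside sentences from being parsed as CSV quotes.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    counts = {"age": Counter(), "gender": Counter(), "accent": Counter()}
    for row in reader:
        total += 1
        for field in counts:
            # Missing or empty values are grouped under "unspecified".
            value = (row.get(field) or "").strip()
            counts[field][value or "unspecified"] += 1
    return total, counts


def average_clip_seconds(total_hours, num_samples):
    """Average duration per sample in seconds, given total hours of audio."""
    return total_hours * 3600 / num_samples


# Made-up rows for illustration only; the column names are an assumption.
SAMPLE = (
    "path\tsentence\tage\tgender\taccent\n"
    "clip_0001.mp3\tprompt one\ttwenties\tfemale\t\n"
    "clip_0002.mp3\tprompt two\t\tmale\t\n"
    "clip_0003.mp3\tprompt three\ttwenties\t\t\n"
)

total, counts = summarize_metadata(io.StringIO(SAMPLE))
print(total, dict(counts["age"]), dict(counts["gender"]))

# Figures from the description: over 400 hours across 231,120 samples.
avg_seconds = average_clip_seconds(400, 231_120)
print(f"average of at least {avg_seconds:.2f} s per sample")
```

Because the description states "over 400 hours", the computed average (about 6.2 seconds per sample) is a lower bound, not an exact figure. For a real metadata file, io.StringIO(SAMPLE) would be replaced with a handle from open(path, encoding="utf-8", newline="").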

Provider:

Bengali.AI

Created:

2022-06-28
