Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

Name: Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi
Creator: Ambedkar University
Published: 2022-06-27 01:28:38
License: No description provided

arXiv; updated 2022-06-27; indexed 2024-08-06

Download link:

http://arxiv.org/abs/2206.12931v1

Resource overview:

This dataset, *Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi*, was created by Ambedkar University and other institutions to provide speech data for four low-resource Indian languages: Awadhi, Bhojpuri, Braj, and Magahi. It totals approximately 18 hours of audio across about 8,154 records of everyday speech, which have been transcribed and grammatically annotated. The data was collected remotely, notably during the COVID-19 pandemic, which also provided additional income for low-income groups. The corpus is intended primarily for developing automatic speech recognition (ASR) systems, supporting speech technology for these languages and addressing their underrepresentation in technology applications.

Provider:

Ambedkar University

Created:

2022-06-27