CORAA (Corpus of Annotated Audios) v1

Name: CORAA (Corpus of Annotated Audios) v1
Creator: Federal University of Technology - Paraná, Brazil
Published: 2021-11-18 19:59:53
License: Not specified

Source: arXiv (updated 2021-11-18; indexed 2024-06-21)

Download link:

https://github.com/nilc-nlp/CORAA

Description:

CORAA (Corpus of Annotated Audios) v1 is a large, publicly available dataset for automatic speech recognition (ASR) in Brazilian Portuguese, created jointly by several research institutions including the Federal University of Technology - Paraná, Brazil. It contains 290.77 hours of validated audio-transcription pairs covering both spontaneous and prepared speech, and is intended to address the shortage of spontaneous speech data in existing resources. CORAA v1 consists of five sub-corpora: ALIP, C-ORAL Brasil I, NURC-Recife, SP2010, and TEDx talks in Portuguese. These come from different academic projects and from TEDx events. The dataset was built through text normalization, forced alignment of audio with transcriptions, and manual validation to ensure data quality. Its applications include improving ASR performance on spontaneous speech and in noisy conditions, and supporting research and development of Portuguese ASR more broadly.
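
As an illustration of the text normalization step mentioned above, the following is a minimal sketch of how a transcript might be normalized before alignment or ASR training. It is not the CORAA authors' pipeline: the function name `normalize_transcript` and the specific rules (lowercasing, dropping punctuation, keeping intra-word hyphens and Portuguese diacritics) are assumptions chosen for illustration.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Hypothetical normalizer: lowercase, drop punctuation, collapse spaces.

    Diacritics (e.g. "á", "ç") and hyphens inside words are kept,
    since they are part of Portuguese spelling.
    """
    # Use a single canonical Unicode form so accented letters compare equal.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace everything that is not a word character, whitespace, or hyphen.
    text = re.sub(r"[^\w\s-]", " ", text)
    # Drop hyphens that are not between two word characters.
    text = re.sub(r"(?<!\w)-|-(?!\w)", " ", text)
    # "\w" also matches underscores, which are not wanted in transcripts.
    text = text.replace("_", " ")
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_transcript("Olá,  tudo bem?"))
    print(normalize_transcript("Guarda-chuva - sim!"))
```

A real pipeline would likely also expand numbers and abbreviations into words; those rules depend on the corpus conventions and are left out here.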

Provider:

Federal University of Technology - Paraná, Brazil

Created:

2021-10-14