Acoustic models of Brazilian Portuguese Speech based on Neural Transformers - Pretraining Datasets raw audios from CORAA

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://zenodo.org/record/6794923

Description:

This repository contains all the pretraining datasets used in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" by Marcelo Gauy and Marcelo Finger. The datasets belong to a collection from the TaRSila project (see https://sites.google.com/view/tarsila-c4ai). Part of the audio published here was also released, with annotations and transcriptions, as the CORAA dataset (see https://github.com/nilc-nlp/CORAA). This record publishes the original raw audio, without transcriptions, from the following datasets: ALIP, C-Oral, SP2010, NURC-Recife, NURC-São Paulo and Programa Certas Palavras. In total, the datasets contain about 800 hours of Brazilian Portuguese speech. The audio was converted to MP3 to make uploading easier. ALIP, C-Oral and SP2010 are each contained in a single file. Programa Certas Palavras and NURC-Recife are each split into 3 parts, while NURC-São Paulo (NURC-SP) is split into 7 parts of roughly equal size. More information on the datasets can be found in the paper and in the original publications that introduced these datasets.
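
The file layout described above can be captured in a small Python sketch, which may help when checking that a download is complete. The part counts come from the description. The `missing_parts` helper and its input format are hypothetical and purely illustrative. The actual file names on Zenodo are not given here, so the sketch only counts parts per dataset.

```python
# Number of archive parts per dataset, as stated in the description.
PARTS_PER_DATASET = {
    "ALIP": 1,
    "C-Oral": 1,
    "SP2010": 1,
    "Programa Certas Palavras": 3,
    "NURC-Recife": 3,
    "NURC-São Paulo": 7,
}

# Total number of archive parts across all six datasets.
TOTAL_PARTS = sum(PARTS_PER_DATASET.values())


def missing_parts(found):
    """Hypothetical helper: given {dataset: parts found locally},
    return {dataset: number of parts still missing}."""
    missing = {}
    for name, expected in PARTS_PER_DATASET.items():
        have = found.get(name, 0)
        if have < expected:
            missing[name] = expected - have
    return missing
```

Calling `missing_parts` with the number of parts found locally for each dataset returns only the datasets that are still incomplete.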

Created:

2022-07-12