Racoci/CORAA-v1.1

Name: Racoci/CORAA-v1.1
Creator: Racoci
Published: 2024-06-01 04:23:00
License: no description provided

Hugging Face · Updated 2024-06-01 · Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/Racoci/CORAA-v1.1

Resource description:

---
license: cc-by-nc-nd-4.0
dataset_info:
  features:
  - name: file_path
    dtype: string
  - name: task
    dtype: string
  - name: variety
    dtype: string
  - name: dataset
    dtype: string
  - name: accent
    dtype: string
  - name: speech_genre
    dtype: string
  - name: speech_style
    dtype: string
  - name: up_votes
    dtype: int64
  - name: down_votes
    dtype: int64
  - name: votes_for_hesitation
    dtype: float64
  - name: votes_for_filled_pause
    dtype: float64
  - name: votes_for_noise_or_low_voice
    dtype: float64
  - name: votes_for_second_voice
    dtype: float64
  - name: votes_for_no_identified_problem
    dtype: float64
  - name: text
    dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 63113404687.162
    num_examples: 382258
  - name: dev
    num_bytes: 1363924625
    num_examples: 7522
  - name: test
    num_bytes: 2594334946
    num_examples: 12676
  download_size: 66914186143
  dataset_size: 67071664258.162
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: dev
    path: data/dev-*
  - split: test
    path: data/test-*
task_categories:
- automatic-speech-recognition
- text-to-speech
language:
- pt
pretty_name: coraa
size_categories:
- 1K<n<10K
---

# CORAA-v1.1

[CORAA-v1.1](https://github.com/nilc-nlp/CORAA) is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese. It contains 290.77 hours of audio with the corresponding transcriptions (400k+ segmented audio clips). The dataset is composed of audio from 5 original projects:

- ALIP (Gonçalves, 2019)
- C-ORAL Brazil (Raso and Mello, 2012)
- NURC-Recife (Oliveira Jr., 2016)
- SP-2010 (Mendes and Oushiro, 2012)
- TEDx talks (talks in Portuguese)

The audio clips were either validated by annotators or transcribed for the first time with the ASR task in mind.

## LICENSE

[Attribution-NonCommercial-NoDerivatives 4.0 International](https://raw.githubusercontent.com/nilc-nlp/CORAA/main/LICENSE)

## Metadata

- file_path: the path to an audio file
- task: transcription (annotators revised original transcriptions); annotation (annotators classified the audio-transcription pair according to votes_for_* metrics); annotation_and_transcription (both tasks were performed)
- variety: European Portuguese (PT_PT) or Brazilian Portuguese (PT_BR)
- dataset: one of five datasets (ALIP, C-oral Brasil, NURC-RE, SP2010, TEDx Portuguese)
- accent: one of four accents (Minas Gerais, Recife, Sao Paulo cities, Sao Paulo capital) or the value "miscellaneous"
- speech_genre: Interviews, Dialogues, Monologues, Conversations, Conference, Class Talks, Stage Talks or Reading
- speech_style: Spontaneous Speech, Prepared Speech or Read Speech
- up_votes: for annotation, the number of votes to validate the audio (most audios were reviewed by one annotator, but some were analyzed by more than one)
- down_votes: for annotation, the number of votes to invalidate the audio (always smaller than up_votes)
- votes_for_hesitation: for annotation, votes categorizing the audio as having the hesitation phenomenon
- votes_for_filled_pause: for annotation, votes categorizing the audio as having the filled pause phenomenon
- votes_for_noise_or_low_voice: for annotation, votes categorizing the audio as having either noise or low voice, without impairing the audio comprehension
- votes_for_second_voice: for annotation, votes categorizing the audio as having a second voice, without impairing the audio comprehension
- votes_for_no_identified_problem: for annotation, votes categorizing the audio as having no identified phenomenon (of the four described above)
- text: the transcription for the audio

## Experiments:

- [Checkpoints ](https://drive.google.com/drive/folders/10JkbCzYypZtCz1nHY5rBoBM1r66P3p3j?usp=sharing)
- [Code](https://github.com/Edresson/Wav2Vec-Wrapper)

Model trained on this corpus: Wav2Vec 2.0 XLSR-53 (multilingual pretraining)

## Citation

- [Preprint](https://arxiv.org/abs/2110.15731):

```
@misc{c2021coraa,
  title={CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese},
  author={Arnaldo Candido Junior and Edresson Casanova and Anderson Soares and Frederico Santos de Oliveira and Lucas Oliveira and Ricardo Corso Fernandes Junior and Daniel Peixoto Pinto da Silva and Fernando Gorgulho Fayet and Bruno Baldissera Carlotto and Lucas Rafael Stefanel Gris and Sandra Maria Aluísio},
  year={2021},
  eprint={2110.15731},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

- Full Paper: coming soon
- Official site: [Tarsila Project](https://sites.google.com/view/tarsila-c4ai/)

## Partners / Sponsors / Funding

- [C4AI](https://c4ai.inova.usp.br/pt/home-2/)
- [CEIA](https://centrodeia.org/)
- [UFG](https://www.ufg.br/)
- [USP](https://www5.usp.br/)
- [UTFPR](http://www.utfpr.edu.br/)

## References

- Gonçalves SCL (2019) Projeto ALIP (amostra linguística do interior paulista) e banco de dados iboruna: 10 anos de contribuição com a descrição do Português Brasileiro. Revista Estudos Linguísticos 48(1):276–297.
- Raso T, Mello H, Mittmann MM (2012) The C-ORAL-BRASIL I: Reference corpus for spoken Brazilian Portuguese. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey, pp 106–113, URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/624_Paper.pdf
- Oliveira Jr M (2016) Nurc digital um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (NURC). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Linguísticos 3(2):149–174, URL https://revistas.uam.es/chimera/article/view/6519
- Mendes RB, Oushiro L (2012) Mapping Paulistano Portuguese: the SP2010 Project. In: Proceedings of the VIIth GSCP International Conference: Speech and Corpora, Firenze University Press, Firenze, Italy, pp 459–463.
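
As a usage sketch for the dataset card above: the repository id `Racoci/CORAA-v1.1` and the split names `train`, `dev` and `test` come from the card, while the helper name `load_coraa_split` is hypothetical and the use of the Hugging Face `datasets` library with streaming is an assumption about how a reader might access the data, not something the card prescribes. Streaming is chosen here only because the card reports a download size of roughly 67 GB.

```python
SPLITS = ("train", "dev", "test")  # split names listed in the dataset card


def load_coraa_split(split: str = "dev", streaming: bool = True):
    """Return one CORAA-v1.1 split (hypothetical helper, not part of the dataset)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the helper can be defined without the library installed.
    from datasets import load_dataset

    # With streaming=True the examples are fetched on demand instead of
    # downloading every shard up front.
    return load_dataset("Racoci/CORAA-v1.1", split=split, streaming=streaming)
```

Iterating over the returned object should yield one dictionary per audio segment, with the fields described under Metadata (`text`, `accent`, `audio`, and so on). This has not been verified against the hosted files.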

Provided by:

Racoci

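Summary of original information

Dataset overview

Basic information

Dataset name: CORAA-v1.1
Dataset size: about 290.77 hours of audio, more than 400,000 segmented audio clips
Language: Brazilian Portuguese (PT_BR)
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC-BY-NC-ND-4.0)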

Dataset composition

Source projects:
- ALIP (Gonçalves, 2019)
- C-ORAL Brazil (Raso and Mello, 2012)
- NURC-Recife (Oliveira Jr., 2016)
- SP-2010 (Mendes and Oushiro, 2012)
- TEDx talks (in Portuguese)

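Dataset features

file_path: path to the audio file
task: transcription (annotators revised the original transcriptions); annotation (annotators classified the audio-transcription pair according to the votes_for_* metrics); annotation_and_transcription (both tasks were performed)
variety: European Portuguese (PT_PT) or Brazilian Portuguese (PT_BR)
dataset: one of five datasets (ALIP, C-oral Brasil, NURC-RE, SP2010, TEDx Portuguese)
accent: one of four accents (Minas Gerais, Recife, Sao Paulo cities, Sao Paulo capital) or "miscellaneous"
speech_genre: Interviews, Dialogues, Monologues, Conversations, Conference, Class Talks, Stage Talks or Reading
speech_style: Spontaneous Speech, Prepared Speech or Read Speech
up_votes: for annotation, the number of votes validating the audio (most audios were reviewed by one annotator, but some were analyzed by more than one)
down_votes: for annotation, the number of votes invalidating the audio (always smaller than up_votes)
votes_for_hesitation: for annotation, votes categorizing the audio as containing hesitation
votes_for_filled_pause: for annotation, votes categorizing the audio as containing a filled pause
votes_for_noise_or_low_voice: for annotation, votes categorizing the audio as having noise or low voice, without impairing the audio comprehension
votes_for_second_voice: for annotation, votes categorizing the audio as having a second voice, without impairing the audio comprehension
votes_for_no_identified_problem: for annotation, votes categorizing the audio as showing none of the four phenomena above
text: the transcription of the audio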
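
The vote fields above lend themselves to a simple quality filter. The sketch below is one possible reading, not a procedure described by the dataset authors: it keeps a row when `up_votes` exceeds `down_votes` and none of the four problem columns received a vote. The helper names (`is_clean`, `keep_clean`) are hypothetical, and treating a missing or NaN vote count as zero is an assumption made here because the `votes_for_*` columns are typed `float64`.

```python
import math

# The four phenomena that annotators could flag, as named in the feature list.
PROBLEM_COLUMNS = (
    "votes_for_hesitation",
    "votes_for_filled_pause",
    "votes_for_noise_or_low_voice",
    "votes_for_second_voice",
)


def _votes(row, column):
    """Read a vote count, treating missing or NaN values as 0 (assumption)."""
    value = row.get(column)
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0.0
    return float(value)


def is_clean(row):
    """True if the row was validated and no problem column received a vote."""
    # The feature list says down_votes is always smaller than up_votes,
    # so this comparison is only a defensive check.
    if _votes(row, "up_votes") <= _votes(row, "down_votes"):
        return False
    return all(_votes(row, column) == 0 for column in PROBLEM_COLUMNS)


def keep_clean(rows):
    """Filter an iterable of row dictionaries down to the clean ones."""
    return [row for row in rows if is_clean(row)]
```

Rows with `task` equal to transcription may carry no annotation votes at all, so a stricter or looser rule may suit a given experiment better.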

Dataset splits

train: 382,258 examples, 63,113,404,687.162 bytes
dev: 7,522 examples, 1,363,924,625 bytes
test: 12,676 examples, 2,594,334,946 bytes
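
As a quick consistency check on the split figures, the sketch below sums the per-split counts and byte sizes and compares them with the `dataset_size` value from the dataset card's `dataset_info` block. The numbers are copied from that block. The function name `summarize` and the dictionary layout are illustrative choices only.

```python
# Figures copied from the dataset_info block of the dataset card.
SPLIT_INFO = {
    "train": {"num_examples": 382258, "num_bytes": 63113404687.162},
    "dev": {"num_examples": 7522, "num_bytes": 1363924625},
    "test": {"num_examples": 12676, "num_bytes": 2594334946},
}
DATASET_SIZE = 67071664258.162  # dataset_size reported by the card


def summarize(split_info):
    """Return total examples, total bytes, and each split's share of examples."""
    total_examples = sum(info["num_examples"] for info in split_info.values())
    total_bytes = sum(info["num_bytes"] for info in split_info.values())
    shares = {
        name: info["num_examples"] / total_examples
        for name, info in split_info.items()
    }
    return total_examples, total_bytes, shares


total_examples, total_bytes, shares = summarize(SPLIT_INFO)
print(f"examples: {total_examples}")
print(f"bytes match dataset_size: {abs(total_bytes - DATASET_SIZE) < 1.0}")
for name, share in shares.items():
    print(f"{name}: {share:.2%} of examples")
```

The example total of 402,456 agrees with the "400k+ segmented audios" figure in the description.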

Task categories

Automatic speech recognition (ASR)
Text-to-speech (TTS)
