Racoci/CORAA-v1.1
Source: Hugging Face. Updated 2024-06-01; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Racoci/CORAA-v1.1
Resource description:
---
license: cc-by-nc-nd-4.0
dataset_info:
  features:
  - name: file_path
    dtype: string
  - name: task
    dtype: string
  - name: variety
    dtype: string
  - name: dataset
    dtype: string
  - name: accent
    dtype: string
  - name: speech_genre
    dtype: string
  - name: speech_style
    dtype: string
  - name: up_votes
    dtype: int64
  - name: down_votes
    dtype: int64
  - name: votes_for_hesitation
    dtype: float64
  - name: votes_for_filled_pause
    dtype: float64
  - name: votes_for_noise_or_low_voice
    dtype: float64
  - name: votes_for_second_voice
    dtype: float64
  - name: votes_for_no_identified_problem
    dtype: float64
  - name: text
    dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 63113404687.162
    num_examples: 382258
  - name: dev
    num_bytes: 1363924625
    num_examples: 7522
  - name: test
    num_bytes: 2594334946
    num_examples: 12676
  download_size: 66914186143
  dataset_size: 67071664258.162
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: dev
    path: data/dev-*
  - split: test
    path: data/test-*
task_categories:
- automatic-speech-recognition
- text-to-speech
language:
- pt
pretty_name: coraa
size_categories:
- 100K<n<1M
---
# CORAA-v1.1
[CORAA-v1.1](https://github.com/nilc-nlp/CORAA) is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese. It contains 290.77 hours of audio with the corresponding transcriptions (more than 400k audio segments). The audio comes from five original projects:
- ALIP (Gonçalves, 2019)
- C-ORAL Brazil (Raso and Mello, 2012)
- NURC-Recife (Oliveira Jr., 2016)
- SP-2010 (Mendes and Oushiro, 2012)
- TEDx talks (talks in Portuguese)
The audio was either validated by annotators or transcribed for the first time, with the ASR task in mind.
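As a quick sanity check on the figures above, the sketch below adds up the split sizes listed in the card's front matter and derives an average segment length. It assumes that the 290.77 hours cover exactly the segments in the three splits, which the card does not state explicitly, so the mean is only indicative.

```python
# Figures copied from the dataset card (front matter and description).
SPLITS = {
    "train": {"num_examples": 382_258, "num_bytes": 63_113_404_687.162},
    "dev": {"num_examples": 7_522, "num_bytes": 1_363_924_625},
    "test": {"num_examples": 12_676, "num_bytes": 2_594_334_946},
}
DATASET_SIZE_BYTES = 67_071_664_258.162  # dataset_size in the front matter
TOTAL_HOURS = 290.77                     # stated in the description

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# Assumption: the stated hours correspond to exactly these segments.
mean_seconds = TOTAL_HOURS * 3600 / total_examples

print(f"segments: {total_examples:,}")
print(f"split bytes match dataset_size: {abs(total_bytes - DATASET_SIZE_BYTES) < 1}")
print(f"mean segment length: {mean_seconds:.2f} s")
for name, split in SPLITS.items():
    share = 100 * split["num_examples"] / total_examples
    print(f"{name}: {share:.1f}% of segments")
```

The three splits sum to 402,456 segments, consistent with the "more than 400k" figure, and under the stated assumption the average segment is roughly 2.6 seconds long.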
## LICENSE
[Attribution-NonCommercial-NoDerivatives 4.0 International](https://raw.githubusercontent.com/nilc-nlp/CORAA/main/LICENSE)
## Metadata
- file_path: the path to an audio file
- task: transcription (annotators revised original transcriptions); annotation (annotators classified the audio-transcription pair according to votes_for_* metrics); annotation_and_transcription (both tasks were performed)
- variety: European Portuguese (PT_PT) or Brazilian Portuguese (PT_BR)
- dataset: one of five datasets (ALIP, C-oral Brasil, NURC-RE, SP2010, TEDx Portuguese)
- accent: one of four accents (Minas Gerais, Recife, Sao Paulo cities, Sao Paulo capital) or the value "miscellaneous"
- speech_genre: Interviews, Dialogues, Monologues, Conversations, Conference, Class Talks, Stage Talks or Reading
- speech_style: Spontaneous Speech or Prepared Speech or Read Speech
- up_votes: for annotation, the number of votes validating the audio (most audios were reviewed by one annotator, but some were analyzed by more than one)
- down_votes: for annotation, the number of votes invalidating the audio (always smaller than up_votes)
- votes_for_hesitation: for annotation, votes categorizing the audio as having the hesitation phenomenon
- votes_for_filled_pause: for annotation, votes categorizing the audio as having the filled pause phenomenon
- votes_for_noise_or_low_voice: for annotation, votes categorizing the audio as having either noise or low voice, without impairing the audio comprehension
- votes_for_second_voice: for annotation, votes categorizing the audio as having a second voice, without impairing the audio comprehension
- votes_for_no_identified_problem: for annotation, votes categorizing the audio as having no identified phenomenon (of the four described above)
- text: the transcription for the audio
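The vote fields above can be used to select a cleaner subset. The following is a minimal sketch, not part of the original release: the helper names `votes` and `is_clean` and the sample rows are hypothetical, and it assumes that the `float64` vote columns may hold NaN for rows that did not go through the annotation task.

```python
import math
from typing import Any, Mapping

# The four phenomena described in the metadata list.
VOTE_FLAGS = (
    "votes_for_hesitation",
    "votes_for_filled_pause",
    "votes_for_noise_or_low_voice",
    "votes_for_second_voice",
)


def votes(row: Mapping[str, Any], field: str) -> float:
    """Return a vote count, treating missing or NaN values as zero (assumption)."""
    value = row.get(field)
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0.0
    return float(value)


def is_clean(row: Mapping[str, Any]) -> bool:
    """True if up votes outnumber down votes and no phenomenon was flagged."""
    validated = votes(row, "up_votes") > votes(row, "down_votes")
    flagged = any(votes(row, flag) > 0 for flag in VOTE_FLAGS)
    return validated and not flagged


# Hypothetical rows for illustration only; real rows carry all 16 features.
rows = [
    {"file_path": "a.wav", "up_votes": 1, "down_votes": 0,
     "votes_for_hesitation": 0.0, "votes_for_noise_or_low_voice": 0.0},
    {"file_path": "b.wav", "up_votes": 2, "down_votes": 1,
     "votes_for_noise_or_low_voice": 1.0},
    {"file_path": "c.wav", "up_votes": 1, "down_votes": 0,
     "votes_for_hesitation": float("nan")},
]

clean = [row["file_path"] for row in rows if is_clean(row)]
print(clean)
```

Since the card says down_votes is always smaller than up_votes, the `validated` check is mostly a safeguard; the filter on the four phenomenon flags does the real work, and the thresholds can be relaxed if too few segments survive.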
## Experiments:
- [Checkpoints](https://drive.google.com/drive/folders/10JkbCzYypZtCz1nHY5rBoBM1r66P3p3j?usp=sharing)
- [Code](https://github.com/Edresson/Wav2Vec-Wrapper)
Model trained on this corpus: Wav2Vec 2.0 XLSR-53 (multilingual pretraining)
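For readers who want to feed this copy of the corpus to a Wav2Vec 2.0 style model, the sketch below streams one split with the Hugging Face `datasets` library and resamples the audio to 16 kHz, the rate such models expect. It is an illustration rather than the authors' pipeline (their code is linked above): the function names `stream_split` and `preview` are hypothetical, the repository ID is taken from this page's title, and whether the stored audio already is 16 kHz is not stated in the card.

```python
REPO_ID = "Racoci/CORAA-v1.1"       # repository ID as given on this page
SPLITS = ("train", "dev", "test")   # split names from the card's front matter
TARGET_SAMPLING_RATE = 16_000       # input rate expected by Wav2Vec 2.0 models


def stream_split(split: str = "dev"):
    """Return a streaming iterator over one split, resampled on the fly."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the module can be loaded without `datasets` installed.
    from datasets import Audio, load_dataset

    stream = load_dataset(REPO_ID, split=split, streaming=True)
    # Resampling happens when an example's audio is decoded.
    return stream.cast_column("audio", Audio(sampling_rate=TARGET_SAMPLING_RATE))


def preview(split: str = "dev", limit: int = 3) -> None:
    """Print the path and transcription of the first few examples."""
    for index, example in enumerate(stream_split(split)):
        if index >= limit:
            break
        print(example["file_path"], "->", example["text"])
```

Calling `preview("dev")` requires network access and the `datasets` package. Streaming avoids downloading the full archive (about 67 GB according to the front matter) before inspecting any rows.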
## Citation
- [Preprint](https://arxiv.org/abs/2110.15731):
```
@misc{c2021coraa,
title={CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese},
author={Arnaldo Candido Junior and Edresson Casanova and Anderson Soares and Frederico Santos de Oliveira and Lucas Oliveira and Ricardo Corso Fernandes Junior and Daniel Peixoto Pinto da Silva and Fernando Gorgulho Fayet and Bruno Baldissera Carlotto and Lucas Rafael Stefanel Gris and Sandra Maria Aluísio},
year={2021},
eprint={2110.15731},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
- Full Paper: coming soon
- Official site: [Tarsila Project](https://sites.google.com/view/tarsila-c4ai/)
## Partners / Sponsors / Funding
- [C4AI](https://c4ai.inova.usp.br/pt/home-2/)
- [CEIA](https://centrodeia.org/)
- [UFG](https://www.ufg.br/)
- [USP](https://www5.usp.br/)
- [UTFPR](http://www.utfpr.edu.br/)
## References
- Gonçalves SCL (2019) Projeto ALIP (amostra linguística do interior paulista) e banco de dados iboruna: 10 anos de contribuição com a descrição do Português Brasileiro. Revista Estudos Linguísticos 48(1):276–297.
- Raso T, Mello H, Mittmann MM (2012) The C-ORAL-BRASIL I: Reference corpus for spoken Brazilian Portuguese. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey, pp 106–113, URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/624_Paper.pdf
- Oliveira Jr M (2016) Nurc digital um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (NURC). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Linguísticos 3(2):149–174, URL https://revistas.uam.es/chimera/article/view/6519
- Mendes RB, Oushiro L (2012) Mapping Paulistano Portuguese: the SP2010 Project. In: Proceedings of the VIIth GSCP International Conference: Speech and Corpora, Firenze University Press, Firenze, Italy, pp 459–463.
Provider:
Racoci
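## Summary of original information
### Dataset overview
#### Basic information
- Dataset name: CORAA-v1.1
- Dataset size: about 290.77 hours of audio, more than 400,000 audio segments
- Language: Brazilian Portuguese (PT_BR)
- License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC-BY-NC-ND-4.0)
#### Dataset composition
- Source projects:
  - ALIP (Gonçalves, 2019)
  - C-ORAL Brazil (Raso and Mello, 2012)
  - NURC-Recife (Oliveira Jr., 2016)
  - SP-2010 (Mendes and Oushiro, 2012)
  - TEDx talks (in Portuguese)
#### Dataset features
- file_path: path to the audio file
- task: transcription (annotators revised original transcriptions); annotation (annotators classified the audio-transcription pair according to the votes_for_* metrics); annotation_and_transcription (both tasks were performed)
- variety: European Portuguese (PT_PT) or Brazilian Portuguese (PT_BR)
- dataset: one of five datasets (ALIP, C-oral Brasil, NURC-RE, SP2010, TEDx Portuguese)
- accent: one of four accents (Minas Gerais, Recife, Sao Paulo cities, Sao Paulo capital) or "miscellaneous"
- speech_genre: Interviews, Dialogues, Monologues, Conversations, Conference, Class Talks, Stage Talks or Reading
- speech_style: Spontaneous Speech, Prepared Speech or Read Speech
- up_votes: for annotation, the number of votes validating the audio (most audios were reviewed by one annotator, but some were analyzed by more than one)
- down_votes: for annotation, the number of votes invalidating the audio (always smaller than up_votes)
- votes_for_hesitation: for annotation, votes categorizing the audio as having the hesitation phenomenon
- votes_for_filled_pause: for annotation, votes categorizing the audio as having the filled pause phenomenon
- votes_for_noise_or_low_voice: for annotation, votes categorizing the audio as having noise or low voice, without impairing the audio comprehension
- votes_for_second_voice: for annotation, votes categorizing the audio as having a second voice, without impairing the audio comprehension
- votes_for_no_identified_problem: for annotation, votes categorizing the audio as having none of the four phenomena above
- text: the transcription of the audio
#### Dataset splits
- train: 382,258 examples, 63,113,404,687.162 bytes
- dev: 7,522 examples, 1,363,924,625 bytes
- test: 12,676 examples, 2,594,334,946 bytes
#### Task categories
- Automatic speech recognition (ASR)
- Text-to-speech (TTS)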



