inesc-id/camoes_SI
Source: Hugging Face | Updated: 2025-12-02 | Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/inesc-id/camoes_SI
---
language:
- pt
license: cc-by-nc-4.0
pretty_name: camoes_SI
size_categories:
- 10K<n<100K
tags:
- audio
- speech_recognition
- portuguese
- european_portuguese
- sociolinguistics
task_categories:
- automatic-speech-recognition
---
# camoes_SI
## Dataset Description
The **camoes_SI** dataset is a curated combination of two European
Portuguese sociolinguistic corpora, **Fala Bracarense** and
**Português Fundamental**, merged into a unified **test-only**
dataset for evaluating Automatic Speech Recognition (ASR) systems.
All audio is provided as **16 kHz PCM waveforms**, accompanied by
speaker metadata and reference transcripts.
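As a quick sanity check on the audio format, the sketch below computes the duration of one decoded sample and confirms its sampling rate. It assumes each decoded `audio` entry exposes an `array` of samples and a `sampling_rate`, as the `datasets` Audio feature conventionally does; the helper names and the synthetic sample are hypothetical.

```python
def duration_seconds(audio: dict) -> float:
    """Duration of one decoded audio sample, in seconds."""
    return len(audio["array"]) / audio["sampling_rate"]


def is_16khz(audio: dict) -> bool:
    """True if the sample uses the 16 kHz rate stated in this card."""
    return audio["sampling_rate"] == 16_000


# Synthetic one-second sample for illustration; not taken from the dataset.
toy_audio = {"array": [0.0] * 16_000, "sampling_rate": 16_000}
print(duration_seconds(toy_audio), is_16khz(toy_audio))
```

With the real data, the same helpers could be applied to the `audio` field of a loaded row.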
This dataset corresponds to the **Sociolinguistic Interviews (SI)**
category of the **CAMÕES benchmark**, introduced in:
**CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese**
In the CAMÕES benchmark, the SI category represents:
> **Highly spontaneous conversational speech**, recorded in various
> Portuguese regions and social contexts, often with **poor recording
> conditions** and highly **accented speech**, making this the most challenging
> domain in this benchmark.
As mentioned above, SI is the **most difficult domain** in the benchmark and a
strong indicator of model robustness in real conversational scenarios.
------------------------------------------------------------------------
## Source Components
The dataset here on Hugging Face is the **union of two corpora**, both
referenced in the CAMÕES benchmark:
### **Fala Bracarense**
- **Duration:** 6.1 hours
- **Speakers:** 9
- **Age range:** 15–92
- **Gender:** 45% male | 55% female
- **Location:** Braga, Portugal
- **Collection period:** 2009–2014
### **Português Fundamental**
- **Duration:** 4.2 hours
- **Speakers:** 169
- **Age range:** 17–69
- **Gender:** 44% male | 56% female
- **Collection period:** 1970s
------------------------------------------------------------------------
## Data Fields
| Column Name | Type | Description |
|--------------|--------|-------------|
| `audio` | Audio | 16 kHz PCM waveform of the utterance |
| `age` | string | Speaker-reported age |
| `gender` | string | Speaker gender |
| `speaker_id` | string | Unique speaker identifier |
| `hypothesis` | string | ASR hypothesis transcript (if available) |
| `reference` | string | Ground-truth transcript |
| `wrd` | string | Word count or related metric |
| `wer` | string | Word Error Rate (if available) |
| `dataset` | string | Source dataset identifier |
| `ID` | string | Unique utterance ID |
| `ncount` | string | Additional metadata field |
| `sex` | string | Speaker sex / alternative gender field |
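The `reference` and `hypothesis` columns can be compared to obtain a word error rate. The sketch below is a minimal word-level edit-distance implementation, not the benchmark's official scoring: it assumes whitespace tokenization and applies no text normalization, so its values may differ from those stored in the `wer` column. The function names are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one utterance: errors / reference words."""
    errors, n_words = word_errors(reference, hypothesis)
    if n_words == 0:
        raise ValueError("reference transcript is empty")
    return errors / n_words
```

For a corpus-level score, summing the errors and the reference word counts over all utterances before dividing is usually preferable to averaging per-utterance rates, because it weights long and short utterances consistently.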
------------------------------------------------------------------------
## Example Usage
``` python
from datasets import load_dataset

# Load the test split (the dataset is test-only)
ds = load_dataset("inesc-id/camoes_SI", split="test")

# Optional: keep a single source corpus by filtering on the `dataset` column
# ("FBracarense" or "PTFundamental"); skip this step to evaluate on both
fb = ds.filter(lambda x: x["dataset"] == "FBracarense")
```
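Before filtering, it can help to see how utterances and speakers are distributed across the two source corpora. The sketch below counts both per value of the `dataset` column; it works on any iterable of dict-like rows. The function name and the toy rows are hypothetical, and the speaker IDs shown are not real values from the dataset.

```python
from collections import defaultdict


def summarize_by_source(rows):
    """Count utterances and distinct speakers per value of the `dataset` column."""
    utterances = defaultdict(int)
    speakers = defaultdict(set)
    for row in rows:
        source = row["dataset"]
        utterances[source] += 1
        speakers[source].add(row["speaker_id"])
    return {
        source: {"utterances": utterances[source], "speakers": len(speakers[source])}
        for source in utterances
    }


# Hypothetical rows for illustration only; real IDs in the dataset will differ.
toy_rows = [
    {"dataset": "FBracarense", "speaker_id": "spk_a"},
    {"dataset": "FBracarense", "speaker_id": "spk_a"},
    {"dataset": "PTFundamental", "speaker_id": "spk_b"},
]
print(summarize_by_source(toy_rows))
```

On the loaded dataset, passing `ds.select_columns(["dataset", "speaker_id"])` instead of `ds` should avoid decoding the audio while iterating.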
------------------------------------------------------------------------
## License
This dataset is released under the **CC BY-NC 4.0** license.\
You may use, modify, and redistribute the dataset for **non-commercial
research purposes**.\
Commercial use is **not permitted**.
------------------------------------------------------------------------
## References
### Fala Bracarense
``` bibtex
@misc{FalaBracarense2009,
author = {{Centro de Estudos Humanísticos, Universidade do Minho}},
title = {Perfil Sociolinguístico da Fala Bracarense},
year = {2009},
howpublished = {\url{https://sites.google.com/site/projectofalabracarense/}},
note = {Accessed: 2025-05-21}
}
```
### Português Fundamental
``` bibtex
@misc{PTFundamental2014,
author = {{Centro de Linguística, Universidade de Lisboa}},
title = {Português Fundamental},
year = {2014},
howpublished = {\url{https://www.islrn.org/resources/812-337-422-842-3/}},
note = {Accessed: 2025-05-21}
}
```
### CAMÕES Benchmark
``` bibtex
@inproceedings{2025CAMOES,
title={CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese},
author={Carvalho, Carlos and Teixeira, Francisco and Botelho, Catarina and Pompili, Anna and Solera-Ureña, Rubén and Paulo, Sérgio and Julião, Mariana and Rolland, Thomas and Mendonça, John and Pereira, Diogo and Trancoso, Isabel and Abad, Alberto},
booktitle={IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
year={2025}
}
```
Provider:
inesc-id



