inesc-id/camoes_SI

Name: inesc-id/camoes_SI
Creator: inesc-id
Published: 2025-12-02 12:49:22
License: CC BY-NC 4.0 (per the dataset card below)

Hugging Face · Updated 2025-12-02 · Indexed 2026-01-03

Download link:

https://hf-mirror.com/datasets/inesc-id/camoes_SI

Resource description:

---
language:
- pt
license: cc-by-nc-4.0
pretty_name: camoes_SI
size_categories:
- 10K<n<100K
tags:
- audio
- speech_recognition
- portuguese
- european_portuguese
- sociolinguistics
task_categories:
- automatic-speech-recognition
---

# camoes_SI

## Dataset Description

The **camoes_SI** dataset is a curated combination of two European Portuguese sociolinguistic corpora, **Fala Bracarense** and **Português Fundamental**, merged into a unified **test-only** dataset for evaluating Automatic Speech Recognition (ASR) systems. All audio is provided as **16 kHz PCM waveforms**, accompanied by speaker metadata and reference transcripts.

This dataset corresponds to the **Sociolinguistic Interviews (SI)** category of the **CAMÕES benchmark**, introduced in:

**CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese**

In the CAMÕES benchmark, the SI category represents:

> **Highly spontaneous conversational speech**, recorded in various
> Portuguese regions and social contexts, often with **poor recording
> conditions** and highly **accented speech**, making this the most challenging
> domain in this benchmark.

Because it is the **most difficult domain** in the benchmark, SI is a strong indicator of model robustness in real conversational scenarios.

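As a quick sanity check on the audio format described above, the sketch below computes an utterance's duration and confirms the 16 kHz rate. It assumes a decoded `audio` value is a dict with `array` and `sampling_rate` keys, which is how some `datasets` releases expose audio samples; other releases may return a decoder object instead. The helper name `describe_utterance` and the synthetic row are illustrative only and are not part of the dataset.

```python
EXPECTED_RATE = 16_000  # Hz, per the dataset description


def describe_utterance(example):
    """Return (duration_seconds, rate_ok) for one decoded example.

    Assumes example["audio"] is a dict with "array" (a sequence of samples)
    and "sampling_rate" (int). This shape is an assumption about how the
    audio column is decoded, not something the dataset card states.
    """
    audio = example["audio"]
    rate = audio["sampling_rate"]
    duration = len(audio["array"]) / rate
    return duration, rate == EXPECTED_RATE


# Synthetic stand-in for a decoded row: 2 seconds of silence at 16 kHz.
fake_row = {"audio": {"array": [0.0] * 32_000, "sampling_rate": 16_000}}
duration, rate_ok = describe_utterance(fake_row)
print(f"{duration:.2f} s, 16 kHz: {rate_ok}")
```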

------------------------------------------------------------------------

## Source Components

The dataset here on Hugging Face is the **union of two corpora**, both referenced in the CAMÕES benchmark:

### **Fala Bracarense**

- **Duration:** 6.1 hours
- **Speakers:** 9
- **Age range:** 15–92
- **Gender:** 45% male | 55% female
- **Location:** Braga, Portugal
- **Collection period:** 2009–2014

### **Português Fundamental**

- **Duration:** 4.2 hours
- **Speakers:** 169
- **Age range:** 17–69
- **Gender:** 44% male | 56% female
- **Collection period:** 1970s

------------------------------------------------------------------------

## Data Fields

| Column Name  | Type   | Description                              |
|--------------|--------|------------------------------------------|
| `audio`      | Audio  | 16 kHz PCM waveform of the utterance     |
| `age`        | string | Speaker-reported age                     |
| `gender`     | string | Speaker gender                           |
| `speaker_id` | string | Unique speaker identifier                |
| `hypothesis` | string | ASR hypothesis transcript (if available) |
| `reference`  | string | Ground-truth transcript                  |
| `wrd`        | string | Word count or related metric             |
| `wer`        | string | Word Error Rate (if available)           |
| `dataset`    | string | Source dataset identifier                |
| `ID`         | string | Unique utterance ID                      |
| `ncount`     | string | Additional metadata field                |
| `sex`        | string | Speaker sex / alternative gender field   |

------------------------------------------------------------------------

## Example Usage

```python
from datasets import load_dataset

# Load the full test split (the dataset is test-only)
ds = load_dataset("inesc-id/camoes_SI", split="test")

# Optional: keep a single source corpus ("FBracarense" or "PTFundamental").
# Skip this line to keep both.
fb = ds.filter(lambda x: x["dataset"] == "FBracarense")
```

------------------------------------------------------------------------

## License

This dataset is released under the **CC BY-NC 4.0** license.\
You may use, modify, and redistribute the dataset for **non-commercial research purposes**.\
Commercial use is **not permitted**.

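## Scoring Sketch

Returning to the `reference`, `hypothesis`, and `dataset` columns listed under Data Fields, the following is a minimal sketch of how word error rate could be aggregated per source corpus. It is not the benchmark's official scoring: the card does not say how the `wer` column was computed or which text normalization was applied, so this version simply splits on whitespace. The function names and the toy rows are hypothetical; with real data, `rows` could be the loaded split itself, provided `hypothesis` is populated.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # At the start of iteration i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_corpus(rows):
    """Aggregate WER per value of the `dataset` column."""
    errors, words = defaultdict(int), defaultdict(int)
    for row in rows:
        e, n = word_errors(row["reference"], row["hypothesis"])
        errors[row["dataset"]] += e
        words[row["dataset"]] += n
    # Skip corpora with no reference words to avoid dividing by zero.
    return {name: errors[name] / words[name] for name in words if words[name]}


# Toy rows (invented for illustration, not taken from the dataset).
rows = [
    {"dataset": "FBracarense", "reference": "bom dia a todos",
     "hypothesis": "bom dia todos"},
    {"dataset": "PTFundamental", "reference": "muito obrigado",
     "hypothesis": "muito obrigado"},
]
print(wer_by_corpus(rows))
```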

------------------------------------------------------------------------

## References

### Fala Bracarense

```bibtex
@misc{FalaBracarense2009,
  author       = {{Centro de Estudos Humanísticos, Universidade do Minho}},
  title        = {Perfil Sociolinguístico da Fala Bracarense},
  year         = {2009},
  howpublished = {\url{https://sites.google.com/site/projectofalabracarense/}},
  note         = {Accessed: 2025-05-21}
}
```

### Português Fundamental

```bibtex
@misc{PTFundamental2014,
  author       = {{Centro de Linguística, Universidade de Lisboa}},
  title        = {Português Fundamental},
  year         = {2014},
  howpublished = {\url{https://www.islrn.org/resources/812-337-422-842-3/}},
  note         = {Accessed: 2025-05-21}
}
```

### CAMÕES Benchmark

```bibtex
@inproceedings{2025CAMOES,
  title     = {CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese},
  author    = {Carvalho, Carlos and Teixeira, Francisco and Botelho, Catarina and Pompili, Anna and Solera-Ureña, Rubén and Paulo, Sérgio and Julião, Mariana and Rolland, Thomas and Mendonça, John and Pereira, Diogo and Trancoso, Isabel and Abad, Alberto},
  booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year      = {2025}
}
```

Provider:

inesc-id
