LaFresCat: a Catalan multi-accent speech dataset for text-to-speech
Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/records/14588400
Description:
LaFresCat Multiaccent
We present LaFresCat, the first Catalan multiaccented and multispeaker dataset.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Commercial use is only possible through licensing by the voice artists. For further information, contact langtech@bsc.es and lafrescaproduccions@gmail.com.
Dataset Details
Dataset Description
The audio in this dataset comes from professional studio recordings made by professional voice actors at Lafresca Creative Studio. This is the raw version of the dataset: no resampling or trimming has been applied. The audio files are stored in WAV format at a 48 kHz sampling rate.
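Since the files are distributed raw, it can be useful to confirm the sampling rate before training. The following is a minimal sketch using Python's standard `wave` module. It assumes the files are PCM-encoded WAV (the bit depth and encoding are not stated here), and the helper names `inspect_wav` and `has_expected_rate` are illustrative, not part of the dataset.

```python
import wave

EXPECTED_RATE_HZ = 48_000  # sampling rate stated in the dataset description


def inspect_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / rate
    return rate, channels, duration


def has_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """True if the file header reports the expected sampling rate."""
    return inspect_wav(path)[0] == expected
```

Only the file header is read, so the check is cheap enough to run over every file. If the files turn out to use an encoding that `wave` does not support, a dedicated audio library would be needed instead.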
In total, there are 4 accents, with 2 speakers per accent (one female and one male). After trimming, the recordings total 3.75 h, divided by speaker ID as follows:
Balear
  olga -> 23.50 min
  quim -> 30.93 min
Central
  elia -> 33.14 min
  grau -> 37.86 min
Occidental (North-Western)
  emma -> 28.67 min
  pere -> 25.12 min
Valencia
  gina -> 22.25 min
  lluc -> 23.58 min
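As a quick consistency check, the per-speaker figures above can be summed and compared with the stated total. This sketch only restates the numbers from the list; the dictionary layout is an illustrative choice.

```python
# Per-speaker durations in minutes, copied from the list above.
DURATIONS_MIN = {
    "balear": {"olga": 23.5, "quim": 30.93},
    "central": {"elia": 33.14, "grau": 37.86},
    "occidental": {"emma": 28.67, "pere": 25.12},
    "valencia": {"gina": 22.25, "lluc": 23.58},
}

# Sum the two speakers of each accent, then sum across accents.
per_accent_min = {accent: sum(spk.values()) for accent, spk in DURATIONS_MIN.items()}
total_min = sum(per_accent_min.values())
total_hours = total_min / 60

for accent, minutes in per_accent_min.items():
    print(f"{accent}: {minutes:.2f} min")
print(f"total: {total_min:.2f} min = {total_hours:.2f} h")
```

The durations add up to 225.05 min, which is about 3.75 h and matches the stated total. The Central accent has the most audio (71.00 min) and the Valencian accent the least (45.83 min), which may matter when balancing training data across accents.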
Uses
This dataset is mainly intended for training text-to-speech and automatic speech recognition models on Catalan accents.
Languages
The dataset is in Catalan (ca-ES).
Dataset Structure
The dataset consists of 2858 audios and transcriptions in the following structure:

lafresca_multiaccent_raw
├── balear
│   ├── olga
│   ├── olga.txt
│   ├── quim
│   └── quim.txt
├── central
│   ├── elia
│   ├── elia.txt
│   ├── grau
│   └── grau.txt
├── full_filelist.txt
├── occidental
│   ├── emma
│   ├── emma.txt
│   ├── pere
│   └── pere.txt
└── valencia
    ├── gina
    ├── gina.txt
    ├── lluc
    └── lluc.txt
The dataset's metadata is in the file `full_filelist.txt`. Each line represents one audio file and follows the format:
audio_path | speaker_id | transcription
The speaker ids have the following mapping:
"quim": 0, "olga": 1, "grau": 2, "elia": 3, "pere": 4, "emma": 5, "lluc": 6, "gina": 7
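The metadata format and speaker mapping above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: fields are separated by `|` with optional surrounding spaces, the speaker field holds either the numeric id or the speaker name, and the file is UTF-8. None of this is confirmed here, and `Entry`, `parse_line`, and `load_filelist` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

# Name-to-id mapping copied from the dataset card.
SPEAKER_IDS = {"quim": 0, "olga": 1, "grau": 2, "elia": 3,
               "pere": 4, "emma": 5, "lluc": 6, "gina": 7}
ID_TO_SPEAKER = {idx: name for name, idx in SPEAKER_IDS.items()}


class Entry(NamedTuple):
    audio_path: str
    speaker_id: int
    transcription: str


def parse_line(line):
    """Split one filelist line into its three fields.

    Assumption: '|' separates the fields, and the speaker field is either
    a numeric id or a speaker name from SPEAKER_IDS.
    """
    # maxsplit=2 keeps any further '|' characters inside the transcription.
    audio_path, speaker, transcription = (part.strip() for part in line.split("|", 2))
    speaker_id = int(speaker) if speaker.isdigit() else SPEAKER_IDS[speaker]
    return Entry(audio_path, speaker_id, transcription)


def load_filelist(path):
    """Parse every non-empty line of a filelist (assumed UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

For example, `load_filelist("lafresca_multiaccent_raw/full_filelist.txt")` would return a list of `Entry` tuples, and `ID_TO_SPEAKER[entry.speaker_id]` recovers the speaker name for each one.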
Dataset Creation
This dataset was created by members of the Language Technologies unit of the Life Sciences department at the Barcelona Supercomputing Center, except for the Valencian sentences, which were created with the support of Cenid, the Digital Intelligence Center of the University of Alicante. The voices belong to professional voice actors and were recorded at Lafresca Creative Studio.
Source Data
The data presented in this dataset is the source data.
Data Collection and Processing
These are the technical details of the data collection and processing:
Microphone: Austrian Audio OC818
Preamp: Focusrite ISA Two
Audio Interface: Antelope Orion 32+
DAW: ProTools 2023.6.0
Processing:
  Noise Gate: C1 Gate
  Compression: BF-76
  De-Esser: Renaissance
  EQ: Maag EQ2
  EQ: FabFilter Pro-Q3
  Limiter: L1 Ultramaximizer
Here is the information about the speakers:

| Dialect    | Gender | County          |
|------------|--------|-----------------|
| Central    | male   | Barcelonès      |
| Central    | female | Barcelonès      |
| Balear     | female | Pla de Mallorca |
| Balear     | male   | Llevant         |
| Occidental | male   | Baix Ebre       |
| Occidental | female | Baix Ebre       |
| Valencian  | female | Ribera Alta     |
| Valencian  | male   | La Plana Baixa  |
Who are the source data producers?
The Language Technologies team from the Life Sciences department at the Barcelona Supercomputing Center developed this dataset. It features recordings by professional voice actors made at Lafresca Creative Studio.
Annotations
To check for errors in the transcriptions, we created a Label Studio space. There, we manually listened to a subset of the dataset and compared what we heard with the transcription. If a transcription was wrong, we corrected it.
Personal and Sensitive Information
The dataset consists of voice recordings made by professional voice actors. You agree not to attempt to determine the identity of the speakers in this dataset.
Bias, Risks, and Limitations
Training a Text-to-Speech (TTS) model by fine-tuning on a single Catalan speaker of a particular dialect has significant limitations. The main challenge is capturing the full range of variability inherent in that accent. Each dialect has its own phonetic, intonational, and prosodic characteristics, which can vary greatly even within a single linguistic region. Consequently, a TTS model trained on a narrow dialect sample will struggle to generalize across accents and sub-dialects, leading to reduced accuracy and naturalness. Achieving a standard representation is also very difficult, because linguistic features differ markedly not only between dialects but also among individual speakers within the same dialect group. These variations include subtle nuances in pronunciation, rhythm, and speech patterns that are hard to standardize in a model trained on a limited dataset.
Funding
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project. In addition, the Valencian sentences were created within the framework of the NEL-VIVES project 2022/TL22/00215334.
Dataset Card Contact
langtech@bsc.es
Created:
2025-02-18



