CAESAR-TV3
ModelScope community. Updated 2025-12-05; indexed 2025-05-10.
Download link:
https://modelscope.cn/datasets/BSC-LT/CAESAR-TV3
Resource description:
# Dataset card for CAESAR-TV3
## Dataset Description
- **Homepage:** [Project Aina](https://www.bsc.es/research-and-development/not-assigned-pages/about-aina)
- **Repository:** [CAESAR-TV3](https://huggingface.co/datasets/BSC-LT/CAESAR-TV3)
### Dataset Summary
This corpus contains 5 hours and 45 minutes of Catalan speech code-switched with Spanish, extracted from the original [tv3_parla](https://huggingface.co/datasets/collectivat/tv3_parla) dataset.
### Supported Tasks and Leaderboards
The CAESAR-TV3 dataset is designed for the Automatic Speech Recognition (ASR) task, enabling the transcription of utterances in Catalan, Spanish, and code-switched speech between the two languages.
### Languages
The dataset features code-switched speech, combining Catalan (ca) and Spanish (es) within the same audio samples.
## Dataset Structure
### Data Instances
```
{
  'audio': {
    'path': '1429389_1303379885477_289.900_296.740.wav',
    'array': array([0.04263306, 0.06085205, 0.0710144 , ..., 0.04855347, 0.05911255,
                    0.03530884]),
    'sampling_rate': 16000
  },
  'transcription': "els dies de tempesta les onades fan un so esgarrifós en l'angosta fenedura de sa roncadora"
}
```
### Data Fields
- `audio` (dict): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
- `transcription` (str): Transcription of the audio file.
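
As a minimal sketch of how the `audio` fields fit together, the duration of a clip is the length of the decoded array divided by the sampling rate. The helper name `clip_duration_seconds` and the toy example are illustrative assumptions, not part of the dataset's tooling. A plain list stands in for the decoded audio array.

```python
def clip_duration_seconds(example: dict) -> float:
    """Return the duration of one example in seconds.

    Assumes example["audio"] holds a 1-D "array" of samples and an
    integer "sampling_rate", as in the data instance shown above.
    """
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Toy stand-in for a real example: 2 seconds of silence at 16 kHz.
toy_example = {
    "audio": {
        "path": "toy.wav",  # hypothetical file name
        "array": [0.0] * 32000,
        "sampling_rate": 16000,
    },
}

print(clip_duration_seconds(toy_example))  # 2.0
```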
### Data Splits
The dataset is split into "train", "validation", and "test".
### Data loading
```python
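from datasets import DownloadConfig, load_dataset

# Default download settings; pass arguments to DownloadConfig to customise them.
download_config = DownloadConfig()
data = load_dataset("BSC-LT/CAESAR-TV3", download_config=download_config, data_dir="data")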
```
## Dataset Creation
The original dataset was created by Baybars Külebi and Alp Öktem from [Collectivat](https://huggingface.co/collectivat). The selection and curation of the audio clips containing ca-es code-switched speech were carried out by Jacobo Romero-Diaz.
### Curation Rationale
This corpus specifically focuses on Catalan code-switched with Spanish, a linguistic phenomenon that is very common in the daily lives of Catalonians.
This task is particularly low-resourced because, in addition to targeting a variety of Catalan, it further narrows the available data to speech that contains code-switching, a complex and less-explored aspect of language use.
### Source Data
This corpus was extracted from the original [tv3_parla](https://huggingface.co/datasets/collectivat/tv3_parla) dataset that includes 240 hours of Catalan speech from broadcast material.
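
For a sense of scale, the short calculation below compares the two durations stated in this card. It uses only the 5 hours 45 minutes and 240 hours figures given here; the variable names are illustrative.

```python
# Durations as stated in this card, converted to hours.
subset_hours = 5 + 45 / 60   # 5 h 45 min of code-switched speech
source_hours = 240           # size of the tv3_parla source corpus

share = subset_hours / source_hours
print(f"{share:.1%}")  # 2.4%
```

On these figures, the code-switched subset is roughly one fortieth of the source corpus.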
### Data Collection and Processing
To extract the code-switched (CS) portion, we used a BERT-based detector: [Google’s multilingual BERT](https://arxiv.org/pdf/1810.04805) was fine-tuned for token classification using a synthetic corpus of code-switched dialogues in Catalan and Spanish.
During fine-tuning, each word was labeled with its corresponding language token.
Once trained, the model was applied to the transcriptions of the original TV3 Parla dataset, where it performed token-level language classification.
This process resulted in a "language count" for each audio file, indicating the distribution of Catalan and Spanish within the transcription.
Because the audio clips are short, a clip was considered code-switched if its transcription contained at least three Catalan words and at least three Spanish words.
With this method, we identified a substantial portion of code-switched data, totaling approximately 5 hours and 45 minutes.
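
The filtering rule described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: it assumes the classifier's output has already been reduced to one language label per word (`"ca"` or `"es"`), and the names `language_counts`, `is_code_switched`, and `min_words` are hypothetical.

```python
from collections import Counter


def language_counts(word_labels):
    """Count how many words were tagged with each language label."""
    return Counter(word_labels)


def is_code_switched(word_labels, min_words=3):
    """Apply the threshold rule: each language needs at least min_words words."""
    counts = language_counts(word_labels)
    # Counter returns 0 for labels that never occur.
    return counts["ca"] >= min_words and counts["es"] >= min_words


# Hypothetical per-word labels for two transcriptions.
mixed = ["ca", "ca", "ca", "ca", "es", "es", "es"]
mostly_catalan = ["ca", "ca", "ca", "ca", "ca", "es"]

print(is_code_switched(mixed))           # True
print(is_code_switched(mostly_catalan))  # False
```

Requiring a minimum count in both languages, rather than any single foreign word, keeps isolated borrowings or one-off misclassifications from marking a clip as code-switched.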
## Annotations
The dataset doesn't contain any additional annotations.
## Personal and Sensitive Information
The dataset consists of speech from broadcast material. You agree not to attempt to determine the identity of speakers in this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
CAESAR-TV3 is a source of spontaneous code-switched speech data that will be valuable in the development of speech technologies for Catalan.
### Discussion of Biases
No specific bias mitigation strategies were applied to this dataset. Inherent biases may exist within the data.
### Other Known Limitations
Speakers, their gender, and age are not identified, and one or more speakers could be speaking in the same recording. For these reasons, we don't know the total number of speakers in the corpus and their gender/age.
### Dataset Curators
The corpus was curated by Jacobo Romero-Diaz in 2024 at the [Barcelona Supercomputing Center](https://www.bsc.es/).
### Licensing Information
Creative Commons Attribution Non-Commercial 4.0
### Citation Information
```
@misc{caesar-tv3-bsc2025,
title={CAESAR collection for Catalan and Spanish Code-switching datasets},
author={Romero-Diaz, Jacobo and Messaoudi, Abir and Armentaro, Carme and Giraldo, José},
publisher={Barcelona Supercomputing Center},
year={2025},
url={https://huggingface.co/datasets/BSC-LT/CAESAR-TV3}
}
```
### Contributions
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
Provider: maas
Created: 2025-05-03



