CAESAR-TV3

Name: CAESAR-TV3
Creator: maas
Published: 2025-12-05 16:33:10
License: No description provided

ModelScope community. Updated: 2025-12-05. Indexed: 2025-05-10.

Download link:

https://modelscope.cn/datasets/BSC-LT/CAESAR-TV3

Resource description:

# Dataset card for CAESAR-TV3

## Dataset Description

- **Homepage:** [Project Aina](https://www.bsc.es/research-and-development/not-assigned-pages/about-aina)
- **Repository:** [CAESAR-TV3](https://huggingface.co/datasets/BSC-LT/CAESAR-TV3)

### Dataset Summary

This corpus includes 5 hours and 45 minutes of Catalan speech code-switched with Spanish, extracted from the original [tv3_parla](https://huggingface.co/datasets/collectivat/tv3_parla) dataset.

### Supported Tasks and Leaderboards

The CAESAR-TV3 dataset is designed for the Automatic Speech Recognition (ASR) task, enabling the transcription of utterances in Catalan, Spanish, and code-switched speech between the two languages.

### Languages

The dataset features code-switched speech, combining Catalan (ca) and Spanish (es) within the same audio samples.

## Dataset Structure

### Data Instances

```
{
    'audio': {
        'path': '1429389_1303379885477_289.900_296.740.wav',
        'array': array([0.04263306, 0.06085205, 0.0710144 , ..., 0.04855347, 0.05911255, 0.03530884]),
        'sampling_rate': 16000
    },
    'transcription': "els dies de tempesta les onades fan un so esgarrifós en l'angosta fenedura de sa roncadora"
}
```

### Data Fields

- `audio` (dict): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
- `text` (str): Transcription of the audio file.

### Data Splits

The dataset is split into "train", "validation", and "test".

### Data loading

```python
from datasets import DownloadConfig, load_dataset

# Default download settings; adjust (e.g. cache directory, retries) as needed.
download_config = DownloadConfig()
data = load_dataset("BSC-LT/CAESAR-TV3", download_config=download_config, data_dir="data")
```

## Dataset Creation

The original set was created by Baybars Külebi and Alp Öktem from [Collectivat](https://huggingface.co/collectivat). However, the audio clips containing ca-es code-switched data were selected and curated by Jacobo Romero-Diaz.

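
The instance layout described under Data Instances and Data Fields can be explored with a small helper like the one below. This is only a sketch: the example row is a hypothetical stand-in (one second of silence, not real corpus audio), the function names are illustrative, and because the sample instance uses the key `transcription` while the field list names `text`, the helper accepts either key rather than assuming which one the released files use.

```python
from typing import Any, Mapping, Sequence


def clip_duration_seconds(audio: Mapping[str, Any]) -> float:
    """Length of a decoded clip: number of samples divided by the sampling rate."""
    samples: Sequence[float] = audio["array"]
    return len(samples) / audio["sampling_rate"]


def get_transcript(example: Mapping[str, Any]) -> str:
    """Return the transcript, trying both key names seen in the card."""
    for key in ("transcription", "text"):
        if key in example:
            return example[key]
    raise KeyError("no transcript field found in example")


# Hypothetical row shaped like the card's sample instance (values are made up).
example = {
    "audio": {"path": "clip.wav", "array": [0.0] * 16000, "sampling_rate": 16000},
    "transcription": "els dies de tempesta",
}

duration = clip_duration_seconds(example["audio"])
transcript = get_transcript(example)
```

Summing `clip_duration_seconds` over every row of a split would give that split's total audio length, which can be compared against the 5 hours and 45 minutes reported for the whole corpus.
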
### Curation Rationale

This corpus focuses specifically on Catalan code-switched with Spanish, a linguistic phenomenon that is very common in the daily lives of Catalonians. The task is particularly low-resourced: besides involving a variety of the Catalan language, it further restricts the available data by requiring code-switching, a complex and less-explored aspect of language use.

### Source Data

This corpus was extracted from the original [tv3_parla](https://huggingface.co/datasets/collectivat/tv3_parla) dataset, which includes 240 hours of Catalan speech from broadcast material.

### Data Collection and Processing

To extract the code-switched (CS) part, we used BERT-based detection. [Google’s multilingual BERT](https://arxiv.org/pdf/1810.04805) was fine-tuned for token classification using a synthetic corpus of code-switched dialogues in Catalan and Spanish. During fine-tuning, each word was labeled with its corresponding language token. Once trained, the model was applied to the transcriptions of the original TV3 Parla dataset, where it performed token-level language classification. This process resulted in a "language count" for each audio file, indicating the distribution of Catalan and Spanish within the transcription. Because the audio clips were short, a clip was considered code-switched if its transcription contained at least three words in Catalan and at least three words in Spanish. With this method, we identified a substantial portion of code-switched data, totaling approximately 5 hours and 45 minutes.

## Annotations

The dataset doesn't contain any additional annotations.

## Personal and Sensitive Information

The dataset consists of speech from broadcast material. You agree not to attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

CAESAR-TV3 is a source of spontaneous code-switching speech data that will be valuable in the development of speech technologies for Catalan.

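
The selection rule from Data Collection and Processing (a per-file "language count", with a clip kept when Catalan and Spanish each reach at least three words) can be sketched as follows. This is not the curators' code: the classifier itself is omitted, the per-word labels are hypothetical stand-ins for its output, and the label strings `"ca"` and `"es"` are an assumption borrowed from the language codes used in the card.

```python
from collections import Counter
from typing import Iterable

# Threshold stated in the card: at least three words in each language.
MIN_WORDS_PER_LANGUAGE = 3


def language_counts(labels: Iterable[str]) -> Counter:
    """Count how many words were assigned to each language label."""
    return Counter(labels)


def is_code_switched(labels: Iterable[str],
                     min_words: int = MIN_WORDS_PER_LANGUAGE) -> bool:
    """True when both Catalan and Spanish reach the minimum word count."""
    counts = language_counts(labels)
    # Counter returns 0 for labels that never occur, so no key check is needed.
    return counts["ca"] >= min_words and counts["es"] >= min_words


# Hypothetical per-word labels for two transcriptions.
mixed = ["ca", "ca", "ca", "ca", "es", "es", "es"]
mostly_catalan = ["ca"] * 8 + ["es"] * 2

kept = is_code_switched(mixed)
dropped = is_code_switched(mostly_catalan)
```

Requiring a minimum in both languages, rather than a single Spanish word, makes the filter less sensitive to isolated misclassified tokens or borrowed words in short clips.
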
### Discussion of Biases

No specific bias mitigation strategies were applied to this dataset. Inherent biases may exist within the data.

### Other Known Limitations

Speakers, their gender, and age are not identified, and one or more speakers could be speaking in the same recording. For these reasons, we don't know the total number of speakers in the corpus and their gender/age.

### Dataset Curators

The corpus was curated by Jacobo Romero-Diaz in 2024 at the [Barcelona Supercomputing Center](https://www.bsc.es/).

### Licensing Information

Creative Commons Attribution Non-Commercial 4.0

### Citation Information

```
@misc{caesar-tv3-bsc2025,
  title={CAESAR collection for Catalan and Spanish Code-switching datasets},
  author={Romero-Diaz, Jacobo and Messaoudi, Abir and Armentaro, Carme and Giraldo, José},
  publisher={Barcelona Supercomputing Center},
  year={2025},
  url={https://huggingface.co/datasets/BSC-LT/CAESAR-TV3}
}
```

### Contributions

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

Provider: maas
Created: 2025-05-03