CAESAR-TINY

ModelScope community · updated 2025-12-05 · indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/BSC-LT/CAESAR-TINY
Resource description:
# Dataset Card for CAESAR-TINY

## Dataset Description

- **Homepage:** [Project Aina](https://www.bsc.es/research-and-development/not-assigned-pages/about-aina)
- **Repository:** [CAESAR-TINY](https://huggingface.co/datasets/BSC-LT/CAESAR-TINY)

### Dataset Summary

CAESAR-TINY is a synthetic code-switched dataset generated by combining monolingual samples in Catalan and Spanish. The process includes trimming silences, normalizing audio volume, and introducing random pauses. It contains 2 hours of speech data, created by concatenating audio from the [Common Voice 17 Benchmark split](https://huggingface.co/datasets/projecte-aina/commonvoice_benchmark_catalan_accents) and [VoxForge Spanish](https://huggingface.co/datasets/ciempiess/voxforge_spanish) datasets.

### Example Usage

To load CAESAR-TINY:

```python
from datasets import load_dataset

caesar_tiny = load_dataset("BSC-LT/CAESAR-TINY", split="train")
```

### Supported Tasks

The CAESAR-TINY dataset is designed for the Automatic Speech Recognition (ASR) task, enabling the transcription of utterances in Catalan, Spanish, and code-switched speech between the two languages.

### Languages

The dataset features code-switched speech, combining Catalan (ca) and Spanish (es) within the same audio samples.

## Dataset Structure

### Data Instances

```
{
  'audio': {
    'path': '14.wav',
    'array': array([0., 0., 0., ..., 0., 0., 0.]),
    'sampling_rate': 16000
  },
  'transcription': 'trons del cul tempestat de merda los mismos ocultándose para volver a aparecer con regularidad casi mecánica'
}
```

### Data Fields

* `audio` (datasets.Audio) - a dictionary containing the path to the audio, the decoded audio array, and the sampling rate.
* `transcription` (string) - normalized audio-segment transcription.

### Data Splits

The dataset consists of a single split due to its limited size.
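As a rough illustration of the `audio` and `transcription` fields described above, the sketch below derives a clip duration and word count from one row. The helper name `describe_sample` is hypothetical and not part of the dataset or of the `datasets` library; it assumes each row has the dictionary layout shown under Data Instances.

```python
def describe_sample(sample):
    """Summarize one row shaped like the Data Instances example.

    Assumes sample["audio"] is a dict with "path", "array" and
    "sampling_rate" keys, and sample["transcription"] is a string.
    """
    audio = sample["audio"]
    # Duration in seconds = number of decoded samples / samples per second.
    duration_s = len(audio["array"]) / audio["sampling_rate"]
    return {
        "path": audio["path"],
        "duration_s": duration_s,
        "n_words": len(sample["transcription"].split()),
    }


# Synthetic stand-in row (not real data): 2 seconds of silence at 16 kHz.
example = {
    "audio": {"path": "example.wav", "array": [0.0] * 32000, "sampling_rate": 16000},
    "transcription": "bon dia buenos días",
}
summary = describe_sample(example)
```

With the real dataset, the same helper could be applied to a row of `caesar_tiny` loaded as in Example Usage, provided the installed `datasets` version decodes audio into this dictionary form.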
## Dataset Creation

### Curation Rationale

This corpus focuses on Catalan-Spanish code-switched speech, a linguistic phenomenon that is very common in the daily lives of Catalonians. The task is particularly low-resourced: beyond involving a variety of the Catalan language, it further restricts the available data by requiring code-switching, a complex and less-explored aspect of language use. With this release, we provide the first Catalan-Spanish code-switching (CS) dataset, which will be valuable mainly for training and evaluating Code-Switching Speech Recognition systems in Catalan and Spanish.

### Source Data

The dataset was created by concatenating original audio from the [Common Voice 17 Benchmark split](https://huggingface.co/datasets/projecte-aina/commonvoice_benchmark_catalan_accents) and [VoxForge Spanish](https://huggingface.co/datasets/ciempiess/voxforge_spanish) datasets.

### Data Collection and Processing

The dataset was created using a two-step pipeline, based on [NeMo](https://github.com/NVIDIA/NeMo/tree/main/scripts/speech_recognition/code_switching) scripts, that generates synthetic code-switched speech data from monolingual sources.

1. An intermediate manifest file is generated. It pairs utterances from the two monolingual datasets based on specified language codes, duration constraints, and overall dataset size requirements.
2. The speech data is synthesized by concatenating the selected segments, applying configurable pauses at the beginning, between segments, and at the end of each sample.

The resulting dataset maintains linguistic diversity while ensuring consistency in audio normalization and sampling rate. This approach enables the creation of large-scale, high-quality code-switched datasets suitable for training and evaluating multilingual ASR models.

## Annotations

The dataset doesn't contain any additional annotations.
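The two-step pipeline described under Data Collection and Processing can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the NeMo scripts themselves: every function name, the silence threshold, the use of peak normalization, and the random choice of language order are hypothetical, and plain Python lists stand in for audio arrays.

```python
import random


def build_pair_manifest(ca_items, es_items, min_dur, max_dur, target_seconds, seed=0):
    """Step 1 (sketch): pair one Catalan and one Spanish utterance at a time
    until the requested total duration is reached."""
    rng = random.Random(seed)
    pairs, total = [], 0.0
    # Bound the number of draws so unsatisfiable constraints cannot loop forever.
    for _ in range(100 * (len(ca_items) + len(es_items))):
        if total >= target_seconds:
            break
        first, second = rng.choice(ca_items), rng.choice(es_items)
        if rng.random() < 0.5:  # assumed: either language may come first
            first, second = second, first
        duration = first["duration"] + second["duration"]
        if min_dur <= duration <= max_dur:
            text = first["text"] + " " + second["text"]
            pairs.append({"segments": [first, second], "duration": duration, "text": text})
            total += duration
    return pairs


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below the threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    return samples[loud[0]:loud[-1] + 1] if loud else []


def peak_normalize(samples, peak=0.9):
    """Scale samples so the largest magnitude equals `peak` (assumed method)."""
    top = max((abs(s) for s in samples), default=0.0)
    return [s * peak / top for s in samples] if top > 0 else list(samples)


def synthesize(segments, sampling_rate, pause_range, rng):
    """Step 2 (sketch): concatenate segments with a random pause at the start,
    between segments, and at the end. Pauses are runs of zeros."""
    def pause():
        return [0.0] * int(rng.uniform(*pause_range) * sampling_rate)

    out = pause()
    for i, segment in enumerate(segments):
        if i > 0:
            out += pause()
        out += peak_normalize(trim_silence(segment))
    return out + pause()


# Toy manifests and waveforms (illustrative values only).
ca = [{"lang": "ca", "duration": 2.0, "text": "bon dia"}]
es = [{"lang": "es", "duration": 3.0, "text": "buenos días"}]
pairs = build_pair_manifest(ca, es, min_dur=1.0, max_dur=10.0, target_seconds=12.0)
mixed = synthesize([[0.5] * 10, [0.25] * 5], 100, (0.5, 0.5), random.Random(0))
```

A real pipeline would read waveforms from disk, operate on arrays, and resample all sources to a common rate before concatenation; those steps are omitted here for brevity.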
## Considerations for Using the Data

### Social Impact of Dataset

CAESAR-TINY is a source of code-switching speech data that will be valuable in the development of speech technologies for Catalan and Spanish.

### Discussion of Biases

No specific bias mitigation strategies were applied to this dataset. Inherent biases may exist within the data.

### Other Known Limitations

Gender in the speech recordings is identified, but one or more speakers could be speaking in the same recording. For this reason, we don't know the total number of speakers in the corpus.

### Dataset Curators

The corpus was curated by [Abir Messaoudi](https://huggingface.co/AbirMessaoudi) during 2024 at the [Barcelona Supercomputing Center](https://www.bsc.es/).

### Licensing Information

GNU General Public License v3.0

### Citation Information

```
@misc{caesar-tiny-bsc2024,
  title={CAESAR collection for Catalan and Spanish Code-Switching datasets},
  author={Messaoudi, Abir and Solito, Sarah and Kulebi, Baybars},
  publisher={Barcelona Supercomputing Center},
  year={2024},
  url={https://huggingface.co/datasets/BSC-LT/CAESAR-TV3}
}
```

### Contributions

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

Provider: maas
Created: 2025-10-25