
tunis-ai/TunSwitch

Hugging Face | Updated 2024-05-04 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/tunis-ai/TunSwitch
Description:
---
language:
- ar
pretty_name: TunSwitch
---

The original dataset was obtained from https://zenodo.org/records/8370566

The dataset has not been cleaned yet, and contributions are welcome 🤗

## download instructions

```python
from huggingface_hub import snapshot_download

# Download the whole dataset repository into the current directory.
snapshot_download(repo_id="tunis-ai/TunSwitch", repo_type="dataset", local_dir=".")
```

## Information

This repo contains the data used to develop and test the Tunisian Arabic Automatic Speech Recognition model described in the following paper:

A. A. Ben Abdallah*, A. Kabboudi, A. Kanoun, and S. Zaiem*, “Leveraging data collection and unsupervised learning for code-switched tunisian arabic automatic speech recognition”, submitted to ICASSP 2024, 2023.

\* These two authors contributed equally.

It contains 4 zipped folders of audio data:

- TunSwitchCS.zip: annotated code-switched data.
- TunSwitchTO.zip: annotated Tunisian-only data.
- weakly_labeled_tn.zip: weakly labeled (or unlabeled) audio data. The audio may contain code-switching, but the current weak labels do not.
- test_wavs.zip: annotated test data, divided into a code-switched part and a Tunisian-only part.

It also contains textual data used for language modelling, in TextData.zip. Finally, it contains a language-detailed annotation of TunSwitchCS in language_annotation.zip.

More details about the data are available in the paper.

The current tables are in a SpeechBrain-friendly format. The values in the path column are not meaningful as-is and have to be changed to match your local setup.

Please use the provided train-dev-test splits if you work with this dataset.

Please cite the aforementioned paper if you use or refer to this dataset.

You can find models trained and tested on this dataset here. Space demos are also available.
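The note above about changing the path column could be handled with a small script along the following lines. This is only a sketch under stated assumptions: the column is assumed to be literally named `path`, the tables are assumed to be comma-separated CSV files, and the audio files are assumed to sit directly in one local folder. The function name `rewrite_paths` and the file names in the usage note are hypothetical and not part of the dataset.

```python
import csv
from pathlib import Path, PurePosixPath


def rewrite_paths(csv_in, csv_out, audio_root, column="path"):
    """Rewrite one column of a CSV so each entry points into audio_root.

    Only the file name of the original entry is kept. The column name
    "path" is an assumption taken from the dataset card's wording.
    Returns the number of rows written.
    """
    audio_root = Path(audio_root)
    with open(csv_in, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        if column not in fieldnames:
            raise KeyError(f"column {column!r} not found; available: {fieldnames}")
        rows = []
        for row in reader:
            # Drop the original directory part and keep the file name only.
            filename = PurePosixPath(row[column]).name
            row[column] = str(audio_root / filename)
            rows.append(row)

    with open(csv_out, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call such as `rewrite_paths("train.csv", "train_local.csv", "/data/TunSwitch/wavs")` would write a copy of the table with local paths, leaving the original file untouched. If the archives keep audio in nested subfolders, keeping only the file name is too aggressive and the mapping would need to preserve part of the original directory structure.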
If you use or refer to this dataset, please cite:

## citation

```
@misc{abdallah2023leveraging,
  title={Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition},
  author={Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah Zaiem},
  year={2023},
  eprint={2309.11327},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```
Provider:
tunis-ai
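Summary of original information

Dataset overview

Dataset name

  • TunSwitch

Dataset contents

  • Audio data

    • TunSwitchCS.zip: annotated code-switched data.
    • TunSwitchTO.zip: annotated Tunisian-only data.
    • weakly_labeled_tn.zip: weakly labeled (or unlabeled) audio data. The audio may contain code-switching, but the current weak labels do not.
    • test_wavs.zip: annotated test data, divided into a code-switched part and a Tunisian-only part.
  • Text data

    • TextData.zip: text data used for language modelling.
  • Language-detailed annotation

    • language_annotation.zip: language-detailed annotation of TunSwitchCS.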
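Since everything listed above ships as zip archives, a short helper for unpacking them after download may be useful. The following is a minimal sketch, not part of the dataset's own tooling: the function name `extract_archives` is hypothetical, and it assumes the archives sit in a single download folder and that unpacking each one into a sibling folder named after the archive is acceptable.

```python
import zipfile
from pathlib import Path


def extract_archives(download_dir, names=None):
    """Extract zip archives found in download_dir.

    Each archive is unpacked into download_dir/<archive name without .zip>.
    If names is None, every *.zip file in the folder is extracted;
    otherwise only the listed file names. Missing files are skipped.
    Returns the list of target folders that were created or filled.
    """
    download_dir = Path(download_dir)
    if names is None:
        archives = sorted(download_dir.glob("*.zip"))
    else:
        archives = [download_dir / name for name in names]

    extracted = []
    for archive in archives:
        if not archive.is_file():
            continue  # skip archives that were not downloaded
        target = download_dir / archive.stem
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

For example, `extract_archives(".", names=["TunSwitchCS.zip", "test_wavs.zip"])` would unpack only those two archives. If an archive already contains a top-level folder of the same name, the result is nested one level deeper than expected, so it is worth inspecting one archive before relying on the layout.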

Dataset purpose

  • Developing and testing Tunisian Arabic automatic speech recognition models.

Citation

  • If you use or refer to this dataset, please cite:

    @misc{abdallah2023leveraging,
      title={Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition},
      author={Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah Zaiem},
      year={2023},
      eprint={2309.11327},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
    }
