tunis-ai/TunSwitch
Source: Hugging Face. Updated 2024-05-04; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/tunis-ai/TunSwitch
Resource description:
---
language:
- ar
pretty_name: TunSwitch
---
The original dataset was obtained from https://zenodo.org/records/8370566
The dataset has not been cleaned yet, and contributions are welcome 🤗
## Download instructions
```python
from huggingface_hub import snapshot_download

# Download every file of the dataset repo into the current directory.
snapshot_download(repo_id="tunis-ai/TunSwitch", repo_type="dataset", local_dir=".")
```
## Information
This repo contains the data used to develop and test the Tunisian Arabic automatic speech recognition (ASR) model presented in the following paper:
A. A. Ben Abdallah*, A. Kabboudi, A. Kanoun, and S. Zaiem*, “Leveraging data collection and unsupervised learning for code-switched Tunisian Arabic automatic speech recognition”, submitted to ICASSP 2024, 2023. (* These two authors contributed equally.)
It contains four zipped folders of audio data:
- TunSwitchCS.zip: annotated code-switched data.
- TunSwitchTO.zip: annotated Tunisian-only data.
- weakly_labeled_tn.zip: weakly labeled (or unlabeled) audio data. The audio may contain code-switching, but the current weak labels do not.
- test_wavs.zip: annotated test data, split into a code-switched part and a Tunisian-only part.
It also contains textual data used for language modelling, in TextData.zip, and a language-detailed annotation of TunSwitchCS in language_annotation.zip.
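If only some of the archives are needed, the download can be restricted and the result unpacked locally. The following is a minimal sketch, assuming the zip files sit at the root of the repo under the names listed above; `download_subset`, `extract_archives`, and the `data` target folder are hypothetical names introduced for illustration.

```python
import zipfile
from pathlib import Path


def extract_archives(root: str, names: list[str]) -> list[Path]:
    """Unpack each named zip found under `root` into a folder of the same stem."""
    extracted = []
    for name in names:
        archive = Path(root) / name
        if not archive.is_file():
            continue  # archive was not downloaded; skip it
        target = archive.with_suffix("")  # e.g. data/TunSwitchCS
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted


def download_subset(names: list[str], local_dir: str = "data") -> list[Path]:
    """Fetch only the listed files from the dataset repo, then unpack them."""
    # Imported here so extract_archives works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="tunis-ai/TunSwitch",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=names,  # assumption: files live at the repo root
    )
    return extract_archives(local_dir, names)


# Example call (needs network access):
# download_subset(["TunSwitchCS.zip", "test_wavs.zip"])
```

Archives that were not downloaded are skipped rather than raising an error, so the same list of names can be reused across partial downloads.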
More details about the data are available in the paper. The current tables are in a SpeechBrain-friendly format; the path column is not meaningful as shipped and has to be changed according to your local setup. Please use the provided train-dev-test splits if you work with this dataset.
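Adapting the path column to a local setup can be scripted. The sketch below assumes the tables are CSV files, that the column is literally named `path`, and that the audio files sit directly inside one local folder; none of this is confirmed by the card, so check the actual headers and folder layout first. `rewrite_paths` is a hypothetical helper, not part of SpeechBrain.

```python
import csv
from pathlib import Path


def rewrite_paths(csv_in, csv_out, audio_root, column="path"):
    """Copy a manifest, pointing `column` at files under `audio_root`.

    Assumption: only the file name of each original path is kept, i.e. the
    audio files are stored flat inside `audio_root`.
    """
    with open(csv_in, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = list(reader)

    if column not in fieldnames:
        raise KeyError(f"column {column!r} not found in {csv_in}")

    for row in rows:
        # Drop the original directory part, whatever separator it used.
        filename = row[column].replace("\\", "/").rsplit("/", 1)[-1]
        row[column] = str(Path(audio_root) / filename)

    with open(csv_out, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Writing to a new file instead of editing in place keeps the provided train-dev-test tables untouched, so the original splits stay available for reference.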
Models trained and tested on this dataset are available, along with Space demos.
If you use or refer to this dataset, please cite the paper above:
## Citation
```
@misc{abdallah2023leveraging,
title={Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition},
author={Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah Zaiem},
year={2023},
eprint={2309.11327},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
```
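Provider:
tunis-ai
Summary of the original information
Dataset overview
Dataset name
- TunSwitch
Dataset contents
- Audio data
  - TunSwitchCS.zip: annotated code-switched data.
  - TunSwitchTO.zip: annotated Tunisian-only data.
  - weakly_labeled_tn.zip: weakly labeled (or unlabeled) audio data. The audio may contain code-switching, but the current weak labels do not.
  - test_wavs.zip: annotated test data, split into a code-switched part and a Tunisian-only part.
- Text data
  - TextData.zip: text data used for language modelling.
- Language-detailed annotation
  - language_annotation.zip: language-detailed annotation of TunSwitchCS.
Dataset purpose
- Developing and testing a Tunisian Arabic automatic speech recognition model.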



