five

TunSwitch: Code-Switched Tunisian Arabic Speech Dataset

收藏
NIAID Data Ecosystem2026-05-02 收录
下载链接:
https://zenodo.org/record/8342761
下载链接
链接失效反馈
官方服务:
资源简介:
We developed a tool for collecting Tunisian dialect data, prompting users to record themselves reading provided phrases. We sourced sentences from Tunisiya. These sentences are consequently removed from the LM training corpus. 89 persons have participated leading to the collection of 2631 distinct phrases. This set will be called TunSwitch TO, ``TO" standing for Tunisian Only, as these sentences do not have non-Tunisian words.  In response to the limited availability of paired Text-Speech Tunisian datasets with  code-switching, we have built a  corpus through meticulous manual annotation. Whenever encountered, French and English  words are enclosed  within "<>" tags, and left Tunisian words without any enclosing tags. While these tags have not been used in the proposed models, they allow to have language-usage statistics  and may be useful for further approaches handling code-switching. The resulting set is released as TunSwitch CS, ``CS" standing for Code-Switched. The TunSwitch CS dataset samples come from a set of radio shows and podcasts, representing diverse topics and a large number of unique speakers. The audio are first segmented into chunks, prioritizing word integrity using the WebRTC-VAD algorithm for silence detection. Afterward, we used a Pyannote overlap detection model to remove overlapping speech sections. Then, a music detection model is employed to eliminate music-containing chunks that could disrupt ASR model accuracy.
创建时间:
2024-07-11
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作