
TuniSpeech-AI/TuniSpeech-21h

Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TuniSpeech-AI/TuniSpeech-21h
Resource description:
---
language:
- aeb
- ar
license: cc-by-nc-sa-4.0
size_categories:
- 10K<n<100K
task_categories:
- automatic-speech-recognition
tags:
- tunisian-arabic
- derja
- speech-recognition
- tunispeech
- asr
---

# TuniSpeech-21h

## Summary

**TuniSpeech-21h** is a 21-hour speech corpus specifically designed for **Tunisian Arabic (Derja)**. It was developed to address the underrepresentation of this dialect in the landscape of Automatic Speech Recognition (ASR). The dataset is compiled from social media (YouTube and Facebook) and broadcast materials, capturing a wide range of spontaneous speech and diverse linguistic characteristics.

## Dataset Description

- **Language:** Tunisian Arabic (Derja).
- **Total Duration:** 21 hours, 6 minutes, and 57 seconds.
- **Domains:** Includes politics, culture, sports, religion, music, stories, and everyday interactions.
- **Speakers:** 187 unique speakers (120 males, 67 females).
- **Speaker Age:** Mostly adults aged between 18 and 60 years.
- **Format:** High-quality WAV files, uncompressed and resampled to 16 kHz.

## Dataset Statistics

| Feature | Count |
| :--- | :--- |
| **Total Segments** | 32,294 |
| **Total Words** | 170,488 |
| **Total Unique Words** | 34,866 |
| **Average Utterance Length** | 5.281 words |
| **Mean Clip Duration** | 7 seconds |
| **Max Clip Duration** | 10 seconds |

## Creation Pipeline

The dataset was developed through a transparent and reproducible pipeline:

1. **Data Collection:** Content was sourced from YouTube and Facebook to ensure regional and demographic diversity.
2. **Preprocessing:** Includes audio normalization, noise reduction, and the removal of overlapping speech.
3. **Segmentation:** Automatic segmentation was performed based on natural pauses in speech, followed by manual verification.
4. **Transcription:** A semi-automatic approach was used: initial transcription via "TurboScribe" (Whisper-based) followed by manual review and correction using "Subtitle Edit".
5. **Transcription Style:** Clean Verbatim Transcription, which preserves the original message while omitting hesitations and filler words.

## Baseline Benchmark

Several state-of-the-art ASR models were evaluated on this corpus. **Whisper large-v2** currently holds the best overall performance on TuniSpeech-21h:

- **Word Error Rate (WER):** 24.74%
- **Character Error Rate (CER):** 8.32%
- **Mixed Error Rate (MER):** 16.53%

## Citation

If you use this dataset in your research, please cite the following paper:

> Sghaier, Mohamed Ali; Bellagha, Mohamed Lazhar and Zrigui, Mounir (2026). A New Tunisian Arabic Corpus and Benchmark for Automatic Speech Recognition. In Proceedings of the 18th International Conference on Agents and Artificial Intelligence.
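
## Example: Computing WER and CER

The benchmark figures above are error rates based on edit distance. The following is a minimal sketch of how WER and CER are commonly computed, not the authors' evaluation script. The function names (`edit_distance`, `wer`, `cer`) and the sample strings are illustrative assumptions, as is the choice to count spaces as characters in CER after collapsing repeated whitespace. Any text normalization the authors applied before scoring is not described here and is not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.
    Assumption: whitespace is collapsed and spaces count as characters."""
    ref_chars = list(" ".join(reference.split()))
    hyp_chars = list(" ".join(hypothesis.split()))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical transcript pair, for illustration only.
    reference = "one two three four"
    hypothesis = "one too three four"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

Corpus-level scores are usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.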
Provider:
TuniSpeech-AI