TuniSpeech-AI/TuniSpeech-21h
Hugging Face · Updated 2026-03-31 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/TuniSpeech-AI/TuniSpeech-21h
Dataset description:
---
language:
- aeb
- ar
license: cc-by-nc-sa-4.0
size_categories:
- 10K<n<100K
task_categories:
- automatic-speech-recognition
tags:
- tunisian-arabic
- derja
- speech-recognition
- tunispeech
- asr
---
# TuniSpeech-21h
## Summary
**TuniSpeech-21h** is a 21-hour speech corpus for **Tunisian Arabic (Derja)**, built to address the underrepresentation of this dialect in Automatic Speech Recognition (ASR) resources. The audio is compiled from social media (YouTube and Facebook) and broadcast material, and it captures spontaneous speech with diverse linguistic characteristics.
## Dataset Description
- **Language:** Tunisian Arabic (Derja).
- **Total Duration:** 21 hours, 6 minutes, and 57 seconds.
- **Domains:** Includes politics, culture, sports, religion, music, stories, and everyday interactions.
- **Speakers:** 187 unique speakers (120 males, 67 females).
- **Speaker Age:** Mostly adults aged between 18 and 60 years.
- **Format:** High-quality WAV files, uncompressed and resampled to 16 kHz.
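A downloaded clip can be checked against the stated format with Python's standard `wave` module. The sketch below is illustrative only: the helper names `describe_wav` and `matches_card` are hypothetical, the 10-second limit is taken from the statistics table, and the mono 16-bit layout of the synthetic demo clip is an assumption, since the card does not state the channel count or bit depth.

```python
import io
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate stated in the card
MAX_CLIP_SECONDS = 10.0    # maximum clip duration stated in the card


def describe_wav(source):
    """Return (sample_rate_hz, duration_seconds) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def matches_card(source):
    """True if the clip is 16 kHz and no longer than the stated maximum."""
    rate, duration = describe_wav(source)
    return rate == EXPECTED_RATE_HZ and duration <= MAX_CLIP_SECONDS


# Demo on a synthetic 2-second silent clip (mono, 16-bit: assumed layout).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(EXPECTED_RATE_HZ)
    wav.writeframes(b"\x00\x00" * (2 * EXPECTED_RATE_HZ))

buffer.seek(0)
print(describe_wav(buffer))  # (16000, 2.0)
```

For a real clip, pass the file path instead of the in-memory buffer.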
## Dataset Statistics
| Feature | Count |
| :--- | :--- |
| **Total Segments** | 32,294 |
| **Total Words** | 170,488 |
| **Total Unique Words** | 34,866 |
| **Average Utterance Length** | 5.281 words |
| **Mean Clip Duration** | 7 seconds |
| **Max Clip Duration** | 10 seconds |
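The totals reported above allow a quick cross-check of the per-segment averages. The sketch below uses only numbers from this card; the variable names are illustrative. Dividing total words by total segments gives about 5.279 words, close to the reported 5.281. Dividing total duration by total segments gives about 2.35 seconds per segment, which is well below the reported mean clip duration of 7 seconds. The card does not say how its averages were computed, so this difference is left as an open question here.

```python
# Figures copied from the dataset card.
TOTAL_SECONDS = 21 * 3600 + 6 * 60 + 57  # 21 h 6 min 57 s
TOTAL_SEGMENTS = 32_294
TOTAL_WORDS = 170_488
UNIQUE_WORDS = 34_866

words_per_segment = TOTAL_WORDS / TOTAL_SEGMENTS
seconds_per_segment = TOTAL_SECONDS / TOTAL_SEGMENTS
type_token_ratio = UNIQUE_WORDS / TOTAL_WORDS  # vocabulary size relative to corpus size

print(f"{words_per_segment:.3f}")    # 5.279
print(f"{seconds_per_segment:.3f}")  # 2.354
print(f"{type_token_ratio:.3f}")     # 0.205
```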
## Creation Pipeline
The dataset was developed through a transparent and reproducible pipeline:
1. **Data Collection:** Content was sourced from YouTube and Facebook to ensure regional and demographic diversity.
2. **Preprocessing:** Includes audio normalization, noise reduction, and the removal of overlapping speech.
3. **Segmentation:** Automatic segmentation was performed based on natural pauses in speech, followed by manual verification.
4. **Transcription:** A semi-automatic approach was used: initial transcription via "TurboScribe" (Whisper-based) followed by manual review and correction using "Subtitle Edit".
5. **Transcription Style:** Clean Verbatim Transcription, which preserves the original message while omitting hesitations and filler words.
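The segmentation step (splitting on natural pauses) can be illustrated with a generic energy-threshold approach. This is a minimal sketch, not the authors' method: the card does not name the segmentation tool or its settings, so the function `split_on_pauses` and the parameters `frame_ms`, `threshold`, and `min_pause_ms` are all assumptions chosen for illustration.

```python
def split_on_pauses(samples, rate, frame_ms=20, threshold=500, min_pause_ms=300):
    """Return (start_s, end_s) speech spans separated by pauses of at least min_pause_ms.

    samples: sequence of integer PCM amplitudes; rate: samples per second.
    A frame counts as speech when its mean absolute amplitude reaches threshold.
    """
    frame_len = rate * frame_ms // 1000
    min_pause_frames = min_pause_ms // frame_ms
    n_frames = len(samples) // frame_len
    spans = []
    start = None    # index of the first frame of the current speech span
    silent_run = 0  # consecutive silent frames seen inside the current span

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        loud = sum(abs(s) for s in frame) / frame_len >= threshold
        if loud:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                end = i - silent_run + 1  # first frame of the pause
                spans.append((start * frame_ms / 1000, end * frame_ms / 1000))
                start = None
                silent_run = 0

    if start is not None:  # audio ended while a span was still open
        end = n_frames - silent_run
        spans.append((start * frame_ms / 1000, end * frame_ms / 1000))
    return spans


# Synthetic signal: 1 s of "speech", 0.5 s of silence, 1 s of "speech" at 16 kHz.
demo = [5000] * 16_000 + [0] * 8_000 + [5000] * 16_000
print(split_on_pauses(demo, 16_000))  # [(0.0, 1.0), (1.5, 2.5)]
```

In practice the resulting spans would still need the manual verification the pipeline describes, and long spans would have to be split further to respect the 10-second maximum.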
## Baseline Benchmark
Several state-of-the-art ASR models were evaluated on this corpus. **Whisper large-v2** currently holds the best overall performance on TuniSpeech-21h:
- **Word Error Rate (WER):** 24.74%
- **Character Error Rate (CER):** 8.32%
- **Mixed Error Rate (MER):** 16.53%
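For readers unfamiliar with these metrics, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below is a plain-Python illustration on toy English strings, not the evaluation code behind the numbers above. It assumes whitespace tokenization for WER and counts spaces as characters for CER; the card does not specify either convention, nor how MER is computed, so MER is not covered.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance; the reference must be non-empty."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance; spaces count as characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(round(wer(reference, hypothesis), 4))  # 0.3333
print(round(cer(reference, hypothesis), 4))  # 0.2273
```

Corpus-level scores are usually computed by summing edits and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.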
## Citation
If you use this dataset in your research, please cite the following paper:
> Sghaier, Mohamed Ali; Bellagha, Mohamed Lazhar and Zrigui, Mounir (2026). A New Tunisian Arabic Corpus and Benchmark for Automatic Speech Recognition. In Proceedings of the 18th International Conference on Agents and Artificial Intelligence.
Provider:
TuniSpeech-AI



