SilencioNetwork/swahili-speech
Source: Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/swahili-speech
Resource description:
---
license: cc-by-nc-4.0
language:
- sw
task_categories:
- automatic-speech-recognition
- audio-classification
- text-to-speech
tags:
- swahili
- kiswahili
- east-africa
- kenya
- tanzania
- uganda
- rwanda
- burundi
- drc
- african-languages
- low-resource
- speech-data
- voice-ai
- asr
- tts
- africa
pretty_name: "🇰🇪 Swahili Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: swahili_kenya
  data_files:
  - split: free_speech
    path: swahili_kenya/free_speech/**
size_categories:
- n<1K
---
# 🇰🇪 Swahili Speech Dataset
<p align="left">
<img src="https://cdn-uploads.huggingface.co/production/uploads/69162b50b89e7abe20de4b5a/LWhs4p2lPFcyiVsP0tluu.png" width="40%">
</p>
[Website](https://www.silencioai.com) · [Email](mailto:sofia@silencioai.com)

---
> **🌍 Swahili — The lingua franca of East Africa.**
>
> Spoken by **200+ million people** across Kenya, Tanzania, Uganda, Rwanda, Burundi, and the DRC.
>
> **📧 Need more?** [sofia@silencioai.com](mailto:sofia@silencioai.com) — we have **9,786 hours** of Swahili voice data.
---
## 🎯 Dataset Overview
**47 high-quality Swahili recordings** (~21 minutes) from native speakers across East Africa.
| Language | Speakers | Regions | Sample Size |
|----------|----------|---------|-------------|
| 🇰🇪 **Kiswahili** | Native speakers | Kenya, Tanzania, Uganda | **47 recordings** |
### Speaker Demographics
- **Gender balance:** Mixed male/female
- **Regions:** Kenyan Swahili (Nairobi, Mombasa), Tanzanian Swahili
- **Ages:** 18-60+
- **Recording quality:** Real-world mobile recordings, natural speech
---
## 🚀 Quick Start
```python
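from datasets import load_dataset

# Load the published config and split (see `configs` in the card metadata)
swahili = load_dataset(
    "SilencioNetwork/swahili-speech", "swahili_kenya", split="free_speech"
)

# Print the speaker gender and the first 50 characters of each transcript
for sample in swahili:
    audio = sample["audio"]  # audio data, as listed in the Metadata table
    transcript = sample["transcript"]
    gender = sample["gender"]
    print(f"[{gender}] {transcript[:50]}...")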
```
---
## 🌍 Why Swahili?
Swahili (Kiswahili) is one of Africa's most important languages:
- 🗣️ **200+ million speakers** across East and Central Africa
- 🇰🇪 **Official language** of Kenya, Tanzania, Uganda, Rwanda
- 💼 **Growing digital economy** — mobile banking, e-commerce booming
- 📱 **Tech adoption** — M-Pesa, voice AI demand rising
- 🌐 **Pan-African lingua franca** — Used across 10+ countries
Yet Swahili remains **severely underrepresented** in voice AI datasets.

---
## 📊 Full Data Availability
This sample (~21 minutes) is well under **1%** of our 9,786-hour Swahili corpus (about 0.004%).

| Category | This Sample | Full Corpus Available |
|----------|-------------|----------------------|
| **Swahili (Kenya)** | 47 recordings (~21 min) | **9,786 hours** |
| **Total** | **47** | **9,786 hours** |
### What We Have
- ✅ **9,786 hours** of Swahili voice data
- ✅ Native speakers from Kenya, Tanzania, Uganda
- ✅ Multiple dialects and accents
- ✅ Real-world recording conditions
- ✅ Transcriptions available
- ✅ Rich metadata (gender, age, region, emotion)
---
## 📋 Metadata
Each recording includes:
| Field | Description |
|-------|-------------|
| `file_name` | Audio file path |
| `id` | Unique recording ID |
| `audio` | Audio data (48 kHz WAV) |
| `transcript` | Swahili transcription |
| `gender` | Speaker gender (male/female) |
| `location` | Speaker location |
| `mother_tongue` | Native language |
| `dialect` | Regional dialect |
| `duration` | Recording length (seconds) |
| `emotions` | Emotion labels (joy, neutral, etc.) |
| `type_of_script` | free_speech / keywords / monologues |
| `script` | Original prompt |
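
The fields above can be used to summarize or filter recordings before training. The following is a minimal sketch that works on plain dictionaries rather than on the downloaded dataset: the three records, and the helper names `summarize` and `filter_by_duration`, are hypothetical and only mimic the documented field names.

```python
from collections import Counter

# Hypothetical records that mimic the documented fields; the values are
# invented for illustration and are not taken from the dataset.
records = [
    {"id": 1, "gender": "female", "duration": 31.5, "type_of_script": "free_speech"},
    {"id": 2, "gender": "male", "duration": 24.0, "type_of_script": "free_speech"},
    {"id": 3, "gender": "female", "duration": 18.5, "type_of_script": "free_speech"},
]


def summarize(rows):
    """Return recordings per gender and the total duration in minutes."""
    by_gender = Counter(row["gender"] for row in rows)
    total_minutes = sum(row["duration"] for row in rows) / 60.0
    return by_gender, total_minutes


def filter_by_duration(rows, min_seconds, max_seconds):
    """Keep rows whose duration lies in [min_seconds, max_seconds]."""
    return [r for r in rows if min_seconds <= r["duration"] <= max_seconds]


by_gender, total_minutes = summarize(records)
short_clips = filter_by_duration(records, 0.0, 25.0)
print(dict(by_gender), round(total_minutes, 2), [r["id"] for r in short_clips])
```

The same two helpers should work on rows iterated from the loaded dataset, assuming each row exposes `gender` and `duration` as described in the table.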
---
## 🎤 Audio Format
- **Format:** WAV
- **Sample Rate:** 48 kHz
- **Channels:** Mono
- **Recording:** Real-world conditions (mobile devices, natural environments)
- **Quality:** Professional transcription and QA
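
To confirm that a downloaded file matches the stated format, the WAV header can be inspected with Python's standard `wave` module. This is a sketch under the assumption that the files are uncompressed PCM WAV, which is what `wave` reads; the helper names `describe_wav` and `matches_card_format` are hypothetical.

```python
import wave


def describe_wav(path):
    """Read a WAV header and return (sample_rate_hz, channels, duration_s)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    return rate, channels, duration


def matches_card_format(path, expected_rate=48000, expected_channels=1):
    """True if the file has the sample rate and channel count stated above."""
    rate, channels, _ = describe_wav(path)
    return rate == expected_rate and channels == expected_channels
```

Many ASR models expect 16 kHz input, so a check like this is a useful first step before deciding whether to resample.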
---
## 🎯 Use Cases
- 🗣️ **Swahili ASR** — Speech recognition for East African markets
- 🔊 **Swahili TTS** — Voice synthesis for mobile apps, assistants
- 📱 **Voice apps** — M-Pesa, mobile banking, healthcare
- 🎓 **Education** — Language learning, literacy tools
- 📞 **Call centers** — Automated customer support in Swahili
- 📊 **Benchmarking** — Test multilingual model performance
- 🌍 **Inclusive AI** — Build voice AI that works for Africa
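
For the ASR and benchmarking use cases, a common metric is word error rate (WER): the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below is a plain-Python illustration and not part of the dataset tooling; the function name `word_error_rate` is hypothetical, and the normalization (lowercasing and whitespace splitting) is an assumption that may need adjusting for punctuation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # At the start of each outer iteration, previous[j] is the edit distance
    # between the reference words seen so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of three reference words
print(word_error_rate("habari ya asubuhi", "habari za asubuhi"))
```

In practice the reference would come from the `transcript` field and the hypothesis from the model under test.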
---
## ⚖️ License
**CC BY-NC 4.0** — Free for research and non-commercial use.
For **commercial licensing**, contact [sofia@silencioai.com](mailto:sofia@silencioai.com).

---
## 📧 Get the Full Dataset
Need more Swahili data? We can help.
| What You Need | We Provide |
|---------------|------------|
| More Swahili data | ✅ **9,786 hours** available |
| Other East African languages | ✅ Luganda, Kinyarwanda, Kikuyu |
| Kenyan English | ✅ 1,200+ hours |
| Tanzanian dialects | ✅ Available |
| Custom collection | ✅ Any East African language |
**📧 Email:** [sofia@silencioai.com](mailto:sofia@silencioai.com)

**🌐 Website:** [silencioai.com](https://www.silencioai.com)

---
## 🌟 Why Silencio Network?
- **1.5M+ active contributors** globally
- **180+ countries** represented
- **100+ languages** available
- **Real-world data** — Not synthetic, not scripted
- **Fast turnaround** — Custom collection in 2-4 weeks
- **Ethical sourcing** — Contributors are paid fairly
---
## Citation
```bibtex
@dataset{silencio_swahili_2025,
  title = {Swahili Speech Dataset},
  author = {Silencio Network},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/SilencioNetwork/swahili-speech}},
  license = {CC BY-NC 4.0}
}
```
---
## Related Datasets
- [African Languages Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/african-languages-speech) — Multi-language African data
- [Nigerian English Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/nigerian-english-speech) — Accented English from Nigeria
- [South Asian Languages](https://huggingface.co/datasets/SilencioNetwork/south-asian-speech) — Hindi, Urdu, Bengali
---
**🚀 Building voice AI for Africa? Let's talk:** [sofia@silencioai.com](mailto:sofia@silencioai.com)
Provider:
SilencioNetwork



