
SilencioNetwork/swahili-speech

Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/swahili-speech
Resource description:
---
license: cc-by-nc-4.0
language:
- sw
task_categories:
- automatic-speech-recognition
- audio-classification
- text-to-speech
tags:
- swahili
- kiswahili
- east-africa
- kenya
- tanzania
- uganda
- rwanda
- burundi
- drc
- african-languages
- low-resource
- speech-data
- voice-ai
- asr
- tts
- africa
pretty_name: "🇰🇪 Swahili Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: swahili_kenya
  data_files:
  - split: free_speech
    path: swahili_kenya/free_speech/**
size_categories:
- n<1K
---

# 🇰🇪 Swahili Speech Dataset

<p align="left">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/69162b50b89e7abe20de4b5a/LWhs4p2lPFcyiVsP0tluu.png" width="40%">
</p>

[![Website](https://img.shields.io/badge/Website-silencioai.com-blue?style=flat-square)](https://www.silencioai.com)
[![Contact](https://img.shields.io/badge/Contact-sofia@silencioai.com-green?style=flat-square)](mailto:sofia@silencioai.com)
[![Data Available](https://img.shields.io/badge/Full_Corpus-9,786_hours-orange?style=flat-square)](mailto:sofia@silencioai.com)

---

> **🌍 Swahili: the lingua franca of East Africa.**
>
> Spoken by **200+ million people** across Kenya, Tanzania, Uganda, Rwanda, Burundi, and the DRC.
>
> **📧 Need more?** [sofia@silencioai.com](mailto:sofia@silencioai.com): we have **9,786 hours** of Swahili voice data.

---

## 🎯 Dataset Overview

**47 high-quality Swahili recordings** (~21 minutes) from native speakers across East Africa.

| Language | Speakers | Regions | Sample Size |
|----------|----------|---------|-------------|
| 🇰🇪 **Kiswahili** | Native speakers | Kenya, Tanzania, Uganda | **47 recordings** |

### Speaker Demographics

- **Gender balance:** Mixed male/female
- **Regions:** Kenyan Swahili (Nairobi, Mombasa), Tanzanian Swahili
- **Ages:** 18-60+
- **Recording quality:** Real-world mobile recordings, natural speech

---

## 🚀 Quick Start

```python
from datasets import load_dataset

# Load the "swahili_kenya" config declared in the card metadata
swahili = load_dataset("SilencioNetwork/swahili-speech", "swahili_kenya")

# The card declares a single split named "free_speech" (there is no "train" split)
for sample in swahili["free_speech"]:
    audio = sample["audio"]
    transcript = sample["transcript"]
    gender = sample["gender"]
    print(f"[{gender}] {transcript[:50]}...")
```

---

## 🌍 Why Swahili?

Swahili (Kiswahili) is one of Africa's most important languages:

- 🗣️ **200+ million speakers** across East and Central Africa
- 🇰🇪 **Official language** of Kenya, Tanzania, Uganda, Rwanda
- 💼 **Growing digital economy**: mobile banking and e-commerce are booming
- 📱 **Tech adoption**: M-Pesa, rising demand for voice AI
- 🌐 **Pan-African lingua franca**: used across 10+ countries

Yet Swahili remains **severely underrepresented** in voice AI datasets.

---

## 📊 Full Data Availability

This sample is **<1%** of our Swahili corpus.

| Category | This Sample | Full Corpus Available |
|----------|-------------|----------------------|
| **Swahili (Kenya)** | 47 recordings (~21 min) | **9,786 hours** |
| **Total** | **47** | **9,786 hours** |

### What We Have

- ✅ **9,786 hours** of Swahili voice data
- ✅ Native speakers from Kenya, Tanzania, Uganda
- ✅ Multiple dialects and accents
- ✅ Real-world recording conditions
- ✅ Transcriptions available
- ✅ Rich metadata (gender, age, region, emotion)

---

## 📋 Metadata

Each recording includes:

| Field | Description |
|-------|-------------|
| `file_name` | Audio file path |
| `id` | Unique recording ID |
| `audio` | Audio data (48 kHz WAV) |
| `transcript` | Swahili transcription |
| `gender` | Speaker gender (male/female) |
| `location` | Speaker location |
| `mother_tongue` | Native language |
| `dialect` | Regional dialect |
| `duration` | Recording length (seconds) |
| `emotions` | Emotion labels (joy, neutral, etc.) |
| `type_of_script` | free_speech / keywords / monologues |
| `script` | Original prompt |

---

## 🎤 Audio Format

- **Format:** WAV
- **Sample Rate:** 48 kHz
- **Channels:** Mono
- **Recording:** Real-world conditions (mobile devices, natural environments)
- **Quality:** Professional transcription and QA

---

## 🎯 Use Cases

- 🗣️ **Swahili ASR**: speech recognition for East African markets
- 🔊 **Swahili TTS**: voice synthesis for mobile apps and assistants
- 📱 **Voice apps**: M-Pesa, mobile banking, healthcare
- 🎓 **Education**: language learning, literacy tools
- 📞 **Call centers**: automated customer support in Swahili
- 📊 **Benchmarking**: test multilingual model performance
- 🌍 **Inclusive AI**: build voice AI that works for Africa

---

## ⚖️ License

**CC BY-NC 4.0**: free for research and non-commercial use.

For **commercial licensing**, contact [sofia@silencioai.com](mailto:sofia@silencioai.com).

---

## 📧 Get the Full Dataset

Need more Swahili data? We can help.

| What You Need | We Provide |
|---------------|------------|
| More Swahili data | ✅ **9,786 hours** available |
| Other East African languages | ✅ Luganda, Kinyarwanda, Kikuyu |
| Kenyan English | ✅ 1,200+ hours |
| Tanzanian dialects | ✅ Available |
| Custom collection | ✅ Any East African language |

**📧 Email:** [sofia@silencioai.com](mailto:sofia@silencioai.com)

**🌐 Website:** [silencioai.com](https://www.silencioai.com)

---

## 🌟 Why Silencio Network?

- **1.5M+ active contributors** globally
- **180+ countries** represented
- **100+ languages** available
- **Real-world data**: not synthetic, not scripted
- **Fast turnaround**: custom collection in 2-4 weeks
- **Ethical sourcing**: contributors are paid fairly

---

## Citation

```bibtex
@dataset{silencio_swahili_2025,
  title = {Swahili Speech Dataset},
  author = {Silencio Network},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/SilencioNetwork/swahili-speech}},
  license = {CC BY-NC 4.0}
}
```

---

## Related Datasets

- [African Languages Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/african-languages-speech): multi-language African data
- [Nigerian English Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/nigerian-english-speech): accented English from Nigeria
- [South Asian Languages](https://huggingface.co/datasets/SilencioNetwork/south-asian-speech): Hindi, Urdu, Bengali

---

**🚀 Building voice AI for Africa? Let's talk:** [sofia@silencioai.com](mailto:sofia@silencioai.com)
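
The `duration` and `gender` fields described in the card's metadata lend themselves to a quick sanity check of the sample (clip count, total minutes, gender split) before any training run. The following is a minimal sketch, assuming each sample behaves like a plain dict with those two keys and that `duration` is in seconds as the card states; the helper name `summarize_recordings` and the demo records are hypothetical and are not part of the dataset.

```python
from collections import Counter
from typing import Iterable, Mapping


def summarize_recordings(samples: Iterable[Mapping]) -> dict:
    """Aggregate clip count, total duration, and per-gender counts.

    Assumes every sample exposes `duration` (seconds) and `gender`,
    as listed in the dataset card.
    """
    count = 0
    total_seconds = 0.0
    genders = Counter()
    for sample in samples:
        count += 1
        total_seconds += float(sample["duration"])
        genders[sample["gender"]] += 1
    return {
        "recordings": count,
        "total_minutes": total_seconds / 60.0,
        "by_gender": dict(genders),
    }


# Hypothetical stand-in records (not real rows from the dataset).
demo = [
    {"gender": "female", "duration": 30.0},
    {"gender": "male", "duration": 45.0},
    {"gender": "female", "duration": 15.0},
]
summary = summarize_recordings(demo)
print(summary)
```

With the real data, the same function could be fed the `free_speech` split returned by `load_dataset`. Selecting only the metadata columns first, for example with `Dataset.select_columns(["gender", "duration"])`, should avoid decoding audio during the pass. If the card's figures hold, the result should be close to 47 recordings and roughly 21 minutes.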
Provider:
SilencioNetwork