SilencioNetwork/yoruba-speech
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/yoruba-speech
Dataset description:
---
license: cc-by-nc-4.0
language:
- yo
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- yoruba
- nigerian-languages
- west-africa
- nigeria
- tonal-language
- african-languages
- low-resource
- speech-data
- voice-ai
- asr
- tts
pretty_name: "Yoruba Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: yoruba_nigeria
  data_files:
  - split: free_speech
    path: yoruba_nigeria/free_speech/**
size_categories:
- n<1K
---
# Yoruba Speech Dataset
**A sample of natural, real-world Yoruba speech on Hugging Face - 39 recordings from native speakers in Nigeria, drawn from Silencio's larger Yoruba inventory.**
## Dataset Overview
- **Total audio samples**: 39 recordings
- **Total duration**: ~22 minutes
- **Primary region**: Nigeria (Southwest - Ibadan, Lagos)
- **Context**: Natural spontaneous speech (free_speech)
- **Audio format**: WAV files
- **Sample rate**: 48 kHz
- **License**: CC BY-NC 4.0 (free for research, non-commercial use)
## Language Context
**Yoruba (Èdè Yorùbá)** is one of Africa's major languages:
- **Speakers**: 45M+ native speakers
- **Geographic spread**: Nigeria (Southwest), Benin, Togo, diaspora (UK, US, Brazil)
- **Tonal language**: 3 tones (high, mid, low) - essential for meaning
- **Niger-Congo family**: Related to Igbo and Edo within the Volta-Niger branch
- **Cultural significance**: Rich oral literature, proverbs, music tradition (Afrobeat)
- **Digital presence**: Growing use in social media, YouTube, voice apps
## Target Applications
This dataset is designed for:
- **Yoruba ASR systems** - Speech recognition for 45M+ speakers
- **Voice assistants** - Banking, customer service, government services in Nigeria
- **TTS for Yoruba** - Text-to-speech with authentic Nigerian pronunciation
- **Language learning apps** - Pronunciation training (including tonal patterns)
- **Content moderation** - Social media platforms operating in Nigeria
- **Cultural preservation** - Digitizing Yoruba oral traditions, music, stories
## Dataset Structure
```
yoruba-speech/
└── data/
    ├── audio/          # 39 WAV files
    └── metadata.csv    # Speaker metadata & transcripts
```
## Data Splits
### Yoruba (Nigeria)
- **Files**: 39 recordings
- **Dialect**: Primarily Standard Yoruba (Ibadan variant)
- **Context**: Natural spontaneous speech
- **Use case**: General-purpose Yoruba ASR, Nigerian voice AI
## Languages Sampled in This Dataset ✅
39 audio samples available for immediate download:
- **Yoruba**: 39 files (~22 minutes)
## Full OTS Inventory Available 📊
This sample represents roughly **0.02%** of Silencio's complete Yoruba speech inventory (~22 minutes out of 1,917+ hours).
Contact us for access to our full Yoruba corpus:
**Yoruba by Country:**
- **Nigeria**: 1,884 hours, 220,362 recordings
- **United States**: 9 hours, 977 recordings
- **Benin**: 6 hours, 875 recordings
- **United Kingdom**: 3 hours, 335 recordings
- **Ghana**: 3 hours, 403 recordings
- **Kenya**: 2 hours, 295 recordings
- **American Samoa**: 1 hour, 191 recordings
- **Niger**: 1 hour, 173 recordings
- **Andorra**: 1 hour, 161 recordings
- **+ 15 more countries** (diaspora communities)
**Total**: **1,917+ hours** across **224,000+ recordings**
**Contact us for access**: [sofia@silencioai.com](mailto:sofia@silencioai.com)
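
As a quick sanity check on how small this sample is relative to the full inventory, the share can be worked out from the figures quoted on this card. The sketch below is only arithmetic on those numbers; the inventory totals are stated as lower bounds ("1,917+", "224,000+"), so the resulting percentages are upper bounds, and the constant and function names are made up for illustration.

```python
# Figures quoted on this card; the inventory totals are lower bounds.
SAMPLE_MINUTES = 22
SAMPLE_RECORDINGS = 39
INVENTORY_HOURS = 1_917
INVENTORY_RECORDINGS = 224_000


def share_percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of the inventory covered by this sample, by duration and by count.
duration_share = share_percent(SAMPLE_MINUTES / 60, INVENTORY_HOURS)
count_share = share_percent(SAMPLE_RECORDINGS, INVENTORY_RECORDINGS)

# Average length of one recording in the sample, in seconds.
mean_clip_seconds = SAMPLE_MINUTES * 60 / SAMPLE_RECORDINGS

print(f"Share by duration:   {duration_share:.3f}%")
print(f"Share by recordings: {count_share:.3f}%")
print(f"Mean clip length:    {mean_clip_seconds:.1f} s")
```

Both shares come out at about 0.02%, and the average recording is a little over half a minute long.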
## Key Features
✅ **Native speakers** - Authentic Nigerian Yoruba (Southwest region)
✅ **Natural speech** - Real conversational Yoruba, not scripted
✅ **Tonal language** - Captures high, mid, low tone distinctions
✅ **Diverse topics** - Daily life, opinions, cultural topics
✅ **Standard dialect** - Ibadan variant (widely understood)
✅ **High audio quality** - 48 kHz WAV format
✅ **Rich metadata** - Gender, dialect, emotions, transcriptions
✅ **Ethical data collection** - Consent-based, privacy-preserving
## Use Cases
### 1. Yoruba Speech Recognition
Build ASR systems for the 45M+ Yoruba-speaking market in Nigeria, Benin, Togo, and the diaspora.
### 2. Voice Banking & Fintech
Power voice-enabled banking apps and financial services in Southwest Nigeria (Lagos, Ibadan, Abeokuta).
### 3. Yoruba TTS
Train text-to-speech models with authentic Nigerian Yoruba pronunciation and tonal patterns.
### 4. Content Moderation
Build speech detection for Nigerian social media platforms (Nairaland, local Facebook groups).
### 5. Language Learning
Develop pronunciation training tools for Yoruba learners (especially tone recognition).
### 6. Cultural Preservation
Digitize Yoruba oral traditions, proverbs (òwe), folk stories (ìtàn), and music.
## Loading the Dataset
```python
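from datasets import load_dataset

# Load the Yoruba sample. The card's YAML declares one config
# ("yoruba_nigeria") with one split ("free_speech"); there is no "train" split.
dataset = load_dataset(
    "SilencioNetwork/yoruba-speech", "yoruba_nigeria", split="free_speech"
)

# Access samples
for sample in dataset:
    # "audio" is not in the declared feature list, so read it defensively;
    # the audio file path is also available under "file_name".
    audio = sample.get("audio")
    transcript = sample["transcript"]
    dialect = sample["dialect"]
    print(f"Transcript: {transcript}")
    print(f"Dialect: {dialect}")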
```
## Sample Metadata
Each recording includes:
- `file_name`: Audio file path
- `id`: Unique recording ID
- `gender`: Speaker gender
- `location`: Speaker location
- `mother_tongue`: Native language (Yoruba)
- `dialect`: Regional variant (Nigeria - Ibadan)
- `duration`: Recording length (seconds)
- `emotions`: Emotion labels (focused, relaxed, excited, etc.)
- `language`: Yoruba
- `type_of_script`: free_speech (spontaneous, unscripted)
- `transcript`: Whisper-generated transcription (Yoruba text)
- `script`: Original prompt (question asked in Yoruba)
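
The fields above can also be summarized without loading any audio. The sketch below is a minimal example that totals `duration` per value of another column using only the standard library. The CSV rows are hypothetical and written for illustration only (they are not taken from the dataset), and the assumption that `metadata.csv` uses exactly these column headers and value formats comes from the field list above, not from inspecting the file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; they are NOT real dataset entries.
SAMPLE_CSV = """file_name,gender,dialect,duration,type_of_script
audio/example_001.wav,female,Nigeria - Ibadan,31.5,free_speech
audio/example_002.wav,male,Nigeria - Ibadan,28.0,free_speech
audio/example_003.wav,female,Nigeria - Ibadan,40.5,free_speech
"""


def seconds_by_field(csv_text: str, field: str) -> dict[str, float]:
    """Sum the `duration` column (seconds) for each distinct value of `field`."""
    totals: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[field]] += float(row["duration"])
    return dict(totals)


totals = seconds_by_field(SAMPLE_CSV, "gender")
for gender, seconds in sorted(totals.items()):
    print(f"{gender}: {seconds:.1f} s")
```

To run this against the real file, pass the contents of `metadata.csv` instead of `SAMPLE_CSV`; the same function works for `dialect`, `emotions`, or any other string column.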
## Yoruba Speech Characteristics
This dataset captures authentic Yoruba speech features:
- **Tonal phonology**: 3-tone system (high, mid, low) - critical for word meaning
- **Vowel harmony**: ATR (advanced tongue root) harmony patterns
- **Nasalization**: Contrastive nasal vowels and syllabic nasals
- **Syllable structure**: Primarily CV (consonant-vowel)
- **Natural prosody**: Authentic rhythm, stress, intonation
- **Real-world audio**: Mobile recordings, natural environments
## Market Context
### Nigerian Tech & Economy
- **45M+ Yoruba speakers** - One of Nigeria's 3 major languages
- **Southwest Nigeria**: Economic powerhouse (Lagos = 24M population)
- **Lagos GDP**: $136B (larger than 30+ African countries)
- **Growing smartphone penetration**: 60%+ in Southwest Nigeria
- **Digital payment revolution**: Voice commands for fintech emerging
- **YouTube/TikTok**: Growing Yoruba content ecosystem
### Why Yoruba Matters
- **Underrepresented in AI**: <0.1% of speech datasets despite 45M+ speakers
- **High commercial value**: Banking, telecom, e-commerce in Southwest Nigeria
- **Cultural significance**: Rich oral tradition, Afrobeat music (Fela Kuti, Burna Boy)
- **Government use**: Lagos State increasingly using Yoruba in public services
- **Diaspora market**: Large communities in UK, US, Brazil
## Tonal Language Considerations
Yoruba is a **tonal language** - pitch changes word meaning:
- **ọkọ** (mid-mid) = husband
- **ọkọ́** (mid-high) = hoe (farming tool)
- **ọkọ̀** (mid-low) = vehicle
ASR/TTS systems need to capture these tone distinctions for accurate Yoruba processing.
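
One practical consequence for text processing: a normalization step that strips all diacritics will merge words that differ only in tone. The sketch below illustrates this with Unicode normalization from the standard library. It assumes tone is written with the combining acute (high) and grave (low) accents and that the dot below is part of the letter; other conventions, such as a macron for mid tone or a vertical line below, are not handled. The function name and the test words are illustrative and not part of the dataset tooling.

```python
import unicodedata

# Combining marks assumed to carry tone: acute (high) and grave (low).
# Mid tone is unmarked. The dot below (U+0323) belongs to the letter itself.
TONE_MARKS = {"\u0301", "\u0300"}


def strip_tones(text: str) -> str:
    """Remove tone accents but keep the dot below that separates ọ from o."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


# Three spellings that differ only in the tone on the final vowel:
# unmarked (mid), acute (high), grave (low).
words = ["\u1ecdk\u1ecd", "\u1ecdk\u1ecd\u0301", "\u1ecdk\u1ecd\u0300"]
collapsed = {strip_tones(w) for w in words}

print("distinct before:", len(set(words)))
print("distinct after: ", len(collapsed))
```

If transcripts are normalized this way before training or scoring, tone contrasts disappear from the text side of the pipeline, so it is worth deciding explicitly whether tone marks are kept.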
## Ethical Considerations
All data was collected with explicit informed consent from native Yoruba speakers. Recordings contain general conversational topics only - no sensitive personal information.
## Comparison to Other Datasets
| Dataset | Language | Hours | Speakers | Natural? |
|---------|----------|-------|----------|----------|
| LibriSpeech | English | 1,000 | 2,484 | ❌ Read speech |
| Common Voice | Yoruba | ~5 | Few | ⚠️ Read sentences |
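| **Silencio Yoruba (full inventory)** | **Yoruba** | **1,917+** | **10,000+** | **✅ Spontaneous** |
**To our knowledge, the full inventory is the largest natural Yoruba speech collection available; the sample published here covers 39 recordings (~22 minutes).**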
## Citation
If you use this dataset in your research, or in a product under a commercial license, please cite:
```bibtex
@dataset{silencio_yoruba_speech_2026,
title={Yoruba Speech Dataset},
author={Silencio Network},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/SilencioNetwork/yoruba-speech}
}
```
## Related Datasets
- [African Languages Speech](https://huggingface.co/datasets/SilencioNetwork/african-languages-speech) - 6 African languages (Swahili, Hausa, Yoruba, Igbo, Amharic, Nigerian English)
- [Complete Voice AI Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) - 39 language/accent variants
- [Indian Languages Speech](https://huggingface.co/datasets/SilencioNetwork/indian-languages-speech) - 9 Indian languages
- [Global English Accents Speech](https://huggingface.co/datasets/SilencioNetwork/global-english-accents-speech) - 20 English accent variants
## License
**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)
✅ Free for research and non-commercial use
❌ Commercial use requires licensing (contact us)
## About Silencio
Silencio is a voice AI data sourcing company with 2M+ contributors across 180+ countries. We provide scaled sourcing of real-world audio and speech data for AI labs, robotics companies, and enterprises building voice AI products.
🌐 [silencioai.com](https://www.silencioai.com)
📧 [sofia@silencioai.com](mailto:sofia@silencioai.com)
---
**Tags**: yoruba, èdè yorùbá, nigerian languages, west africa, nigeria, tonal language, african languages, low-resource languages, speech recognition, asr, tts, voice ai, natural speech, spontaneous speech, nigerian speech, lagos, ibadan
Provider: SilencioNetwork



