
SilencioNetwork/yoruba-speech

Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/yoruba-speech
Resource description:
---
license: cc-by-nc-4.0
language:
- yo
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- yoruba
- nigerian-languages
- west-africa
- nigeria
- tonal-language
- african-languages
- low-resource
- speech-data
- voice-ai
- asr
- tts
pretty_name: "Yoruba Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: yoruba_nigeria
  data_files:
  - split: free_speech
    path: yoruba_nigeria/free_speech/**
size_categories:
- n<1K
---

# Yoruba Speech Dataset

**The most comprehensive Yoruba speech dataset on HuggingFace - natural, real-world Yoruba from native speakers in Nigeria and the diaspora.**

## Dataset Overview

- **Total audio samples**: 39 recordings
- **Total duration**: ~22 minutes
- **Primary region**: Nigeria (Southwest - Ibadan, Lagos)
- **Context**: Natural spontaneous speech (free_speech)
- **Audio format**: WAV files
- **Sample rate**: 48 kHz
- **License**: CC BY-NC 4.0 (free for research, non-commercial use)

## Language Context

**Yoruba (Èdè Yorùbá)** is one of Africa's major languages:

- **Speakers**: 45M+ native speakers
- **Geographic spread**: Nigeria (Southwest), Benin, Togo, diaspora (UK, US, Brazil)
- **Tonal language**: 3 tones (high, mid, low) - essential for meaning
- **Niger-Congo family**: Closely related to Igbo, Edo
- **Cultural significance**: Rich oral literature, proverbs, music tradition (Afrobeat)
- **Digital presence**: Growing use in social media, YouTube, voice apps

## Target Applications

This dataset is designed for:

- **Yoruba ASR systems** - Speech recognition for 45M+ speakers
- **Voice assistants** - Banking, customer service, government services in Nigeria
- **TTS for Yoruba** - Text-to-speech with authentic Nigerian pronunciation
- **Language learning apps** - Pronunciation training (including tonal patterns)
- **Content moderation** - Social media platforms operating in Nigeria
- **Cultural preservation** - Digitizing Yoruba oral traditions, music, stories

## Dataset Structure

```
yoruba-speech/
└── data/
    ├── audio/          # 39 WAV files
    └── metadata.csv    # Speaker metadata & transcripts
```

## Data Splits

### Yoruba (Nigeria)

- **Files**: 39 recordings
- **Dialect**: Primarily Standard Yoruba (Ibadan variant)
- **Context**: Natural spontaneous speech
- **Use case**: General-purpose Yoruba ASR, Nigerian voice AI

## Languages Sampled in This Dataset

✅ 39 audio samples available for immediate download:

- **Yoruba**: 39 files (~22 minutes)

## Full OTS Inventory Available

📊 This sample represents **<0.08%** of Silencio's complete Yoruba speech inventory.
Contact us for access to our full Yoruba corpus:

**Yoruba by Country:**

- **Nigeria**: 1,884 hours, 220,362 recordings
- **United States**: 9 hours, 977 recordings
- **Benin**: 6 hours, 875 recordings
- **United Kingdom**: 3 hours, 335 recordings
- **Ghana**: 3 hours, 403 recordings
- **Kenya**: 2 hours, 295 recordings
- **American Samoa**: 1 hour, 191 recordings
- **Niger**: 1 hour, 173 recordings
- **Andorra**: 1 hour, 161 recordings
- **+ 15 more countries** (diaspora communities)

**Total**: **1,917+ hours** across **224,000+ recordings**

**Contact us for access**: [sofia@silencioai.com](mailto:sofia@silencioai.com)

## Key Features

- ✅ **Native speakers** - Authentic Nigerian Yoruba (Southwest region)
- ✅ **Natural speech** - Real conversational Yoruba, not scripted
- ✅ **Tonal language** - Captures high, mid, low tone distinctions
- ✅ **Diverse topics** - Daily life, opinions, cultural topics
- ✅ **Standard dialect** - Ibadan variant (widely understood)
- ✅ **High audio quality** - 48 kHz WAV format
- ✅ **Rich metadata** - Gender, dialect, emotions, transcriptions
- ✅ **Ethical data collection** - Consent-based, privacy-preserving

## Use Cases

### 1. Yoruba Speech Recognition

Build ASR systems for the 45M+ Yoruba-speaking market in Nigeria, Benin, Togo, and the diaspora.

### 2. Voice Banking & Fintech

Power voice-enabled banking apps and financial services in Southwest Nigeria (Lagos, Ibadan, Abeokuta).

### 3. Yoruba TTS

Train text-to-speech models with authentic Nigerian Yoruba pronunciation and tonal patterns.

### 4. Content Moderation

Build speech detection for Nigerian social media platforms (Nairaland, local Facebook groups).

### 5. Language Learning

Develop pronunciation training tools for Yoruba learners (especially tone recognition).

### 6. Cultural Preservation

Digitize Yoruba oral traditions, proverbs (òwe), folk stories (ìtàn), and music.
## Loading the Dataset

```python
from datasets import load_dataset

# Load the Yoruba sample dataset
dataset = load_dataset("SilencioNetwork/yoruba-speech")

# The card's YAML header declares a single split named "free_speech"
# (there is no "train" split)
for sample in dataset["free_speech"]:
    audio = sample["audio"]  # assumes the loader exposes an "audio" column
    transcript = sample["transcript"]
    dialect = sample["dialect"]
    print(f"Transcript: {transcript}")
    print(f"Dialect: {dialect}")
```

## Sample Metadata

Each recording includes:

- `file_name`: Audio file path
- `id`: Unique recording ID
- `gender`: Speaker gender
- `location`: Speaker location
- `mother_tongue`: Native language (Yoruba)
- `dialect`: Regional variant (Nigeria - Ibadan)
- `duration`: Recording length (seconds)
- `emotions`: Emotion labels (focused, relaxed, excited, etc.)
- `language`: Yoruba
- `type_of_script`: free_speech (spontaneous, unscripted)
- `transcript`: Whisper-generated transcription (Yoruba text)
- `script`: Original prompt (question asked in Yoruba)

## Yoruba Speech Characteristics

This dataset captures authentic Yoruba speech features:

- **Tonal phonology**: 3-tone system (high, mid, low) - critical for word meaning
- **Vowel harmony**: ATR (advanced tongue root) harmony patterns
- **Nasal consonants**: Distinctive nasalization
- **Syllable structure**: Primarily CV (consonant-vowel)
- **Natural prosody**: Authentic rhythm, stress, intonation
- **Real-world audio**: Mobile recordings, natural environments

## Market Context

### Nigerian Tech & Economy

- **45M+ Yoruba speakers** - One of Nigeria's 3 major languages
- **Southwest Nigeria**: Economic powerhouse (Lagos = 24M population)
- **Lagos GDP**: $136B (larger than 30+ African countries)
- **Growing smartphone penetration**: 60%+ in Southwest Nigeria
- **Digital payment revolution**: Voice commands for fintech emerging
- **YouTube/TikTok**: Growing Yoruba content ecosystem

### Why Yoruba Matters

- **Underrepresented in AI**: <0.1% of speech datasets despite 45M+ speakers
- **High commercial value**: Banking, telecom, e-commerce in Southwest Nigeria
- **Cultural significance**: Rich oral tradition, Afrobeat music (Fela Kuti, Burna Boy)
- **Government use**: Lagos State increasingly using Yoruba in public services
- **Diaspora market**: Large communities in UK, US, Brazil

## Tonal Language Considerations

Yoruba is a **tonal language** - pitch changes word meaning:

- **ọkọ** (mid-mid) = husband
- **ọkọ́** (mid-high) = hoe (farming tool)
- **ọkọ̀** (mid-low) = vehicle

ASR/TTS systems need to capture these tone distinctions for accurate Yoruba processing.

## Ethical Considerations

All data was collected with explicit informed consent from native Yoruba speakers. Recordings contain general conversational topics only - no sensitive personal information.

## Comparison to Other Datasets

| Dataset | Language | Hours | Speakers | Natural? |
|---------|----------|-------|----------|----------|
| LibriSpeech | English | 1,000 | 2,484 | ❌ Read speech |
| Common Voice | Yoruba | ~5 | Few | ⚠️ Read sentences |
| **Silencio Yoruba** | **Yoruba** | **1,917+** | **10,000+** | **✅ Spontaneous** |

The Silencio row describes the full inventory available on request, not the 39-recording sample (~22 minutes) published here.

**This is the largest natural Yoruba speech dataset available.**

## Citation

If you use this dataset in your research or commercial product, please cite:

```bibtex
@dataset{silencio_yoruba_speech_2026,
  title={Yoruba Speech Dataset},
  author={Silencio Network},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/SilencioNetwork/yoruba-speech}
}
```

## Related Datasets

- [African Languages Speech](https://huggingface.co/datasets/SilencioNetwork/african-languages-speech) - 6 African languages (Swahili, Hausa, Yoruba, Igbo, Amharic, Nigerian English)
- [Complete Voice AI Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) - 39 language/accent variants
- [Indian Languages Speech](https://huggingface.co/datasets/SilencioNetwork/indian-languages-speech) - 9 Indian languages
- [Global English Accents Speech](https://huggingface.co/datasets/SilencioNetwork/global-english-accents-speech) - 20 English accent variants

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

✅ Free for research and non-commercial use

❌ Commercial use requires licensing (contact us)

## About Silencio

Silencio is a voice AI data sourcing company with 2M+ contributors across 180+ countries. We provide scaled sourcing of real-world audio and speech data for AI labs, robotics companies, and enterprises building voice AI products.

🌐 [silencioai.com](https://www.silencioai.com)

📧 [sofia@silencioai.com](mailto:sofia@silencioai.com)

---

**Tags**: yoruba, èdè yorùbá, nigerian languages, west africa, nigeria, tonal language, african languages, low-resource languages, speech recognition, asr, tts, voice ai, natural speech, spontaneous speech, nigerian speech, lagos, ibadan
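
The per-recording fields listed under Sample Metadata (`duration`, `gender`, `dialect`, and so on) lend themselves to a quick sanity check against the headline figures, such as 39 recordings and roughly 22 minutes of audio. The following is a minimal sketch that summarizes a metadata CSV using only the Python standard library. The three rows in `SAMPLE_CSV` are hypothetical placeholders, not values from the dataset, and the helper name `summarize` is an assumption introduced here for illustration; only the column names come from the card.

```python
import csv
import io
from collections import Counter

# Hypothetical rows for illustration only; they are NOT taken from the dataset.
# Column names follow the feature list in the card's YAML header.
SAMPLE_CSV = """file_name,gender,dialect,duration,type_of_script
a.wav,female,Nigeria - Ibadan,31.5,free_speech
b.wav,male,Nigeria - Ibadan,28.0,free_speech
c.wav,female,Nigeria - Ibadan,40.5,free_speech
"""


def summarize(csv_text):
    """Return (recording count, total minutes, recordings per gender)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # `duration` is documented as the recording length in seconds.
    total_seconds = sum(float(row["duration"]) for row in rows)
    by_gender = Counter(row["gender"] for row in rows)
    return len(rows), total_seconds / 60.0, dict(by_gender)


count, minutes, by_gender = summarize(SAMPLE_CSV)
print(f"{count} recordings, {minutes:.2f} minutes, by gender: {by_gender}")
```

To run the same check on the real metadata, pass the contents of the dataset's metadata file to `summarize` in place of `SAMPLE_CSV`; the exact file path depends on how the repository is laid out and is not assumed here.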
Provider:
SilencioNetwork