
SilencioNetwork/hausa-speech

Source: Hugging Face | Updated: 2026-04-07 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/hausa-speech
Description:
---
license: cc-by-nc-4.0
language:
- ha
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- hausa
- nigerian-languages
- west-africa
- nigeria
- niger
- african-languages
- low-resource
- speech-data
- voice-ai
- asr
- tts
pretty_name: "Hausa Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: hausa_nigeria
  data_files:
  - split: free_speech
    path: hausa_nigeria/free_speech/**
size_categories:
- n<1K
---

# Hausa Speech Dataset

**The most comprehensive Hausa speech dataset on HuggingFace - natural, real-world Hausa from native speakers across West Africa.**

## Dataset Overview

- **Total audio samples**: 42 recordings
- **Total duration**: ~25 minutes
- **Primary region**: Nigeria (Kano, Northern Nigeria)
- **Context**: Natural spontaneous speech (free_speech)
- **Audio format**: WAV files
- **Sample rate**: 48 kHz
- **License**: CC BY-NC 4.0 (free for research, non-commercial use)

## Language Context

**Hausa (هَرْشٜن هَوْس)** is one of Africa's most widely spoken languages:

- **Speakers**: 80M+ (50M native, 30M+ L2)
- **Geographic spread**: Nigeria (Northern states), Niger, Ghana, Chad, Cameroon, diaspora
- **Trade language**: West African lingua franca
- **Writing system**: Latin (Boko) and Arabic (Ajami) scripts
- **Linguistic family**: Afro-Asiatic (Chadic branch)
- **Cultural significance**: Hausa literature, Nollywood (Kannywood), Islamic scholarship
- **Digital presence**: Growing on social media, YouTube, Nigerian tech ecosystem

## Target Applications

This dataset is designed for:

- **Hausa ASR systems** - Speech recognition for 80M+ speakers
- **Voice assistants** - Nigerian tech startups, mobile banking
- **TTS for Hausa** - Text-to-speech with authentic Northern Nigerian pronunciation
- **Language learning apps** - Pronunciation training for Hausa learners
- **Content moderation** - Social media platforms operating in Nigeria/Niger
- **Transcription services** - Kannywood (Hausa film industry), radio, podcasts
- **Banking & fintech** - Voice-enabled banking in Northern Nigeria

## Dataset Structure

```
hausa-speech/
└── data/
    ├── audio/          # 42 WAV files
    └── metadata.csv    # Speaker metadata & transcripts
```

## Data Splits

### Hausa (Nigeria)

- **Files**: 42 recordings
- **Dialect**: Primarily Kano Hausa (standard/prestige dialect)
- **Context**: Natural spontaneous speech
- **Use case**: General-purpose Hausa ASR, West African voice AI

## Languages Sampled in This Dataset

✅ 42 audio samples available for immediate download:

- **Hausa**: 42 files (~25 minutes)

## Full OTS Inventory Available

📊 This sample represents **<0.16%** of Silencio's complete Hausa speech inventory.
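
The share quoted above can be sanity-checked from the figures stated in this card (~25 minutes and 42 recordings in the sample; 981+ hours and 127,000+ recordings in the full inventory). A minimal sketch, assuming those rounded figures; because the inventory totals are lower bounds, the true shares are at most the values computed here:

```python
# Figures quoted in this card; nothing here is measured from the audio itself.
SAMPLE_MINUTES = 25              # "~25 minutes" in the public sample
SAMPLE_RECORDINGS = 42
INVENTORY_HOURS = 981            # "981+ hours" in the full inventory
INVENTORY_RECORDINGS = 127_000   # "127,000+ recordings"


def share_percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


duration_share = share_percent(SAMPLE_MINUTES / 60, INVENTORY_HOURS)
recording_share = share_percent(SAMPLE_RECORDINGS, INVENTORY_RECORDINGS)

print(f"Share by duration:   {duration_share:.3f}%")
print(f"Share by recordings: {recording_share:.3f}%")
```

Both shares come out well under the stated 0.16% bound, so the claim is consistent with the card's own numbers.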
Contact us for access to our full Hausa corpus:

**Hausa by Country:**

- **Nigeria**: 964 hours, 124,481 recordings
- **United States**: 5 hours, 1,052 recordings
- **United States Minor Outlying Islands**: 3 hours, 273 recordings
- **South Africa**: 2 hours, 192 recordings
- **Algeria**: 2 hours, 230 recordings
- **American Samoa**: 1 hour, 142 recordings
- **Niger**: 1 hour, 135 recordings
- **Uganda**: 1 hour, 50 recordings
- **+ 10 more countries** (diaspora communities)

**Total**: **981+ hours** across **127,000+ recordings**

**Contact us for access**: [sofia@silencioai.com](mailto:sofia@silencioai.com)

## Key Features

- ✅ **Native speakers** - Authentic Nigerian Hausa (Kano dialect)
- ✅ **Natural speech** - Real conversational Hausa, not scripted
- ✅ **Standard dialect** - Kano variant (prestige/most widely understood)
- ✅ **Diverse topics** - Daily life, opinions, business, culture
- ✅ **High audio quality** - 48 kHz WAV format
- ✅ **Rich metadata** - Gender, dialect, emotions, transcriptions in Latin script
- ✅ **Ethical data collection** - Consent-based, privacy-preserving

## Use Cases

### 1. Hausa Speech Recognition

Build ASR systems for the 80M+ Hausa-speaking market in Nigeria, Niger, and West Africa.

### 2. Voice Banking & Fintech

Power voice-enabled mobile banking in Northern Nigeria (Kano, Kaduna, Katsina).

### 3. Hausa TTS

Train text-to-speech models with authentic Kano/Northern Nigerian Hausa pronunciation.

### 4. Content Moderation

Build speech detection for Nigerian social media platforms and Kannywood content.

### 5. Kannywood & Media

Improve automatic transcription for Hausa films, radio, podcasts (Kannywood is a major industry).

### 6. Voice Assistants

Develop Hausa-language voice assistants for Northern Nigeria's growing smartphone market.
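
For the speech recognition use case, a common first check is word error rate (WER) between a model's output and a reference transcript. The sketch below is a plain-Python illustration and not part of the dataset's tooling; the function name `word_error_rate` and the two example strings are hypothetical. Scores computed against this dataset's `transcript` field should be read with care, since the card describes those transcripts as Whisper-generated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up strings for illustration only, not rows from the dataset.
reference = "ina son karanta littafi"
hypothesis = "ina so karanta littafi yau"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")  # WER: 0.50
```

Here one substitution plus one insertion against a four-word reference gives 2/4 = 0.50. Whitespace tokenization is a simplification; a real evaluation would likely normalize punctuation and decide how to treat code-switched English or French words.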
## Loading the Dataset

```python
from datasets import load_dataset

# The card's YAML defines one config, "hausa_nigeria", with a single
# "free_speech" split; there is no "train" split.
dataset = load_dataset("SilencioNetwork/hausa-speech", "hausa_nigeria", split="free_speech")

# Access samples
for sample in dataset:
    # The declared features list `file_name`, not `audio`; a decoded `audio`
    # column is an assumption here, so .get() avoids a KeyError if it is absent.
    audio = sample.get("audio")
    transcript = sample["transcript"]
    dialect = sample["dialect"]
    print(f"Transcript: {transcript}")
    print(f"Dialect: {dialect}")
```

## Sample Metadata

Each recording includes:

- `file_name`: Audio file path
- `id`: Unique recording ID
- `gender`: Speaker gender
- `location`: Speaker location
- `mother_tongue`: Native language (Hausa)
- `dialect`: Regional variant (Nigeria - Kano)
- `duration`: Recording length (seconds)
- `emotions`: Emotion labels (happy, tired, relaxed, etc.)
- `language`: Hausa
- `type_of_script`: free_speech (spontaneous, unscripted)
- `transcript`: Whisper-generated transcription (Latin/Boko script)
- `script`: Original prompt (question asked in Hausa)

## Hausa Speech Characteristics

This dataset captures authentic Hausa speech features:

- **Tonal language**: 2-3 tone system (high, low, falling)
- **Glottalized consonants**: ejective ƙ and implosives ɓ, ɗ
- **Vowel length distinction**: Short vs long vowels change meaning
- **Arabic loanwords**: Rich Islamic vocabulary
- **Code-switching**: Natural mixing with English (Nigeria), French (Niger)
- **Natural prosody**: Authentic rhythm, stress, intonation
- **Real-world audio**: Mobile recordings, natural environments

## Market Context

### West African Tech & Economy

- **80M+ Hausa speakers** - One of Africa's top 5 languages
- **Northern Nigeria**: 100M+ population, major economic zone
- **Kano**: 4M population, commercial capital of Northern Nigeria
- **Smartphone penetration**: Growing 20%+ annually in Northern Nigeria
- **Kannywood**: Billion-dollar Hausa film industry (rival to Nollywood)
- **BBC Hausa**: Major news service reaching millions
- **Islamic scholarship**: Rich tradition of Hausa-language education

### Why Hausa Matters

- **Underrepresented in AI**: <0.1% of speech datasets despite 80M+ speakers
- **Trade language**: West African lingua franca (Nigeria, Niger, Ghana, Chad)
- **Large market**: Northern Nigeria's economy larger than many African countries
- **Cultural influence**: Hausa music, literature, film reaching across West Africa
- **Growing digital economy**: E-commerce, fintech, edtech emerging in Kano/Kaduna

## Hausa Dialects

**Kano Hausa** (represented in this dataset) is the **prestige dialect**:

- Most widely understood across Hausa-speaking regions
- Used in media (BBC Hausa, VOA Hausa, Radio France Internationale)
- Taught in schools
- Standard for written Hausa

Other major dialects: Sokoto, Katsina, Zaria (all mutually intelligible)

## Ethical Considerations

All data was collected with explicit informed consent from native Hausa speakers. Recordings contain general conversational topics only - no sensitive personal information.

## Comparison to Other Datasets

| Dataset | Language | Hours | Speakers | Natural? |
|---------|----------|-------|----------|----------|
| LibriSpeech | English | 1,000 | 2,484 | ❌ Read speech |
| Common Voice | Hausa | ~10 | Few | ⚠️ Read sentences |
| **Silencio Hausa (full inventory)** | **Hausa** | **981+** | **3,500+** | **✅ Spontaneous** |

**This is the largest natural Hausa speech dataset available.**

## Citation

If you use this dataset in your research or commercial product, please cite:

```bibtex
@dataset{silencio_hausa_speech_2026,
  title={Hausa Speech Dataset},
  author={Silencio Network},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/SilencioNetwork/hausa-speech}
}
```

## Related Datasets

- [African Languages Speech](https://huggingface.co/datasets/SilencioNetwork/african-languages-speech) - 6 African languages (Swahili, Hausa, Yoruba, Igbo, Amharic, Nigerian English)
- [Yoruba Speech](https://huggingface.co/datasets/SilencioNetwork/yoruba-speech) - 50 Yoruba samples (fellow Nigerian language)
- [Complete Voice AI Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) - 39 language/accent variants
- [Indian Languages Speech](https://huggingface.co/datasets/SilencioNetwork/indian-languages-speech) - 9 Indian languages

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

- ✅ Free for research and non-commercial use
- ❌ Commercial use requires licensing (contact us)

## About Silencio

Silencio is a voice AI data sourcing company with 2M+ contributors across 180+ countries. We provide scaled sourcing of real-world audio and speech data for AI labs, robotics companies, and enterprises building voice AI products.

🌐 [silencioai.com](https://www.silencioai.com)
📧 [sofia@silencioai.com](mailto:sofia@silencioai.com)

---

**Tags**: hausa, hausa language, nigerian languages, west africa, nigeria, niger, kano, kannywood, african languages, low-resource languages, speech recognition, asr, tts, voice ai, natural speech, spontaneous speech, nigerian speech, chadic languages
Provider:
SilencioNetwork