SilencioNetwork/hausa-speech
Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/hausa-speech
Resource description:
---
license: cc-by-nc-4.0
language:
- ha
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- hausa
- nigerian-languages
- west-africa
- nigeria
- niger
- african-languages
- low-resource
- speech-data
- voice-ai
- asr
- tts
pretty_name: "Hausa Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: hausa_nigeria
  data_files:
  - split: free_speech
    path: hausa_nigeria/free_speech/**
size_categories:
- n<1K
---
# Hausa Speech Dataset
**The most comprehensive Hausa speech dataset on Hugging Face - natural, real-world Hausa from native speakers across West Africa.**
## Dataset Overview
- **Total audio samples**: 42 recordings
- **Total duration**: ~25 minutes
- **Primary region**: Nigeria (Kano, Northern Nigeria)
- **Context**: Natural spontaneous speech (free_speech)
- **Audio format**: WAV files
- **Sample rate**: 48 kHz
- **License**: CC BY-NC 4.0 (free for research, non-commercial use)
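Before training, it can be worth confirming that a downloaded file really matches the stated format. The following is a minimal sketch using only Python's standard `wave` module. It builds a one-second silent clip in memory so it runs without the dataset. The helper name `wav_info` is hypothetical, and mono 16-bit audio is an assumption, since this card states only "WAV" and "48 kHz".

```python
import io
import wave

def wav_info(fileobj):
    """Return (sample_rate_hz, channels, duration_seconds) for a WAV file object."""
    with wave.open(fileobj, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate

# Build one second of silence in memory so the sketch runs without the dataset.
# Mono and 16-bit are assumptions; the card states only "WAV" and "48 kHz".
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(48_000)
    wav.writeframes(b"\x00\x00" * 48_000)
buffer.seek(0)

rate, channels, seconds = wav_info(buffer)
print(rate, channels, seconds)
```

To check a real recording, pass an opened file (for example `open(path, "rb")`) to `wav_info` instead of the in-memory buffer.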
## Language Context
**Hausa (هَرْشٜن هَوْس)** is one of Africa's most widely spoken languages:
- **Speakers**: 80M+ (50M native, 30M+ L2)
- **Geographic spread**: Nigeria (Northern states), Niger, Ghana, Chad, Cameroon, diaspora
- **Trade language**: West African lingua franca
- **Writing system**: Latin (Boko) and Arabic (Ajami) scripts
- **Linguistic family**: Afro-Asiatic (Chadic branch)
- **Cultural significance**: Hausa literature, Kannywood (the Hausa-language film industry), Islamic scholarship
- **Digital presence**: Growing on social media, YouTube, Nigerian tech ecosystem
## Target Applications
This dataset is designed for:
- **Hausa ASR systems** - Speech recognition for 80M+ speakers
- **Voice assistants** - Nigerian tech startups, mobile banking
- **TTS for Hausa** - Text-to-speech with authentic Northern Nigerian pronunciation
- **Language learning apps** - Pronunciation training for Hausa learners
- **Content moderation** - Social media platforms operating in Nigeria/Niger
- **Transcription services** - Kannywood (Hausa film industry), radio, podcasts
- **Banking & fintech** - Voice-enabled banking in Northern Nigeria
## Dataset Structure
```
hausa-speech/
└── data/
    ├── audio/          # 42 WAV files
    └── metadata.csv    # Speaker metadata & transcripts
```
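As a rough sketch of how this layout might be consumed without the `datasets` library, the snippet below parses metadata rows and resolves each `file_name` to an audio path. The two inline rows and their values are invented for illustration, and the assumption that `file_name` is relative to `data/audio/` should be checked against the actual CSV.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical stand-in for data/metadata.csv; real rows carry many more columns.
SAMPLE_CSV = """file_name,id,duration,transcript
rec_0001.wav,1,31.5,sannu da zuwa
rec_0002.wav,2,40.0,ina kwana
"""

def load_rows(csv_text, audio_dir="data/audio"):
    """Parse metadata rows and attach an audio path built from file_name."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["id"] = int(row["id"])
        row["duration"] = float(row["duration"])
        # Assumption: file_name is relative to the audio directory.
        row["audio_path"] = str(PurePosixPath(audio_dir) / row["file_name"])
        rows.append(row)
    return rows

rows = load_rows(SAMPLE_CSV)
print(rows[0]["audio_path"])
```

For the real file, read the CSV text with `open("data/metadata.csv", encoding="utf-8").read()` and pass it to `load_rows`.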
## Data Splits
### Hausa (Nigeria)
- **Files**: 42 recordings
- **Dialect**: Primarily Kano Hausa (standard/prestige dialect)
- **Context**: Natural spontaneous speech
- **Use case**: General-purpose Hausa ASR, West African voice AI
## Languages Sampled in This Dataset ✅
42 audio samples available for immediate download:
- **Hausa**: 42 files (~25 minutes)
## Full OTS Inventory Available 📊
This sample represents roughly **0.04%** (about 25 minutes of 981+ hours) of Silencio's complete Hausa speech inventory.
Contact us for access to our full Hausa corpus:
**Hausa by Country:**
- **Nigeria**: 964 hours, 124,481 recordings
- **United States**: 5 hours, 1,052 recordings
- **United States Minor Outlying Islands**: 3 hours, 273 recordings
- **South Africa**: 2 hours, 192 recordings
- **Algeria**: 2 hours, 230 recordings
- **American Samoa**: 1 hour, 142 recordings
- **Niger**: 1 hour, 135 recordings
- **Uganda**: 1 hour, 50 recordings
- **+ 10 more countries** (diaspora communities)
**Total**: **981+ hours** across **127,000+ recordings**
**Contact us for access**: [sofia@silencioai.com](mailto:sofia@silencioai.com)
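The relationship between this sample and the full inventory can be checked with simple arithmetic. The sketch below uses only figures quoted in this card. The sample duration is approximate and the inventory totals are lower bounds, so the results are estimates rather than exact shares.

```python
# Figures quoted in this card; the sample duration is approximate (~25 minutes).
SAMPLE_MINUTES = 25
SAMPLE_RECORDINGS = 42
INVENTORY_HOURS = 981            # stated as "981+"
INVENTORY_RECORDINGS = 127_000   # stated as "127,000+"

def share_percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

duration_share = share_percent(SAMPLE_MINUTES / 60, INVENTORY_HOURS)
recording_share = share_percent(SAMPLE_RECORDINGS, INVENTORY_RECORDINGS)

print(f"By duration:   {duration_share:.3f}%")
print(f"By recordings: {recording_share:.3f}%")
```

Because the inventory totals are stated as minimums, the true shares are at most the values printed here.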
## Key Features
✅ **Native speakers** - Authentic Nigerian Hausa (Kano dialect)
✅ **Natural speech** - Real conversational Hausa, not scripted
✅ **Standard dialect** - Kano variant (prestige/most widely understood)
✅ **Diverse topics** - Daily life, opinions, business, culture
✅ **High audio quality** - 48 kHz WAV format
✅ **Rich metadata** - Gender, dialect, emotions, transcriptions in Latin script
✅ **Ethical data collection** - Consent-based, privacy-preserving
## Use Cases
### 1. Hausa Speech Recognition
Build ASR systems for the 80M+ Hausa-speaking market in Nigeria, Niger, and West Africa.
### 2. Voice Banking & Fintech
Power voice-enabled mobile banking in Northern Nigeria (Kano, Kaduna, Katsina).
### 3. Hausa TTS
Train text-to-speech models with authentic Kano/Northern Nigerian Hausa pronunciation.
### 4. Content Moderation
Build speech detection for Nigerian social media platforms and Kannywood content.
### 5. Kannywood & Media
Improve automatic transcription for Hausa films, radio, podcasts (Kannywood is a major industry).
### 6. Voice Assistants
Develop Hausa-language voice assistants for Northern Nigeria's growing smartphone market.
## Loading the Dataset
```python
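from datasets import load_dataset

# Load the Hausa dataset. The card's YAML declares one config ("hausa_nigeria")
# with a single split named "free_speech"; there is no "train" split.
dataset = load_dataset("SilencioNetwork/hausa-speech", "hausa_nigeria")

# Access samples
for sample in dataset["free_speech"]:
    # The "audio" column is not in the declared feature list, so it is read
    # defensively here; it is None if the loader does not expose it.
    audio = sample.get("audio")
    transcript = sample["transcript"]
    dialect = sample["dialect"]
    print(f"Transcript: {transcript}")
    print(f"Dialect: {dialect}")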
```
## Sample Metadata
Each recording includes:
- `file_name`: Audio file path
- `id`: Unique recording ID
- `gender`: Speaker gender
- `location`: Speaker location
- `mother_tongue`: Native language (Hausa)
- `dialect`: Regional variant (Nigeria - Kano)
- `duration`: Recording length (seconds)
- `emotions`: Emotion labels (happy, tired, relaxed, etc.)
- `language`: Hausa
- `type_of_script`: free_speech (spontaneous, unscripted)
- `transcript`: Whisper-generated transcription (Latin/Boko script)
- `script`: Original prompt (question asked in Hausa)
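The fields above lend themselves to simple filtering and summary statistics. Here is a minimal sketch over a few hypothetical records. All values are invented, and treating `emotions` as a comma-separated string is an assumption based on its `string` dtype, not something this card confirms.

```python
from collections import Counter

# Hypothetical records mirroring a few documented fields; values are invented.
records = [
    {"id": 1, "gender": "female", "duration": 31.5, "emotions": "happy, relaxed"},
    {"id": 2, "gender": "male", "duration": 12.0, "emotions": "tired"},
    {"id": 3, "gender": "female", "duration": 44.2, "emotions": "relaxed"},
]

def split_labels(value):
    """Split a comma-separated label string into a clean list (assumed format)."""
    return [label.strip() for label in value.split(",") if label.strip()]

def filter_by_duration(rows, min_seconds, max_seconds):
    """Keep rows whose duration falls inside [min_seconds, max_seconds]."""
    return [r for r in rows if min_seconds <= r["duration"] <= max_seconds]

kept = filter_by_duration(records, 15.0, 60.0)
emotion_counts = Counter(label for r in kept for label in split_labels(r["emotions"]))
total_minutes = sum(r["duration"] for r in kept) / 60

print([r["id"] for r in kept])
print(emotion_counts)
print(f"{total_minutes:.2f} min")
```

The same functions work on rows read from `metadata.csv` once `duration` has been converted to a float.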
## Hausa Speech Characteristics
This dataset captures authentic Hausa speech features:
- **Tonal language**: Two level tones (high, low) plus a falling contour tone
- **Glottalized consonants**: Ejective ƙ and implosives ɓ, ɗ, written with hooked letters in the Boko orthography
- **Vowel length distinction**: Short vs long vowels change meaning
- **Arabic loanwords**: Rich Islamic vocabulary
- **Code-switching**: Natural mixing with English (Nigeria), French (Niger)
- **Natural prosody**: Authentic rhythm, stress, intonation
- **Real-world audio**: Mobile recordings, natural environments
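When comparing transcripts (for example, when scoring ASR output), the hooked letters listed above (ƙ, ɓ, ɗ) need care: a normalizer should preserve them by default, because they distinguish words. The sketch below is one possible normalizer, not part of the dataset tooling. The function name `normalize` and the optional `fold_hooks` switch are hypothetical. The apostrophe is kept because Boko uses it for the glottal stop.

```python
import re
import unicodedata

# Hooked letters mapped to plain counterparts (ƴ appears in some orthographies).
HOOK_FOLD = str.maketrans({"ƙ": "k", "ɓ": "b", "ɗ": "d", "ƴ": "y"})

def normalize(text, fold_hooks=False):
    """Lowercase, strip punctuation, and collapse whitespace in a transcript."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s']", " ", text)   # keep letters, digits, apostrophes
    text = re.sub(r"\s+", " ", text).strip()
    if fold_hooks:
        # Lossy: merges distinct phonemes; use only for lenient comparison.
        text = text.translate(HOOK_FOLD)
    return text

print(normalize("Ƙasar Hausa, ɗaya!"))
print(normalize("Ƙasar Hausa, ɗaya!", fold_hooks=True))
```

Folding is offered only as an option because automatically generated transcripts may drop the hooks. A strict evaluation would leave `fold_hooks` off.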
## Market Context
### West African Tech & Economy
- **80M+ Hausa speakers** - One of Africa's top 5 languages
- **Northern Nigeria**: 100M+ population, major economic zone
- **Kano**: 4M population, commercial capital of Northern Nigeria
- **Smartphone penetration**: Growing 20%+ annually in Northern Nigeria
- **Kannywood**: Billion-dollar Hausa film industry (rival to Nollywood)
- **BBC Hausa**: Major news service reaching millions
- **Islamic scholarship**: Rich tradition of Hausa-language education
### Why Hausa Matters
- **Underrepresented in AI**: <0.1% of speech datasets despite 80M+ speakers
- **Trade language**: West African lingua franca (Nigeria, Niger, Ghana, Chad)
- **Large market**: Northern Nigeria's economy larger than many African countries
- **Cultural influence**: Hausa music, literature, film reaching across West Africa
- **Growing digital economy**: E-commerce, fintech, edtech emerging in Kano/Kaduna
## Hausa Dialects
**Kano Hausa** (represented in this dataset) is the **prestige dialect**:
- Most widely understood across Hausa-speaking regions
- Used in media (BBC Hausa, VOA Hausa, Radio France Internationale)
- Taught in schools
- Standard for written Hausa
Other major dialects: Sokoto, Katsina, Zaria (all mutually intelligible)
## Ethical Considerations
All data was collected with explicit informed consent from native Hausa speakers. Recordings contain general conversational topics only - no sensitive personal information.
## Comparison to Other Datasets
| Dataset | Language | Hours | Speakers | Natural? |
|---------|----------|-------|----------|----------|
| LibriSpeech | English | 1,000 | 2,484 | ❌ Read speech |
| Common Voice | Hausa | ~10 | Few | ⚠️ Read sentences |
| **Silencio Hausa (full inventory)** | **Hausa** | **981+** | **3,500+** | **✅ Spontaneous** |
**The full Silencio inventory is the largest natural Hausa speech dataset available; this repository contains a 42-recording sample of it.**
## Citation
If you use this dataset in your research (or in a commercial product under a separate license), please cite:
```bibtex
@dataset{silencio_hausa_speech_2026,
title={Hausa Speech Dataset},
author={Silencio Network},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/SilencioNetwork/hausa-speech}
}
```
## Related Datasets
- [African Languages Speech](https://huggingface.co/datasets/SilencioNetwork/african-languages-speech) - 6 African languages (Swahili, Hausa, Yoruba, Igbo, Amharic, Nigerian English)
- [Yoruba Speech](https://huggingface.co/datasets/SilencioNetwork/yoruba-speech) - 50 Yoruba samples (fellow Nigerian language)
- [Complete Voice AI Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) - 39 language/accent variants
- [Indian Languages Speech](https://huggingface.co/datasets/SilencioNetwork/indian-languages-speech) - 9 Indian languages
## License
**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)
✅ Free for research and non-commercial use
❌ Commercial use requires licensing (contact us)
## About Silencio
Silencio is a voice AI data sourcing company with 2M+ contributors across 180+ countries. We provide scaled sourcing of real-world audio and speech data for AI labs, robotics companies, and enterprises building voice AI products.
🌐 [silencioai.com](https://www.silencioai.com)
📧 [sofia@silencioai.com](mailto:sofia@silencioai.com)
---
**Tags**: hausa, hausa language, nigerian languages, west africa, nigeria, niger, kano, kannywood, african languages, low-resource languages, speech recognition, asr, tts, voice ai, natural speech, spontaneous speech, nigerian speech, chadic languages
Provider:
SilencioNetwork



