SilencioNetwork/amharic-speech
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/amharic-speech
Resource description:
---
license: cc-by-nc-4.0
language:
- am
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- amharic
- ethiopian-languages
- east-africa
- ethiopia
- geez-script
- semitic-languages
- african-languages
- low-resource
- speech-data
- voice-ai
- asr
- tts
pretty_name: "Amharic Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: amharic_ethiopia
  data_files:
  - split: free_speech
    path: amharic_ethiopia/free_speech/**
size_categories:
- n<1K
---
# Amharic Speech Dataset
**The most comprehensive Amharic speech dataset on HuggingFace - natural, real-world Amharic from native speakers in Ethiopia and the diaspora.**
## Dataset Overview
- **Total audio samples**: 51 recordings
- **Total duration**: ~23 minutes
- **Primary region**: Ethiopia (Addis Ababa)
- **Context**: Natural spontaneous speech (free_speech)
- **Audio format**: WAV files
- **Sample rate**: 48 kHz
- **License**: CC BY-NC 4.0 (free for research, non-commercial use)
## Language Context
**Amharic (አማርኛ)** is Ethiopia's primary language:
- **Speakers**: 57M+ (32M native, 25M+ L2)
- **Official language**: Ethiopia (federal working language)
- **Geographic spread**: Ethiopia (primarily central/northern regions)
- **Ge'ez script**: Unique abugida writing system (syllabic alphabet)
- **Linguistic family**: Semitic (Afro-Asiatic) - related to Arabic, Hebrew, Tigrinya
- **Cultural significance**: Ethiopian Orthodox Christianity, Ethiopian literature, music
- **Digital presence**: Growing on social media, YouTube, Ethiopian tech ecosystem
## Target Applications
This dataset is designed for:
- **Amharic ASR systems** - Speech recognition for 57M+ speakers
- **Voice assistants** - Ethiopian tech startups, mobile banking
- **TTS for Amharic** - Text-to-speech with authentic Ethiopian pronunciation
- **Language learning apps** - Pronunciation training for Amharic learners
- **Content moderation** - Social media platforms operating in Ethiopia
- **Transcription services** - Ethiopian media, podcasts, YouTube content
- **Government services** - Voice-enabled public services in Ethiopia
## Dataset Structure
```
amharic-speech/
└── data/
    ├── audio/          # 51 WAV files
    └── metadata.csv    # Speaker metadata & transcripts
```
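As a minimal sketch of how this layout might be consumed, the snippet below parses a `metadata.csv` and summarizes it. The column names `file_name`, `duration`, and `transcript` come from the feature list in the card metadata. The two sample rows, the `audio/...` paths, and the helper name `summarize_metadata` are invented for illustration and are not taken from the real file.

```python
import csv
import io

# Hypothetical stand-in for data/metadata.csv; the rows are made up.
SAMPLE_CSV = """file_name,duration,transcript
audio/example_001.wav,25.5,ሰላም
audio/example_002.wav,30.0,እንዴት ነህ
"""


def summarize_metadata(csv_text: str) -> dict:
    """Return the number of recordings and their total duration in seconds."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total_seconds = sum(float(row["duration"]) for row in rows)
    return {"recordings": len(rows), "total_seconds": total_seconds}


summary = summarize_metadata(SAMPLE_CSV)
print(summary)
```

For the real file, read `metadata.csv` with `encoding="utf-8"` before passing its text in, so that the Ge'ez transcripts decode correctly.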
## Data Splits
### Amharic (Ethiopia)
- **Files**: 51 recordings
- **Dialect**: Primarily Addis Ababa (standard Amharic)
- **Context**: Natural spontaneous speech
- **Use case**: General-purpose Amharic ASR, Ethiopian voice AI
## Languages Sampled in This Dataset ✅
51 audio samples available for immediate download:
- **Amharic**: 51 files (~23 minutes)
## Full Off-the-Shelf (OTS) Inventory Available 📊
This sample represents well under **0.19%** of Silencio's complete Amharic speech inventory (about 23 minutes out of 1,081+ hours).
Contact us for access to our full Amharic corpus:
**Amharic by Country:**
- **Ethiopia**: 1,058 hours, 102,378 recordings
- **American Samoa**: 7 hours, 623 recordings
- **Faroe Islands**: 3 hours, 653 recordings
- **Angola**: 2 hours, 311 recordings
- **United States**: 2 hours, 260 recordings
- **Algeria**: 2 hours, 233 recordings
- **Albania**: 2 hours, 83 recordings
- **Honduras**: 1 hour, 234 recordings
- **+ 15 more countries** (diaspora communities)
**Total**: **1,081+ hours** across **105,000+ recordings**
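The share of the full inventory that this sample covers can be checked from the figures quoted in this card. The sketch below uses only those figures; because the inventory totals are stated as lower bounds ("1,081+", "105,000+"), the computed shares are upper bounds. The constant and function names are illustrative.

```python
# Figures quoted in this card.
SAMPLE_MINUTES = 23          # "~23 minutes"
SAMPLE_RECORDINGS = 51
FULL_HOURS = 1_081           # "1,081+ hours" (lower bound)
FULL_RECORDINGS = 105_000    # "105,000+ recordings" (lower bound)


def percent(part: float, whole: float) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


share_by_duration = percent(SAMPLE_MINUTES, FULL_HOURS * 60)
share_by_count = percent(SAMPLE_RECORDINGS, FULL_RECORDINGS)
mean_clip_seconds = SAMPLE_MINUTES * 60 / SAMPLE_RECORDINGS

print(f"Share by duration: {share_by_duration:.3f}%")
print(f"Share by count:    {share_by_count:.3f}%")
print(f"Mean clip length:  {mean_clip_seconds:.1f} s")
```

Both shares come out below 0.05%, and the average recording in the sample is roughly 27 seconds long.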
**Contact us for access**: [sofia@silencioai.com](mailto:sofia@silencioai.com)
## Key Features
✅ **Native speakers** - Authentic Ethiopian Amharic (Addis Ababa)
✅ **Natural speech** - Real conversational Amharic, not scripted
✅ **Standard dialect** - Addis Ababa variant (widely understood)
✅ **Diverse topics** - Daily life, opinions, technology, culture
✅ **High audio quality** - 48 kHz WAV format
✅ **Rich metadata** - Gender, dialect, emotions, transcriptions in Ge'ez script
✅ **Ethical data collection** - Consent-based, privacy-preserving
## Use Cases
### 1. Amharic Speech Recognition
Build ASR systems for the 57M+ Amharic-speaking market in Ethiopia.
### 2. Voice Banking & Fintech
Power voice-enabled mobile banking in Ethiopia (M-BIRR, HelloCash, CBE Birr).
### 3. Amharic TTS
Train text-to-speech models with authentic Ethiopian Amharic pronunciation.
### 4. Content Moderation
Build speech detection for Ethiopian social media platforms and YouTube.
### 5. Government Services
Enable voice-based public services in Ethiopia (health, education, agriculture).
### 6. Voice Assistants
Develop Amharic-language voice assistants for Ethiopia's growing smartphone market.
## Loading the Dataset
```python
from datasets import load_dataset

# Load the full Amharic sample dataset
dataset = load_dataset("SilencioNetwork/amharic-speech")

# The card metadata declares a single split named "free_speech" (there is no "train" split)
for sample in dataset["free_speech"]:
    audio = sample["audio"]  # decoded audio; assumes the loader exposes an "audio" column
    transcript = sample["transcript"]
    dialect = sample["dialect"]
    print(f"Transcript: {transcript}")
    print(f"Dialect: {dialect}")
```
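Once a sample is decoded, its length can be cross-checked against the `duration` field. The sketch below assumes the decoded `audio` value behaves like a mapping with `array` and `sampling_rate` keys and holds a mono, one-dimensional array; the exact object type depends on the installed `datasets` version. The helper names, the 0.1-second tolerance, and the synthetic clip are illustrative only.

```python
from typing import Mapping, Sequence


def audio_duration_seconds(audio: Mapping) -> float:
    """Duration of a decoded mono clip: number of samples / sampling rate."""
    samples: Sequence = audio["array"]
    return len(samples) / audio["sampling_rate"]


def duration_matches(sample: Mapping, tolerance: float = 0.1) -> bool:
    """True if the metadata `duration` agrees with the decoded audio length."""
    measured = audio_duration_seconds(sample["audio"])
    return abs(measured - sample["duration"]) <= tolerance


# Synthetic two-second clip of silence at the card's stated 48 kHz rate.
fake_sample = {
    "audio": {"array": [0.0] * (48_000 * 2), "sampling_rate": 48_000},
    "duration": 2.0,
}
print(audio_duration_seconds(fake_sample["audio"]))
print(duration_matches(fake_sample))
```

A check like this is a cheap way to spot truncated files or metadata rows that were paired with the wrong recording.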
## Sample Metadata
Each recording includes:
- `file_name`: Audio file path
- `id`: Unique recording ID
- `gender`: Speaker gender
- `location`: Speaker location
- `mother_tongue`: Native language (Amharic)
- `dialect`: Regional variant (Ethiopia - Addis Ababa)
- `duration`: Recording length (seconds)
- `emotions`: Emotion labels (happy, excited, focused, relaxed, etc.)
- `language`: Amharic
- `type_of_script`: free_speech (spontaneous, unscripted)
- `transcript`: Whisper-generated transcription (Ge'ez script)
- `script`: Original prompt (question asked in Amharic)
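As a rough sketch of how these fields might be used to prepare ASR training pairs, the function below keeps recordings that have a non-empty transcript and a duration inside a chosen window. The thresholds, the function name `select_training_pairs`, and the example records are assumptions made for illustration, not values from the dataset.

```python
from typing import Iterable, Mapping


def select_training_pairs(
    samples: Iterable[Mapping],
    min_seconds: float = 1.0,
    max_seconds: float = 60.0,
) -> list[tuple[str, str]]:
    """Return (file_name, transcript) pairs for clips that look usable."""
    pairs = []
    for sample in samples:
        transcript = (sample.get("transcript") or "").strip()
        if not transcript:
            continue  # skip clips without a transcript
        if not min_seconds <= sample["duration"] <= max_seconds:
            continue  # skip clips outside the duration window
        pairs.append((sample["file_name"], transcript))
    return pairs


# Made-up records that mimic the documented fields.
records = [
    {"file_name": "a.wav", "duration": 27.0, "transcript": "ሰላም"},
    {"file_name": "b.wav", "duration": 0.4, "transcript": "አዎ"},
    {"file_name": "c.wav", "duration": 31.2, "transcript": "  "},
]
print(select_training_pairs(records))
```

Because the transcripts are Whisper-generated, a manual review pass on the selected pairs may still be worthwhile before training.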
## Amharic Speech Characteristics
This dataset captures authentic Amharic speech features:
- **Phonology**: Ejective consonants (written ጠ, ቀ, ጨ), labialized consonants
- **Semitic features**: Triconsonantal root system (like Arabic/Hebrew)
- **Complex morphology**: Rich verb conjugation, case marking
- **Stress**: Stress-accent patterns
- **Natural prosody**: Authentic rhythm, intonation
- **Real-world audio**: Mobile recordings, natural environments
## Market Context
### Ethiopian Tech & Economy
- **57M+ Amharic speakers** - Ethiopia's lingua franca
- **Ethiopia**: 120M population (Africa's 2nd most populous), 25M+ internet users
- **Smartphone penetration**: 45% and growing rapidly
- **Digital payments**: Mobile money growing 40%+ annually
- **Tech ecosystem**: Addis Ababa emerging as East African tech hub
- **YouTube**: Ethiopian content exploding (music, news, education)
### Why Amharic Matters
- **Underrepresented in AI**: <0.01% of speech datasets despite 57M+ speakers
- **National language**: Ethiopia's federal working language (government, education, media)
- **Ancient script**: Ge'ez script (one of Africa's oldest writing systems)
- **Growing digital economy**: E-commerce, fintech, edtech booming in Ethiopia
- **Large youth population**: 70% under 30 = massive smartphone adoption potential
## Ge'ez Script
Amharic uses the **Ge'ez script** (also called Ethiopic script):
- **Abugida**: Each character represents a consonant+vowel combination
- **7 vowel orders**: ሀ ሁ ሂ ሃ ሄ ህ ሆ (ha, hu, hi, ha, he, hə, ho)
- **33 base consonants** × 7 vowel orders = 231+ characters
- **Unique to Ethiopia/Eritrea**: Used for Amharic, Tigrinya, Ge'ez (liturgical)
- **Left-to-right**: Unlike Arabic/Hebrew (though same Semitic family)
**ASR/TTS systems need to handle this unique script for written transcription.**
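The abugida structure is visible in Unicode, which can help when validating transcripts. The sketch below assumes the regular layout of the basic syllable rows in the Unicode Ethiopic block, where each consonant row starts at a code point divisible by 8 and its first seven positions hold the seven orders. Some labialized rows do not follow this pattern, so the helper is not a general-purpose tool. All function names are illustrative.

```python
import unicodedata

ORDERS_PER_CONSONANT = 7  # the seven vowel orders


def is_ethiopic(char: str) -> bool:
    """True if Unicode names the character as Ethiopic."""
    return unicodedata.name(char, "").startswith("ETHIOPIC")


def vowel_orders(base: str) -> list[str]:
    """Return the seven orders for a first-order Ethiopic syllable such as ሀ."""
    code_point = ord(base)
    if code_point % 8 != 0 or not is_ethiopic(base):
        raise ValueError(f"{base!r} is not a first-order Ethiopic syllable")
    return [chr(code_point + offset) for offset in range(ORDERS_PER_CONSONANT)]


def ethiopic_ratio(text: str) -> float:
    """Share of non-space characters in `text` that are Ethiopic."""
    chars = [c for c in text if not c.isspace()]
    return sum(is_ethiopic(c) for c in chars) / len(chars) if chars else 0.0


print(" ".join(vowel_orders("ሀ")))
print(ethiopic_ratio("ሰላም hello"))
```

A low `ethiopic_ratio` on a transcript could flag output that was romanized or produced in the wrong language, which is a plausible failure mode for machine-generated transcriptions.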
## Ethical Considerations
All data was collected with explicit informed consent from native Amharic speakers. Recordings contain general conversational topics only - no sensitive personal information.
## Comparison to Other Datasets
| Dataset | Language | Hours | Speakers | Natural? |
|---------|----------|-------|----------|----------|
| LibriSpeech | English | 1,000 | 2,484 | ❌ Read speech |
| Common Voice | Amharic | ~20 | Few | ⚠️ Read sentences |
| **Silencio Amharic (full inventory)** | **Amharic** | **1,081+** | **4,500+** | **✅ Spontaneous** |
**The full Silencio inventory is the largest natural Amharic speech dataset available; this repository contains a 51-recording sample of it.**
## Citation
If you use this dataset in your research or in a commercially licensed product, please cite:
```bibtex
@dataset{silencio_amharic_speech_2026,
title={Amharic Speech Dataset},
author={Silencio Network},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/SilencioNetwork/amharic-speech}
}
```
## Related Datasets
- [African Languages Speech](https://huggingface.co/datasets/SilencioNetwork/african-languages-speech) - 6 African languages (Swahili, Hausa, Yoruba, Igbo, Amharic, Nigerian English)
- [Complete Voice AI Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) - 39 language/accent variants
- [Indian Languages Speech](https://huggingface.co/datasets/SilencioNetwork/indian-languages-speech) - 9 Indian languages
- [Yoruba Speech](https://huggingface.co/datasets/SilencioNetwork/yoruba-speech) - 50 Yoruba samples
## License
**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)
✅ Free for research and non-commercial use
❌ Commercial use requires licensing (contact us)
## About Silencio
Silencio is a voice AI data sourcing company with 2M+ contributors across 180+ countries. We provide scaled sourcing of real-world audio and speech data for AI labs, robotics companies, and enterprises building voice AI products.
🌐 [silencioai.com](https://www.silencioai.com)
📧 [sofia@silencioai.com](mailto:sofia@silencioai.com)
---
**Tags**: amharic, አማርኛ, ethiopian languages, east africa, ethiopia, geez script, ethiopic script, semitic languages, african languages, low-resource languages, speech recognition, asr, tts, voice ai, natural speech, spontaneous speech, ethiopian speech, addis ababa
Provided by:
SilencioNetwork



