
SilencioNetwork/amharic-speech

Source: Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/amharic-speech
Description:
---
license: cc-by-nc-4.0
language:
- am
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- amharic
- ethiopian-languages
- east-africa
- ethiopia
- geez-script
- semitic-languages
- african-languages
- low-resource
- speech-data
- voice-ai
- asr
- tts
pretty_name: "Amharic Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: amharic_ethiopia
  data_files:
  - split: free_speech
    path: amharic_ethiopia/free_speech/**
size_categories:
- n<1K
---

# Amharic Speech Dataset

**The most comprehensive Amharic speech dataset on HuggingFace - natural, real-world Amharic from native speakers in Ethiopia and the diaspora.**

## Dataset Overview

- **Total audio samples**: 51 recordings
- **Total duration**: ~23 minutes
- **Primary region**: Ethiopia (Addis Ababa)
- **Context**: Natural spontaneous speech (free_speech)
- **Audio format**: WAV files
- **Sample rate**: 48 kHz
- **License**: CC BY-NC 4.0 (free for research, non-commercial use)

## Language Context

**Amharic (አማርኛ)** is Ethiopia's primary language:

- **Speakers**: 57M+ (32M native, 25M+ L2)
- **Official language**: Ethiopia (federal working language)
- **Geographic spread**: Ethiopia (primarily central/northern regions)
- **Ge'ez script**: Unique abugida writing system (syllabic alphabet)
- **Linguistic family**: Semitic (Afro-Asiatic) - related to Arabic, Hebrew, Tigrinya
- **Cultural significance**: Ethiopian Orthodox Christianity, Ethiopian literature, music
- **Digital presence**: Growing on social media, YouTube, Ethiopian tech ecosystem

## Target Applications

This dataset is designed for:

- **Amharic ASR systems** - Speech recognition for 57M+ speakers
- **Voice assistants** - Ethiopian tech startups, mobile banking
- **TTS for Amharic** - Text-to-speech with authentic Ethiopian pronunciation
- **Language learning apps** - Pronunciation training for Amharic learners
- **Content moderation** - Social media platforms operating in Ethiopia
- **Transcription services** - Ethiopian media, podcasts, YouTube content
- **Government services** - Voice-enabled public services in Ethiopia

## Dataset Structure

```
amharic-speech/
└── data/
    ├── audio/          # 51 WAV files
    └── metadata.csv    # Speaker metadata & transcripts
```

## Data Splits

### Amharic (Ethiopia)

- **Files**: 51 recordings
- **Dialect**: Primarily Addis Ababa (standard Amharic)
- **Context**: Natural spontaneous speech
- **Use case**: General-purpose Amharic ASR, Ethiopian voice AI

## Languages Sampled in This Dataset

✅ 51 audio samples available for immediate download:

- **Amharic**: 51 files (~23 minutes)

## Full OTS Inventory Available

📊 This sample represents **<0.19%** of Silencio's complete Amharic speech inventory.
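
As a rough check on that share, the figures quoted in this card (51 recordings and ~23 minutes in the sample; 105,000+ recordings and 1,081+ hours in the full inventory) can be compared directly. The sketch below is only arithmetic on those published numbers, and the variable names are illustrative. Because the inventory totals are lower bounds, the computed shares are upper bounds: roughly 0.04% by duration and 0.05% by recording count, both well under the 0.19% stated above.

```python
# Figures as quoted in this dataset card (the inventory totals are lower bounds).
sample_minutes = 23
sample_recordings = 51
full_hours = 1081
full_recordings = 105_000

# Share of the full inventory covered by this sample, in percent.
share_by_duration = sample_minutes / (full_hours * 60) * 100
share_by_count = sample_recordings / full_recordings * 100

print(f"Share by duration:   {share_by_duration:.3f}%")
print(f"Share by recordings: {share_by_count:.3f}%")
```
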
Contact us for access to our full Amharic corpus:

**Amharic by Country:**

- **Ethiopia**: 1,058 hours, 102,378 recordings
- **American Samoa**: 7 hours, 623 recordings
- **Faroe Islands**: 3 hours, 653 recordings
- **Angola**: 2 hours, 311 recordings
- **United States**: 2 hours, 260 recordings
- **Algeria**: 2 hours, 233 recordings
- **Albania**: 2 hours, 83 recordings
- **Honduras**: 1 hour, 234 recordings
- **+ 15 more countries** (diaspora communities)

**Total**: **1,081+ hours** across **105,000+ recordings**

**Contact us for access**: [sofia@silencioai.com](mailto:sofia@silencioai.com)

## Key Features

✅ **Native speakers** - Authentic Ethiopian Amharic (Addis Ababa)
✅ **Natural speech** - Real conversational Amharic, not scripted
✅ **Standard dialect** - Addis Ababa variant (widely understood)
✅ **Diverse topics** - Daily life, opinions, technology, culture
✅ **High audio quality** - 48 kHz WAV format
✅ **Rich metadata** - Gender, dialect, emotions, transcriptions in Ge'ez script
✅ **Ethical data collection** - Consent-based, privacy-preserving

## Use Cases

### 1. Amharic Speech Recognition

Build ASR systems for the 57M+ Amharic-speaking market in Ethiopia.

### 2. Voice Banking & Fintech

Power voice-enabled mobile banking in Ethiopia (M-BIRR, HelloCash, CBE Birr).

### 3. Amharic TTS

Train text-to-speech models with authentic Ethiopian Amharic pronunciation.

### 4. Content Moderation

Build speech detection for Ethiopian social media platforms and YouTube.

### 5. Government Services

Enable voice-based public services in Ethiopia (health, education, agriculture).

### 6. Voice Assistants

Develop Amharic-language voice assistants for Ethiopia's growing smartphone market.

## Loading the Dataset

```python
from datasets import load_dataset

# Load the Amharic config declared in this card's metadata ("amharic_ethiopia").
dataset = load_dataset("SilencioNetwork/amharic-speech", "amharic_ethiopia")

# The card defines a single split named "free_speech" (there is no "train" split).
for sample in dataset["free_speech"]:
    audio = sample["audio"]            # decoded audio for this recording
    transcript = sample["transcript"]  # transcription in Ge'ez script
    dialect = sample["dialect"]        # regional variant label
    print(f"Transcript: {transcript}")
    print(f"Dialect: {dialect}")
```

## Sample Metadata

Each recording includes:

- `file_name`: Audio file path
- `id`: Unique recording ID
- `gender`: Speaker gender
- `location`: Speaker location
- `mother_tongue`: Native language (Amharic)
- `dialect`: Regional variant (Ethiopia - Addis Ababa)
- `duration`: Recording length (seconds)
- `emotions`: Emotion labels (happy, excited, focused, relaxed, etc.)
- `language`: Amharic
- `type_of_script`: free_speech (spontaneous, unscripted)
- `transcript`: Whisper-generated transcription (Ge'ez script)
- `script`: Original prompt (question asked in Amharic)

## Amharic Speech Characteristics

This dataset captures authentic Amharic speech features:

- **Phonology**: Ejective consonants (written ጠ, ቀ, ጨ in Ge'ez script), labialized consonants
- **Semitic features**: Triconsonantal root system (like Arabic/Hebrew)
- **Complex morphology**: Rich verb conjugation, case marking
- **Tone/stress**: Stress-accent patterns
- **Natural prosody**: Authentic rhythm, intonation
- **Real-world audio**: Mobile recordings, natural environments

## Market Context

### Ethiopian Tech & Economy

- **57M+ Amharic speakers** - Ethiopia's lingua franca
- **Ethiopia**: 120M population (Africa's 2nd most populous), 25M+ internet users
- **Smartphone penetration**: 45% and growing rapidly
- **Digital payments**: Mobile money growing 40%+ annually
- **Tech ecosystem**: Addis Ababa emerging as East African tech hub
- **YouTube**: Ethiopian content exploding (music, news, education)

### Why Amharic Matters

- **Underrepresented in AI**: <0.01% of speech datasets despite 57M+ speakers
- **National language**: Ethiopia's federal working language (government, education, media)
- **Ancient script**: Ge'ez alphabet (one of Africa's oldest writing systems)
- **Growing digital economy**: E-commerce, fintech, edtech booming in Ethiopia
- **Large youth population**: 70% under 30 = massive smartphone adoption potential

## Ge'ez Script

Amharic uses the **Ge'ez script** (also called Ethiopic script):

- **Abugida**: Each character represents a consonant+vowel combination
- **7 vowel orders**: ሀ ሁ ሂ ሃ ሄ ህ ሆ (hä, hu, hi, ha, he, hə, ho)
- **33 base consonants** × 7 vowel orders = 231+ characters
- **Unique to Ethiopia/Eritrea**: Used for Amharic, Tigrinya, Ge'ez (liturgical)
- **Left-to-right**: Unlike Arabic and Hebrew, which belong to the same Semitic family

**ASR/TTS systems need to handle this unique script for written transcription.**

## Ethical Considerations

All data was collected with explicit informed consent from native Amharic speakers. Recordings contain general conversational topics only - no sensitive personal information.

## Comparison to Other Datasets

| Dataset | Language | Hours | Speakers | Natural? |
|---------|----------|-------|----------|----------|
| LibriSpeech | English | 1,000 | 2,484 | ❌ Read speech |
| Common Voice | Amharic | ~20 | Few | ⚠️ Read sentences |
| **Silencio Amharic** (full inventory) | **Amharic** | **1,081+** | **4,500+** | **✅ Spontaneous** |

**This is the largest natural Amharic speech dataset available.**

## Citation

If you use this dataset in your research or commercial product, please cite:

```bibtex
@dataset{silencio_amharic_speech_2026,
  title={Amharic Speech Dataset},
  author={Silencio Network},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/SilencioNetwork/amharic-speech}
}
```

## Related Datasets

- [African Languages Speech](https://huggingface.co/datasets/SilencioNetwork/african-languages-speech) - 6 African languages (Swahili, Hausa, Yoruba, Igbo, Amharic, Nigerian English)
- [Complete Voice AI Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) - 39 language/accent variants
- [Indian Languages Speech](https://huggingface.co/datasets/SilencioNetwork/indian-languages-speech) - 9 Indian languages
- [Yoruba Speech](https://huggingface.co/datasets/SilencioNetwork/yoruba-speech) - 50 Yoruba samples

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

✅ Free for research and non-commercial use
❌ Commercial use requires licensing (contact us)

## About Silencio

Silencio is a voice AI data sourcing company with 2M+ contributors across 180+ countries. We provide scaled sourcing of real-world audio and speech data for AI labs, robotics companies, and enterprises building voice AI products.

🌐 [silencioai.com](https://www.silencioai.com)
📧 [sofia@silencioai.com](mailto:sofia@silencioai.com)

---

**Tags**: amharic, አማርኛ, ethiopian languages, east africa, ethiopia, geez script, ethiopic script, semitic languages, african languages, low-resource languages, speech recognition, asr, tts, voice ai, natural speech, spontaneous speech, ethiopian speech, addis ababa
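
The vowel-order layout described in the Ge'ez Script section is mirrored in Unicode: within the Ethiopic block, the seven orders of a basic consonant sit on consecutive code points starting from its first-order form. The sketch below lists them using only the Python standard library, which can be handy when building a character vocabulary for ASR or TTS. The helper name `vowel_orders` is illustrative and not part of this dataset or any library, and the consecutive-code-point assumption is meant for basic first-order syllables such as ሀ, ጠ, ቀ, and ጨ rather than for every Ethiopic character.

```python
import unicodedata


def vowel_orders(base: str) -> list:
    """Return the 7 vowel orders of a first-order Ethiopic syllable.

    Assumes `base` is the first-order form of a basic consonant, so that
    its remaining six orders occupy the next six Unicode code points.
    """
    name = unicodedata.name(base, "") if len(base) == 1 else ""
    if not name.startswith("ETHIOPIC SYLLABLE"):
        raise ValueError(f"not a single Ethiopic syllable: {base!r}")
    return [chr(ord(base) + offset) for offset in range(7)]


# Print the seven orders for a few base consonants mentioned in this card.
for base in "ሀጠቀጨ":
    print(base, "->", " ".join(vowel_orders(base)))

# The character count quoted above: 33 base consonants x 7 vowel orders.
print("Basic syllable count:", 33 * 7)
```
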
Provider:
SilencioNetwork