SilencioNetwork/complete-voiceai-speech-dataset
Source: Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/complete-voiceai-speech-dataset
Resource description:
---
license: cc-by-nc-4.0
language:
- en
- de
- es
- fr
- pt
- ru
- tr
- vi
- ja
- it
- gu
- kn
- ml
- mr
- or
- te
- ar
- uk
- be
- zh
- pl
- sw
- ha
- yo
- zu
- am
- ig
multilinguality:
- multilingual
task_categories:
- automatic-speech-recognition
- audio-classification
- text-to-speech
tags:
- voice-ai
- speech-data
- accent
- emotion
- african-languages
- multilingual
- asr
- tts
- conversational-ai
- real-world-audio
- crowdsourced
pretty_name: "🎙️ Silencio Voice AI Sample Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: spanish_mexico
  data_files:
  - split: free_speech
    path: spanish_mexico/free_speech/**
  - split: keywords
    path: spanish_mexico/keywords/**
  - split: monologues
    path: spanish_mexico/monologues/**
- config_name: english_china
  data_files:
  - split: free_speech
    path: english_china/free_speech/**
  - split: keywords
    path: english_china/keywords/**
  - split: monologues
    path: english_china/monologues/**
- config_name: english_nigeria
  data_files:
  - split: free_speech
    path: english_nigeria/free_speech/**
  - split: keywords
    path: english_nigeria/keywords/**
  - split: monologues
    path: english_nigeria/monologues/**
- config_name: english_united_states
  data_files:
  - split: free_speech
    path: english_united_states/free_speech/**
  - split: keywords
    path: english_united_states/keywords/**
  - split: monologues
    path: english_united_states/monologues/**
- config_name: german_germany
  data_files:
  - split: free_speech
    path: german_germany/free_speech/**
  - split: keywords
    path: german_germany/keywords/**
  - split: monologues
    path: german_germany/monologues/**
- config_name: english_algeria
  data_files:
  - split: free_speech
    path: english_algeria/free_speech/**
- config_name: english_australia
  data_files:
  - split: free_speech
    path: english_australia/free_speech/**
- config_name: english_belarus
  data_files:
  - split: free_speech
    path: english_belarus/free_speech/**
- config_name: english_egypt
  data_files:
  - split: free_speech
    path: english_egypt/free_speech/**
- config_name: english_french_speaking
  data_files:
  - split: free_speech
    path: english_french_speaking/free_speech/**
- config_name: english_haiti
  data_files:
  - split: free_speech
    path: english_haiti/free_speech/**
- config_name: english_ireland
  data_files:
  - split: free_speech
    path: english_ireland/free_speech/**
- config_name: english_jamaica
  data_files:
  - split: free_speech
    path: english_jamaica/free_speech/**
- config_name: english_kenya
  data_files:
  - split: free_speech
    path: english_kenya/free_speech/**
- config_name: english_mandarin
  data_files:
  - split: free_speech
    path: english_mandarin/free_speech/**
- config_name: english_medical
  data_files:
  - split: medical
    path: english_medical/medical/**
- config_name: english_pakistan
  data_files:
  - split: free_speech
    path: english_pakistan/free_speech/**
- config_name: english_poland
  data_files:
  - split: free_speech
    path: english_poland/free_speech/**
- config_name: english_russia
  data_files:
  - split: free_speech
    path: english_russia/free_speech/**
- config_name: english_south_africa
  data_files:
  - split: free_speech
    path: english_south_africa/free_speech/**
- config_name: english_uganda
  data_files:
  - split: free_speech
    path: english_uganda/free_speech/**
- config_name: english_united_kingdom
  data_files:
  - split: free_speech
    path: english_united_kingdom/free_speech/**
- config_name: english_ukraine
  data_files:
  - split: free_speech
    path: english_ukraine/free_speech/**
- config_name: french_canada
  data_files:
  - split: free_speech
    path: french_canada/free_speech/**
- config_name: french_global
  data_files:
  - split: free_speech
    path: french_global/free_speech/**
- config_name: global_medical
  data_files:
  - split: medical
    path: global_medical/medical/**
- config_name: gujarati_india
  data_files:
  - split: free_speech
    path: gujarati_india/free_speech/**
- config_name: italian_italy
  data_files:
  - split: free_speech
    path: italian_italy/free_speech/**
- config_name: japanese_japan
  data_files:
  - split: free_speech
    path: japanese_japan/free_speech/**
- config_name: kannada_india
  data_files:
  - split: free_speech
    path: kannada_india/free_speech/**
- config_name: malayalam_india
  data_files:
  - split: free_speech
    path: malayalam_india/free_speech/**
- config_name: mandarin_chinese_china
  data_files:
  - split: free_speech
    path: mandarin_chinese_china/free_speech/**
- config_name: marathi_india
  data_files:
  - split: free_speech
    path: marathi_india/free_speech/**
- config_name: odia_india
  data_files:
  - split: free_speech
    path: odia_india/free_speech/**
- config_name: portuguese_brazil
  data_files:
  - split: free_speech
    path: portuguese_brazil/free_speech/**
- config_name: russian_russia
  data_files:
  - split: free_speech
    path: russian_russia/free_speech/**
- config_name: telugu_india
  data_files:
  - split: free_speech
    path: telugu_india/free_speech/**
- config_name: turkish_turkey
  data_files:
  - split: free_speech
    path: turkish_turkey/free_speech/**
  - split: monologues
    path: turkish_turkey/monologues/**
- config_name: vietnamese_vietnam
  data_files:
  - split: monologues
    path: vietnamese_vietnam/monologues/**
- config_name: amharic_ethiopia
  data_files:
  - split: free_speech
    path: amharic_ethiopia/free_speech/**
- config_name: hausa_nigeria
  data_files:
  - split: free_speech
    path: hausa_nigeria/free_speech/**
- config_name: igbo_nigeria
  data_files:
  - split: free_speech
    path: igbo_nigeria/free_speech/**
- config_name: yoruba_nigeria
  data_files:
  - split: free_speech
    path: yoruba_nigeria/free_speech/**
size_categories:
- 1K<n<10K
---
# 🎙️ Silencio Network: Voice AI Sample Dataset
<p align="left">
<img src="https://cdn-uploads.huggingface.co/production/uploads/69162b50b89e7abe20de4b5a/LWhs4p2lPFcyiVsP0tluu.png" width="40%">
</p>
---
> **📊 This is a sample.** The full Silencio corpus contains **100,000+ hours** across **170+ countries** and **100+ languages**.
>
> **📧 Contact:** [sofia@silencioai.com](mailto:sofia@silencioai.com) for custom datasets, bulk licensing, or specific language requests.
---
## 🌍 Why Silencio Data?
Silencio data is collected **in the wild** from a massive, opt-in community (**2M+ contributors** across **180+ countries**), giving you:
- ✅ **Real-world accents, dialects, devices, and environments** that lab or scraped datasets don't capture
- ✅ **Explicit, traceable consent** — every recording tied to verified opt-in (GDPR/CCPA compliant)
- ✅ **Privacy-first pipelines** — anonymized, PII hashed, reduced legal risk for enterprise
- ✅ **Rapid scaling** into hard-to-source languages and niches
## 📊 Full Data Availability
| Category | Available | Top Languages |
|----------|-----------|---------------|
| **African Languages** | 25,000+ hrs | Swahili (9.7k), Nigerian English (8.1k), Hausa, Yoruba, Zulu |
| **Asian Languages** | 15,000+ hrs | Hindi, Mandarin, Japanese, Korean, Vietnamese |
| **European Languages** | 20,000+ hrs | German, Spanish, French, Portuguese, Polish |
| **Emotional Speech** | 5,000+ hrs | Joy, anger, sadness, fear, surprise, neutral |
| **Multi-Speaker** | 10,000+ hrs | Overlapping speech, turn-taking, interruptions |
| **Conversational** | 30,000+ hrs | Natural dialogue, disfluencies, backchannels |
**Custom requests?** We can source virtually any language, accent, or demographic at scale.
---
## 🎯 This Sample Dataset
This sample covers **43 configs** (language–region pairs plus two medical-domain sets) across **20+ languages** and is meant to demonstrate our data quality:
### Accented English
| Config | Accent/Region | Splits | Samples |
|--------|---------------|--------|---------|
| `english_algeria` | Algerian | free_speech | 21 |
| `english_australia` | Australian | free_speech | 25 |
| `english_belarus` | Belarusian | free_speech | 21 |
| `english_china` | Mandarin-influenced | free_speech, keywords, monologues | 75 |
| `english_egypt` | Egyptian | free_speech | 25 |
| `english_french_speaking` | French-speaking countries | free_speech | 21 |
| `english_haiti` | Haitian | free_speech | 21 |
| `english_ireland` | Irish | free_speech | 21 |
| `english_jamaica` | Jamaican | free_speech | 25 |
| `english_kenya` | Kenyan | free_speech | 25 |
| `english_mandarin` | Mandarin-influenced (Global) | free_speech | 25 |
| `english_nigeria` | Nigerian | free_speech, keywords, monologues | 75 |
| `english_pakistan` | Pakistani | free_speech | 25 |
| `english_poland` | Polish | free_speech | 25 |
| `english_russia` | Russian | free_speech | 25 |
| `english_south_africa` | South African | free_speech | 17 |
| `english_uganda` | Ugandan | free_speech | 25 |
| `english_united_kingdom` | British | free_speech | 25 |
| `english_united_states` | American | free_speech, keywords, monologues | 75 |
| `english_ukraine` | Ukrainian | free_speech | 25 |
### African Languages
| Config | Language | Region | Splits | Samples |
|--------|----------|--------|--------|---------|
| `amharic_ethiopia` | Amharic | Ethiopia | free_speech | 51 |
| `hausa_nigeria` | Hausa | Nigeria | free_speech | 42 |
| `igbo_nigeria` | Igbo | Nigeria | free_speech | 36 |
| `yoruba_nigeria` | Yoruba | Nigeria | free_speech | 39 |
### Other Languages
| Config | Language | Region | Splits | Samples |
|--------|----------|--------|--------|---------|
| `french_canada` | French | Canada | free_speech | 25 |
| `french_global` | French | Global | free_speech | 25 |
| `german_germany` | German | Germany | free_speech, keywords, monologues | 75 |
| `gujarati_india` | Gujarati | India | free_speech | 18 |
| `italian_italy` | Italian | Italy | free_speech | 25 |
| `japanese_japan` | Japanese | Japan | free_speech | 25 |
| `kannada_india` | Kannada | India | free_speech | 25 |
| `malayalam_india` | Malayalam | India | free_speech | 25 |
| `mandarin_chinese_china` | Mandarin | China | free_speech | 25 |
| `marathi_india` | Marathi | India | free_speech | 25 |
| `odia_india` | Odia | India | free_speech | 21 |
| `portuguese_brazil` | Portuguese | Brazil | free_speech | 25 |
| `russian_russia` | Russian | Russia | free_speech | 25 |
| `spanish_mexico` | Spanish | Mexico | free_speech, keywords, monologues | 56 |
| `telugu_india` | Telugu | India | free_speech | 25 |
| `turkish_turkey` | Turkish | Turkey | free_speech, monologues | 32 |
| `vietnamese_vietnam` | Vietnamese | Vietnam | monologues | 15 |
### Medical Domain
| Config | Language | Splits | Samples |
|--------|----------|--------|---------|
| `english_medical` | English | medical | 8 |
| `global_medical` | Multilingual (EN/ES/DE) | medical | 25 |
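
The per-config counts in the tables above can be cross-checked with a little arithmetic. The sketch below copies the Samples column by hand into a dictionary and sums it per table; the names `SAMPLES` and `totals` are illustrative and not part of the dataset.

```python
# Sample counts transcribed by hand from the four tables in this card.
SAMPLES = {
    "accented_english": {
        "english_algeria": 21, "english_australia": 25, "english_belarus": 21,
        "english_china": 75, "english_egypt": 25, "english_french_speaking": 21,
        "english_haiti": 21, "english_ireland": 21, "english_jamaica": 25,
        "english_kenya": 25, "english_mandarin": 25, "english_nigeria": 75,
        "english_pakistan": 25, "english_poland": 25, "english_russia": 25,
        "english_south_africa": 17, "english_uganda": 25,
        "english_united_kingdom": 25, "english_united_states": 75,
        "english_ukraine": 25,
    },
    "african_languages": {
        "amharic_ethiopia": 51, "hausa_nigeria": 42,
        "igbo_nigeria": 36, "yoruba_nigeria": 39,
    },
    "other_languages": {
        "french_canada": 25, "french_global": 25, "german_germany": 75,
        "gujarati_india": 18, "italian_italy": 25, "japanese_japan": 25,
        "kannada_india": 25, "malayalam_india": 25, "mandarin_chinese_china": 25,
        "marathi_india": 25, "odia_india": 21, "portuguese_brazil": 25,
        "russian_russia": 25, "spanish_mexico": 56, "telugu_india": 25,
        "turkish_turkey": 32, "vietnamese_vietnam": 15,
    },
    "medical": {"english_medical": 8, "global_medical": 25},
}


def totals(samples):
    """Return per-table (config count, sample count) and the overall pair."""
    per_table = {
        name: (len(configs), sum(configs.values()))
        for name, configs in samples.items()
    }
    overall = (
        sum(n for n, _ in per_table.values()),
        sum(s for _, s in per_table.values()),
    )
    return per_table, overall


per_table, (n_configs, n_samples) = totals(SAMPLES)
for name, (n, s) in per_table.items():
    print(f"{name}: {n} configs, {s} samples")
print(f"total: {n_configs} configs, {n_samples} samples")
```

With the counts as listed, the total comes to 43 configs and 1,315 samples, which is consistent with the `1K<n<10K` size category in the card metadata.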
## 🚀 Quick Start
```python
from datasets import load_dataset

# Load Nigerian English samples (this config ships all three splits)
ds = load_dataset("SilencioNetwork/complete-voiceai-speech-dataset", "english_nigeria")

# Access the different speech types
free_speech = ds['free_speech']
keywords = ds['keywords']
monologues = ds['monologues']

# Process each sample in the free_speech split
for sample in free_speech:
    audio = sample['audio']
    transcript = sample['transcript']
    speaker_id = sample['speaker_id']
    emotion = sample['emotions']
    print(f"Speaker {speaker_id}: {transcript[:50]}...")
```
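
Only five configs (`spanish_mexico`, `english_china`, `english_nigeria`, `english_united_states`, `german_germany`) ship all three splits. Most have `free_speech` only, `turkish_turkey` has two, `vietnamese_vietnam` has `monologues` only, and the medical configs have a single `medical` split, so indexing `ds['keywords']` on those raises a `KeyError`. A loop over whatever splits are present avoids that. The sketch below assumes only that the loaded object behaves like a mapping from split name to an iterable of row dictionaries; `preview_splits` is a hypothetical helper and the demo rows are invented stand-ins, not real data.

```python
from itertools import islice


def preview_splits(ds, n=2, width=50):
    """Yield (split, speaker_id, transcript preview) for the first n rows of each split."""
    for split_name, split in ds.items():
        for row in islice(split, n):
            # Guard against a missing transcript; an assumption, the card does not say it can be empty.
            text = (row["transcript"] or "")[:width]
            yield split_name, row["speaker_id"], text


# Stand-in for a single-split config such as "vietnamese_vietnam" (rows are made up).
demo = {
    "monologues": [
        {"speaker_id": "spk_a", "transcript": "first made-up transcript"},
        {"speaker_id": "spk_b", "transcript": "second made-up transcript"},
        {"speaker_id": "spk_c", "transcript": "third made-up transcript"},
    ]
}

previews = list(preview_splits(demo))
for split_name, speaker_id, text in previews:
    print(f"[{split_name}] Speaker {speaker_id}: {text}")
```

The same function can be passed the result of `load_dataset(...)` for any config, since it never names a split explicitly.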
## 🎤 Speech Types
| Type | Description | Use Cases |
|------|-------------|-----------|
| `free_speech` | Unscripted speech on provided topics | Conversational AI, dialogue systems |
| `keywords` | Short isolated phrases/terms | Wake word detection, command recognition |
| `monologues` | Longer scripted passages | TTS training, ASR benchmarking |
## 📋 Rich Metadata
Every recording includes:
- **Speaker demographics**: gender, ethnicity, occupation, birth year, birth place
- **Linguistic info**: mother tongue, dialect, language proficiency data
- **Recording context**: device, OS, browser, location, background noise
- **Content**: script, Whisper-generated transcript, emotion labels
- **Technical**: duration, 48 kHz WAV format
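
These metadata columns make it straightforward to profile a split before training. As a sketch, assuming rows are plain dictionaries keyed by the column names from the card's `dataset_info`, total `duration` can be grouped by any string column. The helper `total_duration_by` and the example rows are invented for illustration, and the card does not state the unit of `duration` (seconds is a guess).

```python
from collections import defaultdict


def total_duration_by(rows, key):
    """Sum the `duration` column for each distinct value of `key`."""
    sums = defaultdict(float)
    for row in rows:
        sums[row[key]] += row["duration"]
    return dict(sums)


# Made-up rows using a subset of the documented columns.
rows = [
    {"speaker_id": "spk_a", "gender": "female", "device": "mobile", "duration": 12.5},
    {"speaker_id": "spk_b", "gender": "male", "device": "desktop", "duration": 8.0},
    {"speaker_id": "spk_a", "gender": "female", "device": "mobile", "duration": 4.5},
]

by_gender = total_duration_by(rows, "gender")
by_speaker = total_duration_by(rows, "speaker_id")
print(by_gender)
print(by_speaker)
```

Grouping by `speaker_id`, `device`, or `noise_sources` in the same way gives a quick view of how balanced a config is.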
## 🎯 Ideal For
- 🗣️ **ASR training** — accent-robust speech recognition
- 🔊 **TTS development** — diverse voice synthesis
- 😊 **Emotion recognition** — labeled emotional speech
- 🌍 **Multilingual models** — 100+ languages available
- 🎙️ **Speaker verification** — unique speaker embeddings
- 📊 **Benchmarking** — real-world robustness testing
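
For the ASR and benchmarking uses above, a common precaution is to keep each speaker entirely in either the training or the evaluation portion, so scores are not inflated by speaker overlap. The card does not prescribe any such partition; the sketch below is one possible approach that assigns speakers by a stable hash of `speaker_id`. The function names and the 20% default are assumptions made for illustration.

```python
import hashlib


def speaker_bucket(speaker_id, test_fraction=0.2):
    """Deterministically assign a speaker to 'train' or 'test' from a hash of the ID."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if value < test_fraction else "train"


def split_by_speaker(rows, test_fraction=0.2):
    """Partition rows so that no speaker_id appears in both parts."""
    parts = {"train": [], "test": []}
    for row in rows:
        parts[speaker_bucket(row["speaker_id"], test_fraction)].append(row)
    return parts


# Made-up rows: 10 hypothetical speakers with 3 recordings each.
rows = [{"speaker_id": f"spk_{i}", "clip": j} for i in range(10) for j in range(3)]
parts = split_by_speaker(rows)
print(len(parts["train"]), len(parts["test"]))
```

Because the assignment depends only on the ID, it stays the same across runs and across configs. The realised test share only approximates `test_fraction` when there are few speakers.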
## ⚖️ License
**CC BY-NC 4.0** — Free for research and non-commercial use.
For **commercial licensing**, contact [sofia@silencioai.com](mailto:sofia@silencioai.com).
## 📧 Get More Data
**Need the full corpus?** This sample represents <0.01% of our available data.
| What You Need | We Can Provide |
|---------------|----------------|
| Specific language/accent | ✅ 100+ languages available |
| Emotional speech | ✅ 5,000+ hours labeled |
| Multi-speaker/overlapping | ✅ 10,000+ hours |
| Custom demographics | ✅ Age, gender, occupation targeting |
| Bulk volume | ✅ 100,000+ hours total |
**📧 Email:** [sofia@silencioai.com](mailto:sofia@silencioai.com)
**🌐 Website:** [silencioai.com](https://www.silencioai.com)
---
## Citation
```bibtex
@dataset{silencio_network_speech_2025,
  title = {Silencio Network Voice AI Speech Corpus},
  author = {Silencio Network},
  year = {2025},
  publisher = {Hugging Face},
  license = {CC BY-NC 4.0}
}
```
Provided by:
SilencioNetwork



