AndyOnyango/KenSpeech
Source: Hugging Face (updated 2026-04-10, indexed 2026-04-12)
Download link: https://hf-mirror.com/datasets/AndyOnyango/KenSpeech
---
language:
- sw
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
tags:
- swahili
- kiswahili
- speech-recognition
- asr
- stt
- low-resource-languages
- african-languages
- kenya
- speech-to-text
pretty_name: KenSpeech
size_categories:
- 1K<n<10K
---
# KenSpeech: A Swahili Speech Dataset for ASR
## Dataset Description
**KenSpeech** is a Swahili speech dataset containing about 27.5 hours of read and spontaneous speech recorded from 26 native Swahili speakers in Kenya. It is designed for training and evaluating automatic speech recognition (ASR) and speech-to-text (STT) systems for Swahili.
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total Duration | 27 hours 31 minutes 50 seconds |
| Read Speech Duration | 26 hours 32 minutes 37 seconds |
| Spontaneous Speech Duration | 59 minutes 13 seconds |
| Total Speakers | 26 |
| Female Speakers | 19 |
| Male Speakers | 7 |
| Lexicon Words | 31,728+ |
## Audio Format
| Property | Value |
|----------|-------|
| Sampling Rate | 16 kHz |
| Channels | Mono |
---
## Dataset Format
The dataset is distributed as **Parquet files** with embedded audio for optimal compatibility:
- **Format**: Apache Parquet (with embedded audio bytes)
- **Encoding**: UTF-8 for text fields
- **Compatibility**: Works with `datasets` 4.0.0+ without custom loading scripts
## Data Fields
| Column | Type | Description |
|--------|------|-------------|
| audio | Audio | Audio waveform (decoded array + sampling_rate) |
| source_folder | string | Origin folder (`stt_dictionary` or `stt_transcripts`) |
| gender | string | Speaker gender (`male` or `female`) |
| speaker | string | Speaker identifier (`speaker_1`, `speaker_2`, etc.) |
| transcript | string | Transcription text |
### Example Record
```python
{
'audio': {'path': '...', 'array': array([0.001, -0.003, ...]), 'sampling_rate': 16000},
'source_folder': 'stt_dictionary',
'gender': 'female',
'speaker': 'speaker_1',
'transcript': 'masaa mawili kabla basi kuwasili...'
}
```
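
Because each decoded `audio` entry carries both the waveform and its sampling rate, a clip's length follows directly from the two. The sketch below assumes a record shaped like the example above. `clip_duration_seconds` is a hypothetical helper (not part of the dataset or the `datasets` library), and the record used here is synthetic, not real data.

```python
def clip_duration_seconds(record):
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = record["audio"]
    return len(audio["array"]) / audio["sampling_rate"]

# Synthetic stand-in for a dataset record: 2 seconds of silence at 16 kHz.
fake_record = {
    "audio": {"path": None, "array": [0.0] * 32000, "sampling_rate": 16000},
    "source_folder": "stt_dictionary",
    "gender": "female",
    "speaker": "speaker_1",
    "transcript": "",
}
print(clip_duration_seconds(fake_record))  # 2.0
```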
---
## Usage
### Loading with Hugging Face Datasets
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("Kencorpus/KenSpeech")
# Access a sample
sample = dataset['train'][0]
print(sample['transcript'])
print(sample['gender'])
print(sample['speaker'])
print(sample['audio']['sampling_rate']) # 16000
print(sample['audio']['array'].shape) # audio waveform
```
### Filtering by Gender
```python
from datasets import load_dataset
dataset = load_dataset("Kencorpus/KenSpeech")
# Get female speakers only
female_data = dataset['train'].filter(lambda x: x['gender'] == 'female')
print(f"Female samples: {len(female_data)}")
# Get male speakers only
male_data = dataset['train'].filter(lambda x: x['gender'] == 'male')
print(f"Male samples: {len(male_data)}")
```
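
The same column-based approach can be used to keep speakers separate between training and evaluation, which is a common precaution with a small speaker pool. This is a minimal sketch under assumptions: `split_by_speaker` is a hypothetical helper, the choice of held-out speaker is arbitrary, and the speaker list is synthetic (with the real data it would come from the `speaker` column of `dataset['train']`).

```python
from collections import Counter

def split_by_speaker(speakers, held_out):
    """Return (train_indices, test_indices) so no held-out speaker appears in train."""
    held_out = set(held_out)
    train_idx = [i for i, s in enumerate(speakers) if s not in held_out]
    test_idx = [i for i, s in enumerate(speakers) if s in held_out]
    return train_idx, test_idx

# Synthetic speaker column, used here only for illustration.
speakers = ["speaker_1", "speaker_2", "speaker_1", "speaker_3", "speaker_2"]
print(Counter(speakers))  # number of samples per speaker

train_idx, test_idx = split_by_speaker(speakers, held_out=["speaker_3"])
print(train_idx, test_idx)  # [0, 1, 2, 4] [3]
```

The resulting index lists can then be passed to `Dataset.select` to build the two subsets.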
### Training an ASR Model
```python
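from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC
# Load dataset
dataset = load_dataset("Kencorpus/KenSpeech")
# Load a multilingual pretrained checkpoint.
# This checkpoint ships a feature extractor but no tokenizer, so a Swahili
# character vocabulary must be built from the transcripts before fine-tuning.
model_name = "facebook/wav2vec2-large-xlsr-53"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)  # CTC head is newly initialized
# Process a sample into model-ready input values
sample = dataset['train'][0]
inputs = feature_extractor(sample['audio']['array'], sampling_rate=16000, return_tensors="pt")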
```
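
Fine-tuning a CTC model also needs a character vocabulary derived from the transcripts. The following is a minimal sketch of that step, not the curators' procedure: `build_char_vocab` is a hypothetical helper, the `[UNK]`/`[PAD]` token names and the `|` word delimiter follow a common convention for CTC tokenizers, and the two transcripts are short illustrative strings. The resulting mapping could be saved as a JSON vocabulary file and loaded with `Wav2Vec2CTCTokenizer`.

```python
def build_char_vocab(transcripts):
    """Map each character seen in the transcripts to an integer id."""
    chars = sorted(set("".join(transcripts)))
    vocab = {ch: i for i, ch in enumerate(chars)}
    # CTC tokenizers conventionally use "|" as the word delimiter instead of a space.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Special tokens: unknown character and padding (used as the CTC blank).
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

# Illustrative transcripts; with the real data, pass the `transcript` column.
transcripts = ["masaa mawili", "wanasema"]
vocab = build_char_vocab(transcripts)
print(len(vocab))  # 11
```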
---
## Additional Resources
### Pronunciation Lexicon (`lexicon.csv`)
A Swahili pronunciation lexicon of 31,728+ words with their phonetic transcriptions.
**Format:** `word,phoneme_sequence`
```
wanapaswa,W AH N AH P AH S W AH
wanasema,W AH N AH S EH M AH
wanataka,W AH N AH T AH K AH
```
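
Lines in this format can be read into a word-to-phonemes mapping with the standard library. The sketch below parses the three sample lines shown above. `parse_lexicon` is a hypothetical helper, and it assumes the file has no header row and that phonemes are separated by single spaces, neither of which is confirmed by the card.

```python
import csv
import io

def parse_lexicon(text):
    """Parse `word,phoneme_sequence` lines into {word: [phoneme, ...]}."""
    lexicon = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        word, phonemes = row[0].strip(), row[1].split()
        lexicon[word] = phonemes
    return lexicon

sample_lines = """wanapaswa,W AH N AH P AH S W AH
wanasema,W AH N AH S EH M AH
wanataka,W AH N AH T AH K AH
"""
lexicon = parse_lexicon(sample_lines)
print(lexicon["wanasema"])  # ['W', 'AH', 'N', 'AH', 'S', 'EH', 'M', 'AH']
```

For the full file, the same function can be applied to the contents of `lexicon.csv` read as UTF-8 text.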
### Transcript-only Data (`transcripts_only.csv`)
Additional transcripts from the `stt_transcripts` collection that have no corresponding audio.
---
## Speech Types
| Type | Duration | Percentage |
|------|----------|------------|
| Read Speech | 26h 32m 37s | 96.4% |
| Spontaneous Speech | 59m 13s | 3.6% |
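
The percentages in the table can be reproduced from the stated durations. This short check converts each duration to seconds and rounds the shares to one decimal place; `to_seconds` is a helper introduced here for illustration.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration to a total number of seconds."""
    return hours * 3600 + minutes * 60 + seconds

read = to_seconds(26, 32, 37)        # 95557 s
spontaneous = to_seconds(0, 59, 13)  # 3553 s
total = read + spontaneous           # 99110 s, i.e. 27 h 31 min 50 s

read_pct = round(100 * read / total, 1)
spontaneous_pct = round(100 * spontaneous / total, 1)
print(read_pct, spontaneous_pct)  # 96.4 3.6
```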
---
## Intended Uses
- Training automatic speech recognition (ASR) systems for Swahili
- Evaluating speech-to-text models
- Phonetic and linguistic research on Swahili
- Building text-to-speech (TTS) systems
- Transfer learning for other Bantu languages
---
## Dataset Curators
- **Dorcas Awino**
- **Dr. Benard Okal**
- **Khalid Kitito**
- **Owiny Japheth Otieno**
---
## Citation
```bibtex
@article{wanjawa2022kencorpus,
title={Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks},
author={Wanjawa, Barack W. and Wanzare, Lilian D. and Indede, Florence and McOnyango, Owen and Ombui, Edward and Muchemi, Lawrence},
journal={arXiv preprint arXiv:2208.12081},
year={2022}
}
```
---
## Links
- **Research Paper**: https://arxiv.org/abs/2208.12081
- **Dataverse**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KLCKL5
---
## License
This dataset is licensed under **CC-BY-4.0**.
---
## Acknowledgments
This dataset is part of the **Kencorpus** project, which aims to create NLP and speech resources for low-resource Kenyan languages.
Provided by: AndyOnyango



