AndyOnyango/KenSpeech

Name: AndyOnyango/KenSpeech
Creator: AndyOnyango
Published: 2026-04-10 06:10:45
License: CC-BY-4.0 (per the dataset card)

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/AndyOnyango/KenSpeech

Description:

---
language:
- sw
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
tags:
- swahili
- kiswahili
- speech-recognition
- asr
- stt
- low-resource-languages
- african-languages
- kenya
- speech-to-text
pretty_name: KenSpeech
size_categories:
- 1K<n<10K
---

# KenSpeech: A Swahili Speech Dataset for ASR

## Dataset Description

**KenSpeech** is a Swahili speech dataset containing both read and spontaneous speech recordings from native Swahili speakers in Kenya. It is designed for training and evaluating automatic speech recognition (ASR) and speech-to-text (STT) systems for Swahili.

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total Duration | 27 hours 31 minutes 50 seconds |
| Read Speech Duration | 26 hours 32 minutes 37 seconds |
| Spontaneous Speech Duration | 59 minutes 13 seconds |
| Total Speakers | 26 |
| Female Speakers | 19 |
| Male Speakers | 7 |
| Lexicon Words | 31,728+ |

## Audio Format

| Property | Value |
|----------|-------|
| Sampling Rate | 16 kHz |
| Channels | Mono |

---

## Dataset Format

The dataset is distributed as **Parquet files** with embedded audio for optimal compatibility:

- **Format**: Apache Parquet (with embedded audio bytes)
- **Encoding**: UTF-8 for text fields
- **Compatibility**: Works with `datasets` 4.0.0+ without custom loading scripts

## Data Fields

| Column | Type | Description |
|--------|------|-------------|
| audio | Audio | Audio waveform (decoded array + sampling_rate) |
| source_folder | string | Origin folder (`stt_dictionary` or `stt_transcripts`) |
| gender | string | Speaker gender (`male` or `female`) |
| speaker | string | Speaker identifier (`speaker_1`, `speaker_2`, etc.) |
| transcript | string | Transcription text |

### Example Record

```python
{
    'audio': {'path': '...', 'array': array([0.001, -0.003, ...]), 'sampling_rate': 16000},
    'source_folder': 'stt_dictionary',
    'gender': 'female',
    'speaker': 'speaker_1',
    'transcript': 'masaa mawili kabla basi kuwasili...'
}
```

---

## Usage

### Loading with Hugging Face Datasets

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Kencorpus/KenSpeech")

# Access a sample
sample = dataset['train'][0]
print(sample['transcript'])
print(sample['gender'])
print(sample['speaker'])
print(sample['audio']['sampling_rate'])  # 16000
print(sample['audio']['array'].shape)    # audio waveform
```

### Filtering by Gender

```python
from datasets import load_dataset

dataset = load_dataset("Kencorpus/KenSpeech")

# Get female speakers only
female_data = dataset['train'].filter(lambda x: x['gender'] == 'female')
print(f"Female samples: {len(female_data)}")

# Get male speakers only
male_data = dataset['train'].filter(lambda x: x['gender'] == 'male')
print(f"Male samples: {len(male_data)}")
```

### Training an ASR Model

```python
from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Load dataset
dataset = load_dataset("Kencorpus/KenSpeech")

# Load a multilingual model. This base checkpoint ships no tokenizer, so only
# the feature extractor is loaded here; a vocabulary has to be built from the
# transcripts before the CTC head can be fine-tuned.
model_name = "facebook/wav2vec2-large-xlsr-53"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# Process a sample
sample = dataset['train'][0]
inputs = feature_extractor(sample['audio']['array'], sampling_rate=16000, return_tensors="pt")
```

---

## Additional Resources

### Pronunciation Lexicon (`lexicon.csv`)

A Swahili lexicon-phone dictionary with over 31,000 words and their phonetic transcriptions.

**Format:** `word,phoneme_sequence`

```
wanapaswa,W AH N AH P AH S W AH
wanasema,W AH N AH S EH M AH
wanataka,W AH N AH T AH K AH
```

### Transcript-only Data (`transcripts_only.csv`)

Additional transcripts from the `stt_transcripts` collection without corresponding audio.

---

## Speech Types

| Type | Duration | Percentage |
|------|----------|------------|
| Read Speech | 26h 32m 37s | 96.4% |
| Spontaneous Speech | 59m 13s | 3.6% |

---

## Intended Uses

- Training automatic speech recognition (ASR) systems for Swahili
- Evaluating speech-to-text models
- Phonetic and linguistic research on Swahili
- Building text-to-speech (TTS) systems
- Transfer learning for other Bantu languages

---

## Dataset Curators

- **Dorcas Awino**
- **Dr. Benard Okal**
- **Khalid Kitito**
- **Owiny Japheth Otieno**

---

## Citation

```bibtex
@article{wanjawa2022kencorpus,
  title={Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks},
  author={Wanjawa, Barack W. and Wanzare, Lilian D. and Indede, Florence and McOnyango, Owen and Ombui, Edward and Muchemi, Lawrence},
  journal={arXiv preprint arXiv:2208.12081},
  year={2022}
}
```

---

## Links

- **Research Paper**: https://arxiv.org/abs/2208.12081
- **Dataverse**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KLCKL5

---

## License

This dataset is licensed under **CC-BY-4.0**.

---

## Acknowledgments

This dataset is part of the **Kencorpus** project, which aims to create NLP and speech resources for low-resource Kenyan languages.
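
The `word,phoneme_sequence` layout of `lexicon.csv` described under Additional Resources can be read with the standard library alone. The following is a minimal sketch, not the curators' own tooling: the helper name `parse_lexicon` is hypothetical, and it assumes the file has no header row and that phonemes within a sequence are separated by single spaces, as in the three sample lines shown in the dataset card.

```python
import csv
import io


def parse_lexicon(lines):
    """Map each word to its list of phoneme symbols.

    `lines` is any iterable of CSV text lines, e.g. an open file.
    Assumes rows look like `word,PH1 PH2 ...` with no header row.
    """
    lexicon = {}
    for row in csv.reader(lines):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed rows
        word = row[0].strip()
        lexicon[word] = row[1].split()  # split phonemes on whitespace
    return lexicon


# The three sample entries quoted in the dataset card.
sample = io.StringIO(
    "wanapaswa,W AH N AH P AH S W AH\n"
    "wanasema,W AH N AH S EH M AH\n"
    "wanataka,W AH N AH T AH K AH\n"
)
lexicon = parse_lexicon(sample)
print(lexicon["wanasema"])  # ['W', 'AH', 'N', 'AH', 'S', 'EH', 'M', 'AH']
```

For the real file, passing an open handle should work the same way, for example `with open("lexicon.csv", encoding="utf-8", newline="") as f: lexicon = parse_lexicon(f)`. If the file turns out to start with a header row, that row would need to be skipped first.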

Provider:

AndyOnyango
