ghananlpcommunity/ghana-english-asr-2700hrs
收藏Hugging Face2026-03-10 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/ghananlpcommunity/ghana-english-asr-2700hrs
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
tags:
- audio
- speech
- asr
- ghanaian-english
- west-african-english
task_categories:
- automatic-speech-recognition
pretty_name: Ghana English ASR Dataset
size_categories:
- 1K<n<10K
---
# 🇬🇭 Ghana English ASR Dataset
A speech dataset of **Ghanaian English** extracted from Ghanaian news media broadcasts,
designed for training and fine-tuning **Automatic Speech Recognition (ASR)** models on
West African English accents.
---
## 📂 Dataset Structure
| Column | Type | Description |
|-----------------|--------|--------------------------------------------------|
| `audio` | Audio | 16 kHz mono WAV audio segment |
| `corrected_text`| string | Verbatim transcription of the audio segment |
| `duration_ss` | float | Duration of the audio segment in seconds |
---
## 📊 Statistics
| Metric | Value |
|-------------------------|----------------------------------|
| Total clips | 729,476 |
| Total duration | **2706.77 hours** |
| Mean clip duration | 13.36 s |
| Min / Max clip duration | 0.24 s / 37.13 s |
| Mean words per clip | 33.4 |
| Min / Max words | 1 / 141 |
| Vocabulary size | 328,236 unique words |
| Sample rate | 16,000 Hz (mono) |
---
## 🚀 Usage
```python
from datasets import load_dataset
dataset = load_dataset("ghananlpcommunity/ghana-english-asr-2700hrs")
train = dataset["train"]
example = train[0]
print("Transcription:", example["corrected_text"])
print("Duration (s):", example["duration_ss"])
print("Audio array shape:", example["audio"]["array"].shape)
print("Sample rate:", example["audio"]["sampling_rate"])
```
### Fine-tuning with Whisper
```python
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
def prepare_batch(batch):
audio = batch["audio"]
batch["input_features"] = processor(
audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
).input_features[0]
batch["labels"] = processor.tokenizer(batch["corrected_text"]).input_ids
return batch
dataset = dataset.map(prepare_batch, remove_columns=dataset.column_names)
```
---
## 🎯 Intended Use Cases
- Fine-tuning Whisper, Wav2Vec2, MMS for **Ghanaian / West African English**
- Building accent-aware ASR pipelines for Ghanaian broadcast media
- Linguistic research on Ghanaian English phonology and prosody
- Low-resource African language / dialect ASR benchmarking
---
## ⚠️ Limitations
- Domain-specific: broadcast news only, may not generalise to conversational English.
- Speaker diversity not formally audited.
- Transcriptions may contain occasional errors in proper nouns.
---
## 📜 Citation
```bibtex
@dataset{ghana_english_asr,
author = {Owusu, Mich-Seth},
title = {Ghana English ASR Dataset},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ghananlpcommunity/ghana-english-asr-2700hrs}
}
```
---
## 🙏 Acknowledgments
Created by **Mich-Seth Owusu** for the **Ghana NLP Community**.
提供机构:
ghananlpcommunity



