ghananlpcommunity/ghana-english-speech-600hrs
Source: Hugging Face · Updated 2026-03-06 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ghananlpcommunity/ghana-english-speech-600hrs
---
language:
- en
license: cc-by-4.0
tags:
- audio
- speech
- asr
- ghanaian-english
- west-african-english
task_categories:
- automatic-speech-recognition
pretty_name: Ghana English ASR Dataset
size_categories:
- 100K<n<1M
---
# 🇬🇭 Ghana English ASR Dataset
A speech dataset of **Ghanaian English** extracted from Ghanaian news media broadcasts,
designed for training and fine-tuning **Automatic Speech Recognition (ASR)** models on
West African English accents.
---
## 📂 Dataset Structure
| Column | Type | Description |
|-----------------|--------|--------------------------------------------------|
| `audio` | Audio | 24 kHz mono WAV audio segment |
| `corrected_text`| string | Verbatim transcription of the audio segment |
| `duration_ss` | float | Duration of the audio segment in seconds |
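
The columns are related: `duration_ss` should agree with the length of the decoded audio array divided by its sampling rate. The sketch below checks this on a synthetic record shaped like a dataset row. The helper names and the 50 ms tolerance are illustrative assumptions, not part of the dataset.

```python
def audio_duration_seconds(audio: dict) -> float:
    """Duration implied by a decoded audio dict with 'array' and 'sampling_rate'."""
    return len(audio["array"]) / audio["sampling_rate"]


def duration_matches(example: dict, tolerance_s: float = 0.05) -> bool:
    """Return True if `duration_ss` agrees with the decoded audio length.

    `tolerance_s` is an assumed tolerance, not a documented dataset property.
    """
    implied = audio_duration_seconds(example["audio"])
    return abs(implied - example["duration_ss"]) <= tolerance_s


# Synthetic stand-in for one row: 2 seconds of silence at 24 kHz.
fake_example = {
    "audio": {"array": [0.0] * 48_000, "sampling_rate": 24_000},
    "corrected_text": "placeholder transcript",
    "duration_ss": 2.0,
}
print(duration_matches(fake_example))
```
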
---
## 📊 Statistics
| Metric | Value |
|-------------------------|----------------------------------|
| Total clips | 406,094 |
| Total duration | **619.52 hours** |
| Mean clip duration | 5.49 s |
| Min / Max clip duration | 2.0 s / 12.0 s |
| Mean words per clip | 14.3 |
| Min / Max words | 1 / 98 |
| Vocabulary size | 84,832 unique words |
| Sample rate | 24,000 Hz (mono) |
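
As a quick consistency check on the table, the total duration should be roughly the clip count multiplied by the mean clip duration. The snippet below uses only the figures reported above; any small gap is expected because the mean is rounded to two decimals.

```python
# Figures copied from the statistics table.
TOTAL_CLIPS = 406_094
TOTAL_HOURS = 619.52
MEAN_CLIP_S = 5.49


def hours_from_clips(n_clips: int, mean_clip_s: float) -> float:
    """Total hours implied by a clip count and a mean clip length in seconds."""
    return n_clips * mean_clip_s / 3600.0


def mean_clip_seconds(total_hours: float, n_clips: int) -> float:
    """Mean clip length in seconds implied by total hours and a clip count."""
    return total_hours * 3600.0 / n_clips


estimated_hours = hours_from_clips(TOTAL_CLIPS, MEAN_CLIP_S)
implied_mean = mean_clip_seconds(TOTAL_HOURS, TOTAL_CLIPS)
print(f"Estimated total: {estimated_hours:.2f} h (reported {TOTAL_HOURS} h)")
print(f"Implied mean clip: {implied_mean:.3f} s (reported {MEAN_CLIP_S} s)")
```
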
---
## 🚀 Usage
```python
from datasets import load_dataset
dataset = load_dataset("ghananlpcommunity/ghana-english-speech-600hrs")
train = dataset["train"]
example = train[0]
print("Transcription:", example["corrected_text"])
print("Duration (s):", example["duration_ss"])
print("Audio array shape:", example["audio"]["array"].shape)
print("Sample rate:", example["audio"]["sampling_rate"])
```
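
Because each row carries `duration_ss`, clips can be selected by length with a simple predicate. This is a minimal sketch run on hypothetical in-memory rows rather than the real dataset; the function name, the sample transcripts, and the 3 to 10 second window are assumptions chosen for illustration.

```python
def within_duration(example: dict, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Return True if the clip length lies in the closed range [min_s, max_s]."""
    return min_s <= example["duration_ss"] <= max_s


# Hypothetical rows that mimic the dataset's text and duration columns.
sample_rows = [
    {"corrected_text": "good evening", "duration_ss": 2.1},
    {"corrected_text": "the minister addressed parliament today", "duration_ss": 5.4},
    {"corrected_text": "and now the weather", "duration_ss": 11.8},
]

kept = [row for row in sample_rows if within_duration(row)]
print(len(kept), "of", len(sample_rows), "rows kept")
```

With the `datasets` library, the same predicate can be passed to `train.filter(within_duration)`.
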
### Fine-tuning with Whisper
```python
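from datasets import Audio
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
# Whisper's feature extractor expects 16 kHz audio; the clips are 24 kHz, so resample on decode.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_batch(batch):
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["corrected_text"]).input_ids
    return batch

# `dataset` is a DatasetDict, so take the column list from the "train" split.
dataset = dataset.map(prepare_batch, remove_columns=dataset.column_names["train"])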
```
---
## 🎯 Intended Use Cases
- Fine-tuning Whisper, Wav2Vec2, MMS for **Ghanaian / West African English**
- Building accent-aware ASR pipelines for Ghanaian broadcast media
- Linguistic research on Ghanaian English phonology and prosody
- Low-resource African language / dialect ASR benchmarking
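
For the benchmarking use case, word error rate (WER) is the usual ASR metric: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain-Python version with a deliberately simple normalisation (lowercasing and whitespace splitting), which is an assumption; established evaluation packages apply more careful text normalisation. The example sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the cedi gained against the dollar",
    "the cedi gain against dollar",
)
print(f"WER: {wer:.3f}")
```
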
---
## ⚠️ Limitations
- Domain-specific: the audio is broadcast news only, so models may not generalise to conversational English.
- Speaker diversity not formally audited.
- Transcriptions may contain occasional errors in proper nouns.
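
One inexpensive way to surface suspicious transcripts is to compare word count with clip duration. The statistics above list up to 98 words per clip while no clip exceeds 12 s, which would mean more than 8 words per second. The sketch below flags rows outside a plausible speaking-rate band. The thresholds (0.5 and 5 words per second), the helper names, and the sample rows are assumptions for illustration, not values used by the dataset authors.

```python
def words_per_second(example: dict) -> float:
    """Whitespace-delimited words in the transcript divided by clip duration."""
    return len(example["corrected_text"].split()) / example["duration_ss"]


def is_suspicious(example: dict, min_wps: float = 0.5, max_wps: float = 5.0) -> bool:
    """Flag clips whose speaking rate falls outside the assumed plausible band."""
    rate = words_per_second(example)
    return rate < min_wps or rate > max_wps


# Hypothetical rows: one typical, one implausibly dense, one nearly empty.
candidate_rows = [
    {"corrected_text": " ".join(["word"] * 14), "duration_ss": 5.5},
    {"corrected_text": " ".join(["word"] * 98), "duration_ss": 12.0},
    {"corrected_text": "yes", "duration_ss": 4.0},
]

flagged = [row for row in candidate_rows if is_suspicious(row)]
print(len(flagged), "of", len(candidate_rows), "rows flagged for review")
```
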
---
## 📜 Citation
```bibtex
@dataset{ghana_english_asr,
author = {Owusu, Mich-Seth},
title = {Ghana English ASR Dataset},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ghananlpcommunity/ghana-english-speech-600hrs}
}
```
---
## 🙏 Acknowledgments
Created by **Mich-Seth Owusu** for the **Ghana NLP Community**.
---
Provider:
ghananlpcommunity



