ghananlpcommunity/ghana-english-asr-2700hrs

Name: ghananlpcommunity/ghana-english-asr-2700hrs
Creator: ghananlpcommunity
Published: 2026-03-10 02:47:14
License: not specified

Source: Hugging Face. Updated 2026-03-10; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/ghananlpcommunity/ghana-english-asr-2700hrs

Dataset description:

---
language:
- en
tags:
- audio
- speech
- asr
- ghanaian-english
- west-african-english
task_categories:
- automatic-speech-recognition
pretty_name: Ghana English ASR Dataset
size_categories:
- 1K<n<10K
---

# 🇬🇭 Ghana English ASR Dataset

A speech dataset of **Ghanaian English** extracted from Ghanaian news media broadcasts, designed for training and fine-tuning **Automatic Speech Recognition (ASR)** models on West African English accents.

---

## 📂 Dataset Structure

| Column           | Type   | Description                                 |
|------------------|--------|---------------------------------------------|
| `audio`          | Audio  | 16 kHz mono WAV audio segment               |
| `corrected_text` | string | Verbatim transcription of the audio segment |
| `duration_ss`    | float  | Duration of the audio segment in seconds    |

---

## 📊 Statistics

| Metric                  | Value                |
|-------------------------|----------------------|
| Total clips             | 729,476              |
| Total duration          | **2706.77 hours**    |
| Mean clip duration      | 13.36 s              |
| Min / Max clip duration | 0.24 s / 37.13 s     |
| Mean words per clip     | 33.4                 |
| Min / Max words         | 1 / 141              |
| Vocabulary size         | 328,236 unique words |
| Sample rate             | 16,000 Hz (mono)     |

---

## 🚀 Usage

```python
from datasets import load_dataset

# Download (or load from cache) all splits of the dataset
dataset = load_dataset("ghananlpcommunity/ghana-english-asr-2700hrs")
train = dataset["train"]

# Inspect the first example
example = train[0]
print("Transcription:", example["corrected_text"])
print("Duration (s):", example["duration_ss"])
print("Audio array shape:", example["audio"]["array"].shape)
print("Sample rate:", example["audio"]["sampling_rate"])
```

### Fine-tuning with Whisper

```python
from datasets import load_dataset
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
train = load_dataset("ghananlpcommunity/ghana-english-asr-2700hrs", split="train")

def prepare_batch(batch):
    # Convert the raw waveform into log-mel input features
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_tensors="pt",
    ).input_features[0]
    # Tokenize the transcription to produce the training labels
    batch["labels"] = processor.tokenizer(batch["corrected_text"]).input_ids
    return batch

# Map over a single split: `column_names` is a plain list on a Dataset,
# whereas on a DatasetDict it is a dict keyed by split name.
train = train.map(prepare_batch, remove_columns=train.column_names)
```

---

## 🎯 Intended Use Cases

- Fine-tuning Whisper, Wav2Vec2, and MMS for **Ghanaian / West African English**
- Building accent-aware ASR pipelines for Ghanaian broadcast media
- Linguistic research on Ghanaian English phonology and prosody
- Low-resource African language / dialect ASR benchmarking

---

## ⚠️ Limitations

- Domain-specific: the audio is broadcast news only, so models may not generalise to conversational English.
- Speaker diversity has not been formally audited.
- Transcriptions may contain occasional errors in proper nouns.

---

## 📜 Citation

```bibtex
@dataset{ghana_english_asr,
  author    = {Owusu, Mich-Seth},
  title     = {Ghana English ASR Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ghananlpcommunity/ghana-english-asr-2700hrs}
}
```

---

## 🙏 Acknowledgments

Created by **Mich-Seth Owusu** for the **Ghana NLP Community**.
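
### Filtering clips by duration (illustrative sketch)

The statistics above report clips as short as 0.24 s and as long as 37.13 s, while Whisper's feature extractor pads or truncates its input to a 30 s window. The sketch below shows one possible way to drop out-of-range clips and summarise what remains. It is not part of the original dataset card: the helper names `keep_clip` and `summarize`, the 1.0 s lower bound, and the sample rows are all assumptions chosen for illustration. Only the column names `corrected_text` and `duration_ss` come from the dataset.

```python
from statistics import mean

# Whisper's feature extractor pads or truncates audio to a 30 s window.
WHISPER_WINDOW_S = 30.0


def keep_clip(example, min_s=1.0, max_s=WHISPER_WINDOW_S):
    """Return True if the clip duration lies within [min_s, max_s].

    The 1.0 s lower bound is an assumed default, not a recommendation
    from the dataset authors.
    """
    return min_s <= example["duration_ss"] <= max_s


def summarize(examples):
    """Compute clip count, total hours, and mean duration and word count."""
    rows = list(examples)
    durations = [row["duration_ss"] for row in rows]
    words = [len(row["corrected_text"].split()) for row in rows]
    return {
        "clips": len(rows),
        "hours": sum(durations) / 3600,
        "mean_duration_s": mean(durations) if durations else 0.0,
        "mean_words": mean(words) if words else 0.0,
    }


# Hypothetical rows standing in for real dataset examples.
rows = [
    {"corrected_text": "good evening", "duration_ss": 0.24},
    {"corrected_text": "the minister addressed parliament today", "duration_ss": 12.0},
    {"corrected_text": "a long segment " * 10, "duration_ss": 37.13},
]

kept = [row for row in rows if keep_clip(row)]
stats = summarize(kept)
print(stats)
```

On the real data, the same predicate should be usable as `train.filter(keep_clip)`, since `Dataset.filter` accepts a per-example function that returns a boolean. Whether to drop or instead split the clips longer than 30 s is a design choice that depends on the model being fine-tuned.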

Provider:

ghananlpcommunity
