
ghananlpcommunity/ghana-english-speech-600hrs

Source: Hugging Face. Updated 2026-03-06. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ghananlpcommunity/ghana-english-speech-600hrs
Resource description:
---
language:
- en
license: cc-by-4.0
tags:
- audio
- speech
- asr
- ghanaian-english
- west-african-english
task_categories:
- automatic-speech-recognition
pretty_name: Ghana English ASR Dataset
size_categories:
- 1K<n<10K
---

# 🇬🇭 Ghana English ASR Dataset

A speech dataset of **Ghanaian English** extracted from Ghanaian news media broadcasts, designed for training and fine-tuning **Automatic Speech Recognition (ASR)** models on West African English accents.

---

## 📂 Dataset Structure

| Column           | Type   | Description                                 |
|------------------|--------|---------------------------------------------|
| `audio`          | Audio  | 24 kHz mono WAV audio segment               |
| `corrected_text` | string | Verbatim transcription of the audio segment |
| `duration_ss`    | float  | Duration of the audio segment in seconds    |

---

## 📊 Statistics

| Metric                  | Value               |
|-------------------------|---------------------|
| Total clips             | 406,094             |
| Total duration          | **619.52 hours**    |
| Mean clip duration      | 5.49 s              |
| Min / Max clip duration | 2.0 s / 12.0 s      |
| Mean words per clip     | 14.3                |
| Min / Max words         | 1 / 98              |
| Vocabulary size         | 84,832 unique words |
| Sample rate             | 24,000 Hz (mono)    |

---

## 🚀 Usage

```python
from datasets import load_dataset

dataset = load_dataset("ghananlpcommunity/ghana-english-speech-600hrs")
train = dataset["train"]

example = train[0]
print("Transcription:", example["corrected_text"])
print("Duration (s):", example["duration_ss"])
print("Audio array shape:", example["audio"]["array"].shape)
print("Sample rate:", example["audio"]["sampling_rate"])
```

### Fine-tuning with Whisper

```python
from datasets import Audio
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Whisper's feature extractor expects 16 kHz input, so resample the
# 24 kHz clips when they are decoded.
train = dataset["train"].cast_column("audio", Audio(sampling_rate=16_000))

def prepare_batch(batch):
    # Called once per example, because map() is used without batched=True.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_tensors="pt",
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["corrected_text"]).input_ids
    return batch

# Map over the train split: column_names on the full DatasetDict is a
# dict keyed by split, not the flat list that remove_columns needs.
train = train.map(prepare_batch, remove_columns=train.column_names)
```

---

## 🎯 Intended Use Cases

- Fine-tuning Whisper, Wav2Vec2, and MMS for **Ghanaian / West African English**
- Building accent-aware ASR pipelines for Ghanaian broadcast media
- Linguistic research on Ghanaian English phonology and prosody
- Low-resource African language / dialect ASR benchmarking

---

## ⚠️ Limitations

- Domain-specific: the data covers broadcast news only and may not generalise to conversational English.
- Speaker diversity has not been formally audited.
- Transcriptions may contain occasional errors in proper nouns.

---

## 📜 Citation

```bibtex
@dataset{ghana_english_asr,
  author    = {Owusu, Mich-Seth},
  title     = {Ghana English ASR Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ghananlpcommunity/ghana-english-speech-600hrs}
}
```

---

## 🙏 Acknowledgments

Created by **Mich-Seth Owusu** for the **Ghana NLP Community**.
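
---

## 🧮 Cross-Checking the Statistics

The figures in the Statistics table can be cross-checked against each other without downloading any audio. The sketch below is not part of the original card: it multiplies the reported clip count by the reported mean clip duration and compares the result with the reported total. The constant and function names are hypothetical, and the mean duration is assumed to be rounded to two decimals, so a small gap is expected.

```python
# Figures copied from the Statistics table of the dataset card.
TOTAL_CLIPS = 406_094
MEAN_CLIP_SECONDS = 5.49    # assumed to be rounded to two decimals
REPORTED_HOURS = 619.52
MEAN_WORDS_PER_CLIP = 14.3


def estimated_hours(num_clips: int, mean_seconds: float) -> float:
    """Estimate total duration in hours from clip count and mean clip length."""
    return num_clips * mean_seconds / 3600.0


def words_per_second(mean_words: float, mean_seconds: float) -> float:
    """Average speaking rate implied by the per-clip means."""
    return mean_words / mean_seconds


hours = estimated_hours(TOTAL_CLIPS, MEAN_CLIP_SECONDS)
rate = words_per_second(MEAN_WORDS_PER_CLIP, MEAN_CLIP_SECONDS)

# Largest gap that rounding the mean to 0.01 s could explain, in hours.
rounding_slack = 0.005 * TOTAL_CLIPS / 3600.0

print(f"Estimated total: {hours:.2f} h (card reports {REPORTED_HOURS} h)")
print(f"Rounding slack:  {rounding_slack:.2f} h")
print(f"Implied rate:    {rate:.2f} words/s")
```

The estimate lands within the rounding slack of the reported total, so the clip count, mean duration, and total duration are mutually consistent. The implied rate is only a rough average, since it divides two means rather than averaging per-clip rates.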

Provider:
ghananlpcommunity