ghananlpcommunity/twi-speech-sota-240hrs

Name: ghananlpcommunity/twi-speech-sota-240hrs
Creator: ghananlpcommunity
Published: 2026-04-08 13:02:33
License: No description available

Source: Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/ghananlpcommunity/twi-speech-sota-240hrs

Resource description:

---
language:
- tw
- ak
license: cc-by-4.0
tags:
- audio
- speech
- tts
- twi
- akan
- ghanaian-languages
task_categories:
- automatic-speech-recognition
- text-to-speech
pretty_name: Twi TTS Dataset
size_categories:
- 100K<n<1M
---

# 🇬🇭 Twi ASR Dataset

A speech dataset of **Twi (Akan)** extracted from Ghanaian news media broadcasts, designed for training and fine-tuning **Text-To-Speech (TTS)** models.

---

## 📂 Dataset Structure

| Column     | Type   | Description                                     |
|------------|--------|-------------------------------------------------|
| `audio`    | Audio  | 24 kHz mono WAV audio segment                   |
| `text`     | string | Verbatim Twi transcription of the audio segment |
| `duration` | float  | Duration of the audio segment in seconds        |

---

## 📊 Statistics

| Metric                  | Value               |
|-------------------------|---------------------|
| Total clips             | 132,212             |
| Total duration          | **237.71 hours**    |
| Mean clip duration      | 6.47 s              |
| Min / Max clip duration | 1.01 s / 15.0 s     |
| Mean words per clip     | 16.0                |
| Min / Max words         | 1 / 16              |
| Vocabulary size         | 42,970 unique words |
| Sample rate             | 24,000 Hz (mono)    |

---

## 🚀 Usage

```python
from datasets import load_dataset

dataset = load_dataset("ghananlpcommunity/twi-speech-sota-240hrs")
train = dataset["train"]

example = train[0]
print("Transcription:", example["text"])
print("Duration (s):", example["duration"])
print("Audio array shape:", example["audio"]["array"].shape)
print("Sample rate:", example["audio"]["sampling_rate"])
```

---

## 🎯 Intended Use Cases

- Building TTS models from scratch or finetuning for **Twi (Akan)**
- Linguistic research on Twi phonology and prosody
- Low-resource African language ASR benchmarking

---

## 📜 Citation

```bibtex
@dataset{twi_asr,
  author    = {Owusu, Mich-Seth},
  title     = {Twi ASR Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ghananlpcommunity/twi-speech-sota-200hrs}
}
```

---

## 🙏 Acknowledgments

Created by **Mich-Seth Owusu** for the **Ghana NLP Community**.
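
The figures in the Statistics table can be cross-checked against the `duration` column. The following is a minimal sketch, not part of the original dataset card: the helper names `summarize_durations` and `implied_hours` are hypothetical, and the only inputs assumed are per-clip durations in seconds as described in the Dataset Structure table.

```python
from typing import Iterable


def summarize_durations(durations: Iterable[float]) -> dict:
    """Return clip count, total hours, and mean/min/max clip duration in seconds."""
    values = [float(d) for d in durations]
    if not values:
        raise ValueError("no durations given")
    total_seconds = sum(values)
    return {
        "clips": len(values),
        "total_hours": total_seconds / 3600.0,
        "mean_s": total_seconds / len(values),
        "min_s": min(values),
        "max_s": max(values),
    }


def implied_hours(num_clips: int, mean_clip_seconds: float) -> float:
    """Total duration in hours implied by a clip count and a mean clip length."""
    return num_clips * mean_clip_seconds / 3600.0


# Cross-check using only the published figures: 132,212 clips at a mean of 6.47 s.
# The result is about 237.61 h, close to the reported 237.71 h; the small gap is
# what one would expect from the mean being rounded to two decimals.
print(round(implied_hours(132_212, 6.47), 2))
```

With the dataset loaded as in the Usage section, something like `summarize_durations(train["duration"])` should reproduce the table without decoding any audio, assuming the column can be read as an iterable of floats.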

Provider:

ghananlpcommunity
