
ghananlpcommunity/twi-speech-sota-240hrs

Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ghananlpcommunity/twi-speech-sota-240hrs
Description:
---
language:
- tw
- ak
license: cc-by-4.0
tags:
- audio
- speech
- tts
- twi
- akan
- ghanaian-languages
task_categories:
- automatic-speech-recognition
- text-to-speech
pretty_name: Twi TTS Dataset
size_categories:
- 100K<n<1M
---

# 🇬🇭 Twi ASR Dataset

A speech dataset of **Twi (Akan)** extracted from Ghanaian news media broadcasts, designed for training and fine-tuning **Text-To-Speech (TTS)** models.

---

## 📂 Dataset Structure

| Column     | Type   | Description                                     |
|------------|--------|-------------------------------------------------|
| `audio`    | Audio  | 24 kHz mono WAV audio segment                   |
| `text`     | string | Verbatim Twi transcription of the audio segment |
| `duration` | float  | Duration of the audio segment in seconds        |

---

## 📊 Statistics

| Metric                  | Value               |
|-------------------------|---------------------|
| Total clips             | 132,212             |
| Total duration          | **237.71 hours**    |
| Mean clip duration      | 6.47 s              |
| Min / Max clip duration | 1.01 s / 15.0 s     |
| Mean words per clip     | 16.0                |
| Min / Max words         | 1 / 16              |
| Vocabulary size         | 42,970 unique words |
| Sample rate             | 24,000 Hz (mono)    |

---

## 🚀 Usage

```python
from datasets import load_dataset

dataset = load_dataset("ghananlpcommunity/twi-speech-sota-240hrs")
train = dataset["train"]

example = train[0]
print("Transcription:", example["text"])
print("Duration (s):", example["duration"])
print("Audio array shape:", example["audio"]["array"].shape)
print("Sample rate:", example["audio"]["sampling_rate"])
```

---

## 🎯 Intended Use Cases

- Building TTS models from scratch or fine-tuning for **Twi (Akan)**
- Linguistic research on Twi phonology and prosody
- Low-resource African language ASR benchmarking

---

## 📜 Citation

```bibtex
@dataset{twi_asr,
  author    = {Owusu, Mich-Seth},
  title     = {Twi ASR Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ghananlpcommunity/twi-speech-sota-200hrs}
}
```

---

## 🙏 Acknowledgments

Created by **Mich-Seth Owusu** for the **Ghana NLP Community**.
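
The figures in the statistics table can be cross-checked with simple arithmetic, and the `duration` column lends itself to selecting clips by length. The sketch below is illustrative only: it works on plain dictionaries that mimic the `text` and `duration` columns, so it runs without downloading anything. The helper names (`estimated_hours`, `filter_by_duration`, `words_per_second`) and the sample rows are hypothetical and are not part of the dataset or of the `datasets` library.

```python
def estimated_hours(num_clips: int, mean_seconds: float) -> float:
    """Approximate total duration in hours from clip count and mean clip length."""
    return num_clips * mean_seconds / 3600.0


def filter_by_duration(rows, min_s=1.0, max_s=15.0):
    """Keep rows whose `duration` (seconds) falls inside [min_s, max_s]."""
    return [row for row in rows if min_s <= row["duration"] <= max_s]


def words_per_second(row) -> float:
    """Rough speaking-rate proxy: whitespace-separated words divided by duration."""
    return len(row["text"].split()) / row["duration"]


# Clip count and mean clip duration as reported in the statistics table.
approx = estimated_hours(132_212, 6.47)
print(f"Estimated total: {approx:.2f} h (reported: 237.71 h)")

# Made-up rows for illustration; the text is placeholder, not real Twi transcripts.
sample_rows = [
    {"text": "word one two", "duration": 1.5},
    {"text": "too short", "duration": 0.4},
    {"text": "a b c d e f", "duration": 3.0},
]
kept = filter_by_duration(sample_rows)
rates = [words_per_second(row) for row in kept]
print(len(kept), rates)
```

The estimate lands within about 0.1 hours of the reported total, which is consistent with the mean clip duration having been rounded to two decimals. On the loaded dataset, a comparable selection could presumably be made with `train.filter(lambda d: 1.0 <= d <= 15.0, input_columns="duration")`, which avoids decoding the audio column during filtering.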
Provider:
ghananlpcommunity