
speedykom-group/pokoot-east-speech-dataset

Source: Hugging Face. Updated 2026-04-29. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/speedykom-group/pokoot-east-speech-dataset
Description:
The Pokoot East Speech Dataset is a speech corpus for Pokoot East, a dialect within the Southern Nilotic (Kalenjin) group spoken in Kenya. It pairs WAV audio (16 kHz, mono) with UTF-8 transcripts and is split into train (600 samples, 60%), validation (200 samples, 20%), and test (200 samples, 20%) sets, 1,000 samples in total. The audio comes from GRN Bible narratives by Global Recordings Network and was segmented by silence detection. The transcripts were generated automatically with the Kpz adapter of the `facebook/mms-1b-all` model, so manual review and correction are recommended. Speedykom created the dataset to advance speech technology for underserved African languages.
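
The description says only that the audio was "segmented via silence detection"; it does not name the tool or its parameters. As a rough illustration of the general idea, the sketch below splits a 16 kHz mono signal wherever the frame energy stays below a threshold for long enough. The function names (`frame_rms`, `split_on_silence`), the 20 ms frame size, the 0.01 RMS threshold, and the 300 ms minimum pause are all assumptions chosen for the example, not the settings used to build this dataset.

```python
import math

SAMPLE_RATE = 16_000  # matches the dataset's stated 16 kHz mono audio


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping, full-length frame."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def split_on_silence(samples, sample_rate=SAMPLE_RATE, frame_ms=20,
                     threshold=0.01, min_silence_ms=300):
    """Return (start, end) sample indices of non-silent segments.

    A segment is closed only after at least `min_silence_ms` of consecutive
    frames fall below `threshold`; shorter pauses stay inside the segment.
    All default values are illustrative assumptions.
    """
    frame_len = sample_rate * frame_ms // 1000
    min_silent_frames = max(1, min_silence_ms // frame_ms)
    energies = frame_rms(samples, frame_len)

    segments = []
    seg_start = None  # frame index where the current segment began
    silent_run = 0    # consecutive silent frames seen inside a segment
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if seg_start is None:
                seg_start = i
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                seg_end = i - silent_run + 1  # index of the first silent frame
                segments.append((seg_start * frame_len, seg_end * frame_len))
                seg_start, silent_run = None, 0

    if seg_start is not None:
        # Close a segment that runs to the end, trimming trailing silence.
        seg_end = len(energies) - silent_run
        segments.append((seg_start * frame_len, seg_end * frame_len))
    return segments


def tone(seconds, freq=220.0, amp=0.5, sample_rate=SAMPLE_RATE):
    """Synthetic sine tone standing in for speech (test signal only)."""
    n = int(seconds * sample_rate)
    return [amp * math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n)]


# 1.0 s of "speech", 0.5 s of silence, 0.5 s of "speech".
signal = tone(1.0) + [0.0] * 8000 + tone(0.5)
segments = split_on_silence(signal)
print(segments)  # → [(0, 16000), (24000, 32000)]
```

Each returned pair can be used to slice the sample list into one clip. Real recordings have background noise, so a fixed threshold like the one here would normally need tuning, or replacing with a noise-adaptive rule, before it gives usable boundaries.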
Provider:
speedykom-group