speedykom-group/karamojong-speech-dataset
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/speedykom-group/karamojong-speech-dataset
Resource description:
---
language:
- kdj
license: cc-by-nc-4.0
task_categories:
- text-to-speech
- automatic-speech-recognition
tags:
- karamojong
- ngarimojong
- ateker
- nilotic
- africa
- uganda
- mms
- vits
pretty_name: SpeedyKom Ng'akarimojong Speech Dataset
size_categories:
- n<1K
---
# Ng'akarimojong Speech Dataset
<img src="https://speedykom.de/speedykom-small.png" alt="Speedykom" width="150"/>
> Speech dataset for **Ng'akarimojong (kdj)**, an Eastern Nilotic language with roughly 370,000 speakers in Karamoja, northeastern Uganda.
| Property | Value |
|---|---|
| **Format** | WAV, 16 kHz, mono / UTF-8 transcripts |
| **Source** | Global Recordings Network (GRN) recordings, segmented via silence detection |
| **Transcription** | Auto-generated via `facebook/mms-1b-all` (Teso adapter) |
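
The table states that the source recordings were segmented via silence detection but does not describe the method. As an illustration only, the following is a minimal energy-based sketch of that general technique using NumPy. The function name `split_on_silence`, the 20 ms frame size, the -40 dB threshold, and the 300 ms minimum pause are all assumptions chosen for the example, not values taken from the dataset's actual pipeline.

```python
import numpy as np


def split_on_silence(samples, sample_rate=16_000, frame_ms=20,
                     threshold_db=-40.0, min_silence_ms=300):
    """Return (start, end) sample indices of non-silent segments.

    `samples` is expected to be a mono waveform scaled to [-1, 1].
    All thresholds are illustrative defaults, not the dataset's settings.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return []

    frames = np.asarray(samples[: n_frames * frame_len], dtype=np.float64)
    frames = frames.reshape(n_frames, frame_len)

    # RMS level of each frame in dB relative to full scale.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(np.maximum(rms, 1e-10))
    voiced = level_db > threshold_db

    # Number of consecutive silent frames that closes a segment.
    min_gap = max(1, min_silence_ms // frame_ms)

    segments = []
    start = None      # index of the first frame of the open segment
    silent_run = 0    # silent frames seen since the last voiced frame
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_gap:
                end = i - silent_run + 1  # exclusive end frame
                segments.append((start * frame_len, end * frame_len))
                start = None
                silent_run = 0

    # Close a segment that runs to (or near) the end of the audio.
    if start is not None:
        end = n_frames - silent_run
        segments.append((start * frame_len, end * frame_len))
    return segments
```

Each returned pair can be used to slice the waveform, for example `samples[start:end]`, before writing the segment to its own WAV file. Short pauses below `min_silence_ms` stay inside a segment, so words are not split at brief hesitations.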
## About
This dataset was created by [Speedykom](https://speedykom.de) as part of an effort to advance speech technology for underserved African languages.
Ng'akarimojong is spoken by approximately 370,000 people in the Karamoja sub-region of northeastern Uganda and belongs to the Ateker (Teso-Turkana) language cluster within the Eastern Nilotic family.
## Usage
```python
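from datasets import load_dataset

# Download the dataset from the Hugging Face Hub (requires the `datasets` package).
ds = load_dataset("speedykom-group/karamojong-speech-dataset")
# Inspect the first example of the train split.
print(ds["train"][0])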
```
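
If the audio files are downloaded locally, it can be useful to confirm that a file matches the format stated in the table above (WAV, 16 kHz, mono). The following sketch uses only Python's standard `wave` module. The helper name `check_wav_format` and the example file name are hypothetical, not part of the dataset.

```python
import wave


def check_wav_format(path, expected_rate=16_000, expected_channels=1):
    """Return True if the WAV file has the expected sample rate and channel count."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == expected_rate
                and wav.getnchannels() == expected_channels)
```

A call such as `check_wav_format("segment_0001.wav")` (file name assumed for illustration) returns `False` for stereo files or files at another sample rate, which would need resampling or downmixing before use with 16 kHz models.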
## Notes
- Transcriptions were generated automatically with the **Teso (teo)** ASR adapter of `facebook/mms-1b-all`, the closest language to Karamojong for which an adapter is available. Because the adapter targets a related language rather than Karamojong itself, manual review by a native speaker is recommended.
- Audio is sourced from publicly available GRN recordings. Please respect the original license terms.
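
For readers who want to reproduce or re-run this kind of automatic transcription, the sketch below shows one way to apply the Teso adapter with the Hugging Face `transformers` library. It follows the documented MMS usage pattern (`set_target_lang` on the tokenizer, `load_adapter` on the model) but is an assumption about the general approach, not the exact script used to build this dataset. The function name `transcribe` is hypothetical, and the first call downloads a large model.

```python
MODEL_ID = "facebook/mms-1b-all"
TARGET_LANG = "teo"  # Teso adapter, as named in the notes above


def transcribe(waveform, sampling_rate=16_000):
    """Transcribe a mono float waveform (16 kHz) with the MMS Teso adapter."""
    # Imported lazily: both packages are large and only needed when transcribing.
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

    # Switch the tokenizer vocabulary and the adapter weights to Teso.
    processor.tokenizer.set_target_lang(TARGET_LANG)
    model.load_adapter(TARGET_LANG)

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy CTC decoding: pick the most likely token at each time step.
    ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(ids)
```

In practice the processor and model would be loaded once and reused across segments rather than reloaded on every call; they are loaded inside the function here only to keep the sketch self-contained.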
## Citation
If you use this dataset, please credit:
```
Speedykom - karamojong-speech-dataset
https://huggingface.co/datasets/speedykom-group/karamojong-speech-dataset
Created by Speedykom (https://speedykom.de)
```
---
*Created by [Speedykom](https://speedykom.de)*
Provider: speedykom-group



