fiifinketia/navigation-corpus-dagbani-speech
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/fiifinketia/navigation-corpus-dagbani-speech
Resource description:
---
language:
- dag
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
tags:
- speech
- dag
- ghana
- african-languages
- low-resource
- sentence-splits
- ctc-aligned
- vad-trimmed
pretty_name: Dag Sentence Speech Segments
---
# Dag Speech Segments (sentence splitting)
52799 speech-text pairs split from long recordings.
## Processing pipeline
1. Source audio from `ghananlpcommunity/navigation-corpus-speech-full-dagbani`
2. Full-file CTC forced alignment (MMS-300M) for word-level timestamps
3. Sentence-boundary splits (. ? !); long sentences re-chunked into 16-word segments
4. Leading/trailing silence trimmed with VAD (-40 dBFS threshold)
5. Filtered by duration: min 1.0 s, max 15.0 s
6. Original sample rate preserved
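Steps 3 and 5 above can be illustrated with a minimal sketch. It is not the script used to build this dataset: the `Word` type, the function names, and the output record keys are hypothetical, and it assumes that "re-chunked to 16 words" means a segment is closed as soon as it reaches 16 words. It also filters on alignment timestamps directly, whereas the pipeline applies the duration filter after VAD trimming.

```python
from dataclasses import dataclass

SENTENCE_END = (".", "?", "!")   # sentence boundaries from step 3
MAX_WORDS = 16                   # re-chunk size from step 3
MIN_DUR, MAX_DUR = 1.0, 15.0     # duration bounds in seconds, from step 5


@dataclass
class Word:
    """One aligned word (hypothetical container for CTC alignment output)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def split_segments(words):
    """Close a segment at a sentence boundary or once it holds MAX_WORDS words."""
    segments, current = [], []
    for w in words:
        current.append(w)
        if w.text.endswith(SENTENCE_END) or len(current) == MAX_WORDS:
            segments.append(current)
            current = []
    if current:  # trailing words without final punctuation
        segments.append(current)
    return segments


def keep(segment):
    """Duration filter: keep segments between MIN_DUR and MAX_DUR seconds."""
    duration = segment[-1].end - segment[0].start
    return MIN_DUR <= duration <= MAX_DUR


def to_records(words):
    """Turn aligned words into filtered text/timestamp records."""
    return [
        {
            "text": " ".join(w.text for w in seg),
            "start": seg[0].start,
            "end": seg[-1].end,
        }
        for seg in split_segments(words)
        if keep(seg)
    ]
```

With this assumption, a 20-word sentence yields one 16-word segment and one 4-word remainder; a different re-chunking rule (for example, balanced chunks) would give different splits.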
## Usage
```python
from datasets import load_dataset
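# Repository ID as given in the dataset card; this page lists the dataset
# under fiifinketia/navigation-corpus-dagbani-speech.
ds = load_dataset("ghananlpcommunity/navigation-corpus-dagbani-speech", split="train")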
```
Provided by:
fiifinketia



