
fiifinketia/navigation-corpus-dagbani-speech

Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/fiifinketia/navigation-corpus-dagbani-speech
Dataset description:
---
language:
- dag
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
tags:
- speech
- dag
- ghana
- african-languages
- low-resource
- sentence-splits
- ctc-aligned
- vad-trimmed
pretty_name: Dag Sentence Speech Segments
---

# Dag Speech Segments (sentence splitting)

52799 speech-text pairs split from long recordings.

## Processing pipeline

1. Source audio from `ghananlpcommunity/navigation-corpus-speech-full-dagbani`
2. Full-file CTC forced alignment (MMS-300M) for word-level timestamps
3. Sentence-boundary splits (. ? !); long sentences re-chunked to 16 words
4. Leading/trailing silence trimmed with VAD (-40 dBFS threshold)
5. Filtered: min 1.0s, max 15.0s
6. Original sample rate preserved

## Usage

```python
from datasets import load_dataset

ds = load_dataset("ghananlpcommunity/navigation-corpus-dagbani-speech", split="train")
```
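Steps 3 and 5 of the processing pipeline can be illustrated with a minimal sketch. This is not the dataset author's code: the function names, the `(word, start, end)` tuple layout, and the choice to filter on alignment timestamps (rather than on the VAD-trimmed audio from step 4) are assumptions made for illustration. Only the 16-word limit and the 1.0s to 15.0s duration bounds come from the card.

```python
# Hypothetical sketch of sentence splitting, re-chunking, and duration filtering.
# Input: word-level alignments as (word, start_sec, end_sec) tuples (assumed layout).

MAX_WORDS = 16                 # from the card: long sentences re-chunked to 16 words
MIN_SEC, MAX_SEC = 1.0, 15.0   # from the card: min 1.0s, max 15.0s


def split_sentences(words):
    """Group aligned words into sentences that end in '.', '?' or '!'."""
    sentences, current = [], []
    for item in words:
        current.append(item)
        if item[0].endswith((".", "?", "!")):
            sentences.append(current)
            current = []
    if current:  # keep trailing words that lack final punctuation
        sentences.append(current)
    return sentences


def rechunk(sentence, max_words=MAX_WORDS):
    """Break a sentence into consecutive chunks of at most max_words words."""
    return [sentence[i:i + max_words] for i in range(0, len(sentence), max_words)]


def to_segments(words):
    """Return text segments whose aligned duration falls within the bounds."""
    segments = []
    for sentence in split_sentences(words):
        for chunk in rechunk(sentence):
            start, end = chunk[0][1], chunk[-1][2]
            if MIN_SEC <= end - start <= MAX_SEC:
                text = " ".join(word for word, _, _ in chunk)
                segments.append({"text": text, "start": start, "end": end})
    return segments
```

In the pipeline described above, silence trimming precedes the duration filter, so a real implementation would measure duration on the trimmed audio rather than on the raw word timestamps used here.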

Provider:
fiifinketia