ssxenon01/mn-voice
Source: Hugging Face | Updated: 2026-04-29 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/ssxenon01/mn-voice
Description:
The Mongolian Voice dataset is a merged and cleaned Mongolian automatic speech recognition (ASR) corpus that combines Mozilla Common Voice (mn) and Google FLEURS (mn_mn). All audio is stored as 16 kHz mono PCM-16 WAV bytes, and all transcripts are passed through a canonical normaliser, so that hypotheses from any ASR model can be scored against a single reference text. The dataset is the single source of truth for Mongolian ASR data in the speech-train project. It is released under CC-BY-4.0 and includes rich metadata columns for load-time slicing. It provides three configs: default (merged corpus), cv (Common Voice only), and fleurs (FLEURS only), each with train, validation, and test splits. The dataset is suitable for MMS adapter training, Whisper LoRA, and Parakeet fine-tuning, but not for training modern ASR architectures from scratch.
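The config and split names above can be turned into a small loading sketch. The following is a minimal example, not the project's own tooling: it assumes the Hugging Face `datasets` package is installed and that the repository id matches the page title. The helper names (`load_split`, `describe_wav`, `matches_card_format`, `make_silence`) are hypothetical, and the listing does not name the column that holds the WAV bytes, so the format check works on raw bytes that the caller extracts.

```python
import io
import wave

# Repository id, config names and split names are taken from the listing.
REPO_ID = "ssxenon01/mn-voice"
CONFIGS = ("default", "cv", "fleurs")
SPLITS = ("train", "validation", "test")


def load_split(config="default", split="train"):
    """Load one config/split. Needs the `datasets` package and network access."""
    if config not in CONFIGS or split not in SPLITS:
        raise ValueError(f"unknown config/split: {config}/{split}")
    # Imported lazily so the WAV helpers below work without `datasets`.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split=split)


def describe_wav(wav_bytes):
    """Return (sample_rate, channels, sample_width_bytes, n_frames)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return (wf.getframerate(), wf.getnchannels(),
                wf.getsampwidth(), wf.getnframes())


def matches_card_format(wav_bytes):
    """True if the bytes are 16 kHz, mono, 16-bit PCM, as the description states."""
    rate, channels, width, _ = describe_wav(wav_bytes)
    return rate == 16000 and channels == 1 and width == 2


def make_silence(n_frames=1600, rate=16000):
    """Build an in-memory mono 16-bit WAV of silence, for offline checks only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = PCM-16
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()
```

A call such as `load_split("cv", "validation")` would fetch the Common Voice-only validation split. `matches_card_format` can then be applied to the audio bytes of a row once the relevant column has been identified from the dataset's features.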
Provider:
ssxenon01



