Speech Wikimedia
arXiv, indexed 2025-09-30
Download link:
https://huggingface.co/datasets/MLCommons/speech-wikimedia
Description:
This dataset is a compilation of audio and transcribed text collected from Wikimedia Commons. It covers 1,780 hours of transcribed speech in 77 languages. Each audio file is paired with one or more transcriptions, which may be in different languages, so the dataset is suitable for automatic speech recognition, speech translation, and machine translation. It is released under the CC-BY-SA license and has a total size of 195 GB.
Provider:
Hugging Face



