Speech Wikimedia Dataset

Name: Speech Wikimedia Dataset
Creator: MLCommons
Published: 2023-08-30 10:14:49
License: No description provided

arXiv. Updated 2023-08-30; indexed 2024-06-21.

Download link:

https://huggingface.co/datasets/MLCommons/speech-wi

Description:

The Speech Wikimedia Dataset is a multilingual audio dataset created by MLCommons. It contains 1,780 hours of transcribed speech under a CC-BY-SA license, covering 77 languages. The recordings span a range of scenarios and speakers, and each audio clip is paired with transcriptions in multiple languages, which makes the dataset suitable for training speech recognition, speech translation, and machine translation models. The audio was downloaded from Wikimedia Commons and converted to 16 kHz mono FLAC. The dataset is designed to address the scarcity of multilingual speech data and to carry a license appropriate for both academic and commercial use.
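The conversion step mentioned in the description (16 kHz, mono, FLAC) can be sketched as follows. This is only an illustration: the description does not say which tool was used to build the dataset, so the use of the ffmpeg command-line tool here is an assumption, and the function names and file paths are hypothetical.

```python
from pathlib import Path
import shutil
import subprocess

TARGET_SAMPLE_RATE_HZ = 16_000  # 16 kHz, as stated in the description
TARGET_CHANNELS = 1             # mono, as stated in the description


def build_ffmpeg_command(src, dst_dir):
    """Build the ffmpeg argument list for one file (hypothetical helper).

    The output codec is chosen by ffmpeg from the .flac extension.
    """
    src = Path(src)
    dst = Path(dst_dir) / (src.stem + ".flac")
    return [
        "ffmpeg",
        "-y",                               # overwrite an existing output file
        "-i", str(src),                     # input file
        "-ac", str(TARGET_CHANNELS),        # number of audio channels
        "-ar", str(TARGET_SAMPLE_RATE_HZ),  # audio sample rate in Hz
        str(dst),
    ]


def convert_to_flac(src, dst_dir):
    """Run the conversion; assumes ffmpeg is installed and on PATH."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_ffmpeg_command(src, dst_dir)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

For example, `convert_to_flac("clips/example.ogg", "out")` would write `out/example.flac`, provided the placeholder input file exists and ffmpeg is available.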

Provider:

MLCommons

Created:

2023-08-30
