MLCommons/speech-wikimedia

Name: MLCommons/speech-wikimedia
Creator: MLCommons
Published: 2023-06-29 18:28:23
License: No description available

Hugging Face · Updated 2023-06-29 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/MLCommons/speech-wikimedia

Description:

# Dataset Card for Speech Wikimedia

## Table of Contents

- [Dataset Card for Speech Wikimedia](#dataset-card-for-speech-wikimedia)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Transcription languages](#transcription-languages)
    - [Hours of Audio for each language](#hours-of-audio-for-each-language)
    - [Hours of language pairs for speech translation](#hours-of-language-pairs-for-speech-translation)
  - [Dataset Structure](#dataset-structure)
    - [reformat](#reformat)
    - [transcription and transcription_2](#transcription-and-transcription_2)
    - [real_correspondence.json](#real_correspondencejson)
    - [license.json](#licensejson)
    - [Data License](#data-license)
  - [Dataset Creation](#dataset-creation)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Preprocessing](#preprocessing)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Discussion of Biases](#discussion-of-biases)
  - [Additional Information](#additional-information)
    - [Licensing Information](#licensing-information)

## Dataset Description

- **Point of Contact:** [datasets@mlcommons.org](mailto:datasets@mlcommons.org)

### Dataset Summary

The Speech Wikimedia Dataset is a compilation of audio files with transcriptions extracted from Wikimedia Commons. It is licensed for academic and commercial use under Creative Commons and public domain terms. It includes 2,000+ hours of transcribed speech in different languages with a diverse set of speakers. Each audio file should have one or more transcriptions in different languages.

### Transcription languages

- English
- German
- Dutch
- Arabic
- Hindi
- Portuguese
- Spanish
- Polish
- French
- Russian
- Esperanto
- Swedish
- Korean
- Bengali
- Hungarian
- Oriya
- Thai

### Hours of Audio for each language

We determine the amount of data available for ASR tasks by extracting the total duration of audio for which we also have a transcription in the same language. We present the duration for the top 10 languages other than English, for which we have a total of 1,488 hours of audio:

![Audios](images/ASR_Top10_Non_Eng.jpg)

### Hours of language pairs for speech translation

Our dataset contains some audio files with more than one transcription, each in a different language. In total, we have 628 hours of audio with transcripts in different languages. We present the hours of audio for the 20 most common language pairs:

![Pairs](images/Speech_Translation.jpg)

## Dataset Structure

### audios

Folder with audio files in flac format and sampling_rate=16,000 Hz.

### transcription and transcription2

Folders with transcriptions in srt format. We split these into two directories because Hugging Face does not support more than 10,000 files in a single directory.

### real_correspondence.json

File with the relationship between audio files and transcriptions, as one large JSON dictionary. Each key is the name of an audio file in the "reformat" directory; each value is the list of corresponding transcript files, which sit in either the transcription or the transcription2 directory.

### license.json

File with license information. Each key is the name of the original audio file on Wikimedia Commons.

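The lookup described for real_correspondence.json can be sketched in a few lines of Python. This is a minimal sketch, not the maintainers' tooling: it assumes the listed transcript entries are bare file names (not paths) and that the dataset has been downloaded to a local root folder; the helper names `load_correspondence` and `resolve_transcripts` are hypothetical.

```python
import json
from pathlib import Path


def load_correspondence(root):
    """Read real_correspondence.json from the dataset root (assumed local)."""
    with open(Path(root) / "real_correspondence.json", encoding="utf-8") as f:
        return json.load(f)


def resolve_transcripts(root, correspondence):
    """Map each audio file name to the transcript paths found on disk.

    Each transcript is looked up in "transcription" first, then in
    "transcription2"; names found in neither folder are skipped.
    """
    root = Path(root)
    resolved = {}
    for audio_name, transcript_names in correspondence.items():
        paths = []
        for name in transcript_names:
            for folder in ("transcription", "transcription2"):
                candidate = root / folder / name
                if candidate.exists():
                    paths.append(candidate)
                    break
        resolved[audio_name] = paths
    return resolved
```

Checking both folders for every name avoids having to know in advance which of the two directories a given transcript was placed in.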
### Data License

Here is an excerpt from license.json:

"""
'"Berlin Wall" Speech - President Reagan\'s Address at the Brandenburg Gate - 6-12-87.webm':
{'author': '<td>\n<a class="external text" href="https://www.youtube.com/user/ReaganFoundation" rel="nofollow">ReaganFoundation</a></td>',
'source': '<td>\n<bdo dir="ltr" lang="en"><a href="/wiki/Commons:YouTube_files" [...],
'html_license': '[\'<table class="layouttemplate mw-content-ltr" lang="en" style="width:100%; [...],
'license': 'Public Domain'},
"""

## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

Data was downloaded from https://commons.wikimedia.org/.

#### Preprocessing

As the original format of most of the files was video, we converted them to flac format with samplerate=16_000 using ffmpeg.

### Annotations

#### Annotation process

No manual annotation is done. We download only source audio that already has transcripts. In particular, no "forced alignment" or "segmentation" is done on this dataset.

### Personal and Sensitive Information

Several of our sources are legal and government proceedings, spoken stories, speeches, and so on. Given that these were intended as public documents and licensed as such, it is natural that the individuals involved are aware of this.

## Considerations for Using the Data

### Discussion of Biases

Our data is downloaded from commons.wikimedia.org. As such, the data is biased towards whatever users decide to upload there. The data is also mostly English, though it has potential for multitask learning because several audio files have more than one transcription.

## Additional Information

### Licensing Information

The source data contains data under Public Domain and Creative Commons licenses. We license this dataset under https://creativecommons.org/licenses/by-sa/4.0/. The appropriate attributions are in license.json.
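Because the `author` field in the license.json excerpt holds raw HTML, building an attribution line requires stripping the markup first. The following is a rough sketch using only the Python standard library. It assumes each entry is a dictionary with the `author` and `license` keys shown in the excerpt; the helper names `strip_html` and `attribution`, and the attribution wording itself, are illustrative assumptions rather than a format prescribed by the dataset.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def attribution(original_name, entry):
    """Build a one-line attribution from a license.json entry (assumed shape)."""
    author = strip_html(entry.get("author", "")) or "unknown author"
    license_name = entry.get("license", "unknown license")
    return f"{original_name} by {author} ({license_name})"
```

For the excerpted entry, `strip_html` reduces the `author` cell to the plain text `ReaganFoundation`.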

Provider:

MLCommons

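Summary of Original Information

Dataset card: Speech Wikimedia

Dataset description

Dataset overview

The Speech Wikimedia dataset is a collection of audio files with transcriptions extracted from Wikimedia Commons. It is suitable for academic and commercial use under CC and public domain licenses. The dataset contains more than 2,000 hours of transcribed speech in multiple languages from a diverse set of speakers. Each audio file should have one or more transcriptions in different languages.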

Transcription languages

English
German
Dutch
Arabic
Hindi
Portuguese
Spanish
Polish
French
Russian
Esperanto
Swedish
Korean
Bengali
Hungarian
Oriya
Thai

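Hours of audio for each language

The amount of data available for ASR tasks is determined by extracting the total duration of audio that has a transcription in the same language. Durations are given for the top 10 languages other than English; English has a total of 1,488 hours.

Hours of language pairs for speech translation

The dataset contains some audio files with transcriptions in more than one language. In total, 628 hours of audio have transcripts in different languages. Durations are given for the 20 most common language pairs.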

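Dataset structure

audios

Contains audio files in flac format with a sampling rate of 16,000 Hz.

transcription and transcription2

Contain transcription files in srt format. They are split into two directories because Hugging Face does not support more than 10,000 files in a single directory.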
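Since the transcripts are plain srt files, they can be read without special tooling. The sketch below parses srt text into timed cues; it assumes the files follow the common SubRip layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then text), which the dataset card does not spell out. The names `Cue` and `parse_srt` are hypothetical.

```python
import re
from dataclasses import dataclass

# Timestamp such as 00:01:02,345 (a dot is also accepted as the separator).
_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{1,3})")


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def _seconds(stamp):
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, secs, millis = match.groups()
    return (int(hours) * 3600 + int(minutes) * 60 + int(secs)
            + int(millis.ljust(3, "0")) / 1000)


def parse_srt(content):
    """Split srt text on blank lines and return the cues that have a timing line."""
    cues = []
    blocks = re.split(r"\n\s*\n", content.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue  # skip blocks without a timing line
        start_raw, end_raw = lines[timing].split("-->", 1)
        end_token = end_raw.split()[0]  # ignore anything after the end time
        text = "\n".join(lines[timing + 1:]).strip()
        cues.append(Cue(_seconds(start_raw), _seconds(end_token), text))
    return cues
```

Summing `cue.end - cue.start` over the cues gives a rough estimate of the transcribed speech duration of a file, which can differ from the audio length.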

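real_correspondence.json

A JSON file with the relationship between audio files and transcriptions. Each key is the name of an audio file in the "reformat" directory; each value is the list of corresponding transcript files, located in the transcription or transcription2 directory.

license.json

File with license information. Each key is the name of the original audio file on Wikimedia Commons.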

Data license

The data falls under public domain and Creative Commons licenses. This dataset is licensed under https://creativecommons.org/licenses/by-sa/4.0/, and the appropriate attributions are provided in license.json.

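Dataset creation

Source data

Initial data collection and normalization

Data was downloaded from https://commons.wikimedia.org/.

Preprocessing

Most of the original files were in video format, so they were converted to flac format with a sampling rate of 16,000 Hz.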
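The conversion step can be approximated with a small Python wrapper around the ffmpeg command line. This is a sketch under assumptions: the exact ffmpeg options used for the dataset are not documented, so the flags below (`-vn` to drop video, `-ar` to resample, `-c:a flac` to select the encoder) and the output naming are illustrative, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(source, target_dir, sample_rate=16000):
    """Build an ffmpeg argument list that extracts audio as 16 kHz flac.

    The output name (source stem plus ".flac") is an assumption, not the
    dataset's documented naming scheme.
    """
    source = Path(source)
    target = Path(target_dir) / (source.stem + ".flac")
    return [
        "ffmpeg",
        "-i", str(source),        # input video or audio file
        "-vn",                    # discard any video stream
        "-ar", str(sample_rate),  # resample audio to the target rate
        "-c:a", "flac",           # encode audio with the flac codec
        str(target),
    ]


def convert(source, target_dir):
    """Run the conversion; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_ffmpeg_command(source, target_dir), check=True)
```

Keeping command construction separate from execution makes the arguments easy to inspect or log before any file is touched.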

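Annotations

Annotation process

No manual annotation was done; only source audio that already had transcripts was downloaded. In particular, no "forced alignment" or "segmentation" was performed on this dataset.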

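Personal and sensitive information

The sources include legal and government proceedings, spoken stories, speeches, and so on. Because these were intended as public documents and licensed as such, it is natural that the individuals involved are aware of this.

Considerations for using the data

Discussion of biases

The data is downloaded from commons.wikimedia.org, so it is biased towards whatever users decide to upload there. The data is mostly English, but it has potential for multitask learning because several audio files have transcriptions in more than one language.

Additional information

Licensing information

The source data contains data under public domain and Creative Commons licenses. This dataset is licensed under https://creativecommons.org/licenses/by-sa/4.0/, and the appropriate attributions are provided in license.json.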

The content above was collected and summarized by 遇见数据集.
