# Dataset Card for Speech Wikimedia
## Table of Contents
- [Dataset Card for Speech Wikimedia](#dataset-card-for-speech-wikimedia)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Transcription languages](#transcription-languages)
    - [Hours of Audio for each language](#hours-of-audio-for-each-language)
    - [Hours of language pairs for speech translation](#hours-of-language-pairs-for-speech-translation)
  - [Dataset Structure](#dataset-structure)
    - [audios](#audios)
    - [transcription and transcription2](#transcription-and-transcription2)
    - [real_correspondence.json](#real_correspondencejson)
    - [license.json](#licensejson)
    - [Data License](#data-license)
  - [Dataset Creation](#dataset-creation)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Preprocessing](#preprocessing)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Discussion of Biases](#discussion-of-biases)
  - [Additional Information](#additional-information)
    - [Licensing Information](#licensing-information)
## Dataset Description
- **Point of Contact:** [datasets@mlcommons.org](mailto:datasets@mlcommons.org)
### Dataset Summary
The Speech Wikimedia Dataset is a compilation of audio files with transcriptions extracted from Wikimedia Commons. All content is in the public domain or under Creative Commons licenses that permit academic and commercial use. It includes 2,000+ hours of transcribed speech in different languages with a diverse set of speakers.
Each audio file should have one or more transcriptions, each in a different language.
### Transcription languages
- English
- German
- Dutch
- Arabic
- Hindi
- Portuguese
- Spanish
- Polish
- French
- Russian
- Esperanto
- Swedish
- Korean
- Bengali
- Hungarian
- Oriya
- Thai
### Hours of Audio for each language
We determine the amount of data available for ASR tasks by summing the duration of all audio for which we also have a transcription in the same language. We present the duration for the top 10 languages besides English, for which we have a total of 1488 hours of audio:

### Hours of language pairs for speech translation
Some audio files in the dataset have more than one transcription, each in a different language. In total, we have 628 hours of audio with transcripts in different languages. We present the hours of audio for the 20 most common language pairs:
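
The two totals described in this section could be computed along the following lines. This is an illustrative sketch, not the script used to produce the figures quoted here. The record fields `duration_s`, `spoken_lang`, and `transcript_langs` are hypothetical, since this card does not describe how language metadata is stored. Treating a "language pair" as any two transcript languages attached to the same audio file is also an assumption.

```python
from collections import defaultdict
from itertools import combinations


def hours_by_language(records):
    """Sum hours of audio that has a transcript in the spoken language (ASR data)."""
    totals = defaultdict(float)
    for rec in records:
        if rec["spoken_lang"] in rec["transcript_langs"]:
            totals[rec["spoken_lang"]] += rec["duration_s"] / 3600.0
    return dict(totals)


def hours_by_language_pair(records):
    """Sum hours of audio for every unordered pair of transcript languages."""
    totals = defaultdict(float)
    for rec in records:
        langs = sorted(set(rec["transcript_langs"]))
        for pair in combinations(langs, 2):
            totals[pair] += rec["duration_s"] / 3600.0
    return dict(totals)


# Hypothetical records, for illustration only.
records = [
    {"duration_s": 3600, "spoken_lang": "en", "transcript_langs": ["en", "de"]},
    {"duration_s": 1800, "spoken_lang": "de", "transcript_langs": ["en", "de", "fr"]},
]
asr_hours = hours_by_language(records)
pair_hours = hours_by_language_pair(records)
```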

## Dataset Structure
### audios
Folder with audio files in FLAC format, sampled at 16,000 Hz.
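
To check that a downloaded file matches the stated format, the sample rate and duration can be read straight from the FLAC header without any third-party library. The sketch below assumes the file begins with the `fLaC` marker followed by the mandatory STREAMINFO block, as the FLAC format requires. It does not handle files with a prepended ID3 tag. The function name `read_flac_streaminfo` is made up for this example.

```python
import struct


def read_flac_streaminfo(path):
    """Return sample rate, channels, bit depth and duration from a FLAC header."""
    with open(path, "rb") as f:
        # 4-byte marker + 4-byte metadata block header + 34-byte STREAMINFO body.
        head = f.read(42)
    if len(head) < 42 or head[:4] != b"fLaC":
        raise ValueError(f"{path} does not start with a FLAC stream marker")
    if head[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 10..17 of STREAMINFO pack four fields into 64 bits:
    # 20-bit sample rate, 3-bit (channels - 1), 5-bit (bits per sample - 1),
    # and a 36-bit total sample count.
    (packed,) = struct.unpack(">Q", head[18:26])
    sample_rate = packed >> 44
    total_samples = packed & ((1 << 36) - 1)
    return {
        "sample_rate": sample_rate,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "duration_s": total_samples / sample_rate if sample_rate else 0.0,
    }
```

A file from this dataset would be expected to report `sample_rate == 16000`. The total sample count may be zero if the encoder did not record it, in which case the duration comes back as `0.0`.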
### transcription and transcription2
Folders with transcriptions in SRT format.
The transcriptions are split across two directories because Hugging Face does not support more than 10,000 files in a single directory.
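
Since the transcripts are plain SRT subtitle files, they can be read with a few lines of standard-library Python. The following is a minimal sketch that assumes well-formed cues with `HH:MM:SS,mmm` timestamps. The names `Cue` and `parse_srt` are invented for this example, and real files from Wikimedia Commons may contain irregularities that it does not handle.

```python
import re
from dataclasses import dataclass

_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def _to_seconds(stamp):
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text):
    """Parse SRT text into a list of cues, skipping blocks with no timing line."""
    cues = []
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        idx = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if idx is None:
            continue
        start_raw, end_raw = lines[idx].split("-->", 1)
        # Anything after the end timestamp (e.g. position hints) is ignored.
        end_raw = end_raw.split()[0]
        body = "\n".join(lines[idx + 1:]).strip()
        cues.append(Cue(_to_seconds(start_raw), _to_seconds(end_raw), body))
    return cues
```

Joining the `text` fields of all cues gives a plain transcript. The cue timestamps are the ones stored in the source subtitle file.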
### real_correspondence.json
File with the relationship between audio files and transcriptions, stored as one large JSON dictionary.
Each key is the name of an audio file in the "reformat" directory, and each value is the list of corresponding transcript files, which sit in either the transcription or transcription2 directory.
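
Because a transcript can live in either directory, a lookup has to try both. Here is one way to do that, as a sketch under two assumptions: that the values in real_correspondence.json are bare file names rather than relative paths, and that the JSON file sits at the dataset root. The helper names are hypothetical.

```python
import json
from pathlib import Path


def load_correspondence(root):
    """Load the audio-to-transcripts mapping from the dataset root."""
    with open(Path(root) / "real_correspondence.json", encoding="utf-8") as f:
        return json.load(f)


def resolve_transcripts(root, audio_name, mapping,
                        transcript_dirs=("transcription", "transcription2")):
    """Return existing transcript paths for one audio file, in mapping order."""
    root = Path(root)
    found = []
    for name in mapping.get(audio_name, []):
        for directory in transcript_dirs:
            candidate = root / directory / name
            if candidate.is_file():
                found.append(candidate)
                break  # stop at the first directory that contains the file
    return found
```

Transcripts listed in the mapping but missing on disk are skipped silently. A stricter variant could raise an error instead.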
### license.json
File with license information. The key is the name of the original audio file on Wikimedia Commons.
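
Judging from the excerpt shown under Data License, each entry carries `author`, `source`, `html_license`, and `license` fields, and the `author` field holds raw HTML. The sketch below pulls a plain-text attribution out of one entry using only the standard library. It assumes every entry has the same shape as that excerpt, which this card does not guarantee. The helper names are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def attribution(entry):
    """Build a plain-text attribution record from one license.json entry."""
    return {
        "author": strip_html(entry.get("author", "")),
        "license": entry.get("license", ""),
    }


# Entry shaped like the excerpt in this card (fields abbreviated).
example = {
    "author": '<td>\n<a class="external text" '
              'href="https://www.youtube.com/user/ReaganFoundation" '
              'rel="nofollow">ReaganFoundation</a></td>',
    "license": "Public Domain",
}
result = attribution(example)
```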
### Data License
Here is an excerpt from license.json:
"""
'"Berlin Wall" Speech - President Reagan\'s Address at the Brandenburg Gate - 6-12-87.webm': {'author': '<td>\n<a class="external text" href="https://www.youtube.com/user/ReaganFoundation" rel="nofollow">ReaganFoundation</a></td>',
'source': '<td>\n<bdo dir="ltr" lang="en"><a href="/wiki/Commons:YouTube_files" [...],
'html_license': '[\'<table class="layouttemplate mw-content-ltr" lang="en" style="width:100%; [...],
'license': 'Public Domain'},
"""
## Dataset Creation
### Source Data
#### Initial Data Collection and Normalization
Data was downloaded from https://commons.wikimedia.org/.
#### Preprocessing
Because most of the original files were video, we converted them to FLAC audio with a 16,000 Hz sample rate using ffmpeg.
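
A conversion of this kind might look like the sketch below. The exact ffmpeg options used to build the dataset are not stated in this card, so the command here is an assumption: `-vn` drops the video stream, `-ar` sets the output sample rate, and ffmpeg infers the FLAC encoder from the `.flac` extension. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Build an ffmpeg argument list that extracts audio at the given sample rate."""
    return ["ffmpeg", "-i", str(src), "-vn", "-ar", str(sample_rate), str(dst)]


def convert_to_flac(src, out_dir, sample_rate=16000):
    """Convert one media file to FLAC; requires ffmpeg to be installed."""
    dst = Path(out_dir) / (Path(src).stem + ".flac")
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
    return dst
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces or quotation marks, which are common on Wikimedia Commons.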
### Annotations
#### Annotation process
No manual annotation is done. We download only source audio with already existing transcripts.
In particular, there is no "forced alignment" or "segmentation" done on this dataset.
### Personal and Sensitive Information
Several of our sources are legal and government proceedings, spoken stories, speeches, and so on. Given that these were intended as public documents and licensed as such, it is natural that the involved individuals are aware of this.
## Considerations for Using the Data
### Discussion of Biases
Our data is downloaded from commons.wikimedia.org. As such, the data is biased towards whatever users decide to upload there.
The data is also mostly English, though this has potential for multitask learning because several audio files have more than one transcription.
## Additional Information
### Licensing Information
The source data is in the public domain or under Creative Commons licenses.
We license this dataset under CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/).
The appropriate attributions are in license.json.