open-asr-leaderboard/datasets
Source: Hugging Face. Updated 2023-08-08; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/open-asr-leaderboard/datasets
Resource description:
---
annotations_creators:
- expert-generated
- crowdsourced
- machine-generated
language:
- en
language_creators:
- crowdsourced
- expert-generated
license:
- cc-by-4.0
- apache-2.0
- cc0-1.0
- cc-by-nc-3.0
- other
multilinguality:
- monolingual
pretty_name: datasets
size_categories:
- 100K<n<1M
- 1M<n<10M
source_datasets:
- original
- extended|librispeech_asr
- extended|common_voice
tags:
- asr
- benchmark
- speech
- esb
task_categories:
- automatic-speech-recognition
extra_gated_prompt: |-
Three of the ESB datasets have specific terms of usage that must be agreed to before using the data.
To do so, fill in the access forms on the specific datasets' pages:
* Common Voice: https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0
* GigaSpeech: https://huggingface.co/datasets/speechcolab/gigaspeech
* SPGISpeech: https://huggingface.co/datasets/kensho/spgispeech
extra_gated_fields:
I hereby confirm that I have registered on the original Common Voice page and agree to not attempt to determine the identity of speakers in the Common Voice dataset: checkbox
I hereby confirm that I have accepted the terms of usages on GigaSpeech page: checkbox
I hereby confirm that I have accepted the terms of usages on SPGISpeech page: checkbox
---
All eight datasets in ESB can be downloaded and prepared in a single line of code through the Hugging Face Datasets library:
```python
from datasets import load_dataset
librispeech = load_dataset("esb/datasets", "librispeech", split="train")
```
- `"esb/datasets"`: the repository namespace. This is fixed for all ESB datasets.
- `"librispeech"`: the dataset name. This can be changed to any one of the eight datasets in ESB to download that dataset.
- `split="train"`: the split. Set this to one of train/validation/test to generate a specific split. Omit the `split` argument to generate all splits for a dataset.
The datasets are fully prepared, such that the audio and transcription files can be used directly in training/evaluation scripts.
## Dataset Information
A data point can be accessed by indexing the dataset object loaded through `load_dataset`:
```python
print(librispeech[0])
```
A typical data point comprises the path to the audio file and its transcription. It also records the name of the dataset from which the sample derives and a unique identifier:
```python
{
'dataset': 'librispeech',
'audio': {'path': '/home/sanchit-gandhi/.cache/huggingface/datasets/downloads/extracted/d2da1969fe9e7d06661b5dc370cf2e3c119a14c35950045bcb76243b264e4f01/374-180298-0000.flac',
'array': array([ 7.01904297e-04, 7.32421875e-04, 7.32421875e-04, ...,
-2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
'sampling_rate': 16000},
'text': 'chapter sixteen i might have told you of the beginning of this liaison in a few lines but i wanted you to see every step by which we came i to agree to whatever marguerite wished',
'id': '374-180298-0000'
}
```
### Data Fields
- `dataset`: name of the ESB dataset from which the sample is taken.
- `audio`: a dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
- `text`: the transcription of the audio file.
- `id`: unique id of the data sample.
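As a minimal sketch of how these fields fit together, the snippet below checks that a sample carries the four documented fields and derives the clip duration from the decoded array and the sampling rate. The `describe_sample` helper and the synthetic sample are hypothetical and only mirror the structure shown above; they are not part of ESB or the Datasets library.

```python
# Fields documented for every ESB data point.
REQUIRED_FIELDS = {"dataset", "audio", "text", "id"}


def describe_sample(sample):
    """Summarise one data point (hypothetical helper, not part of ESB)."""
    missing = REQUIRED_FIELDS - set(sample)
    if missing:
        raise KeyError(f"sample is missing fields: {sorted(missing)}")
    audio = sample["audio"]
    # Duration in seconds = number of decoded samples / samples per second.
    duration_s = len(audio["array"]) / audio["sampling_rate"]
    return {
        "id": sample["id"],
        "dataset": sample["dataset"],
        "duration_s": duration_s,
        "num_words": len(sample["text"].split()),
    }


# Synthetic stand-in for a real data point: 32,000 zero samples at 16 kHz.
sample = {
    "dataset": "librispeech",
    "audio": {"path": "example.flac", "array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "chapter sixteen",
    "id": "example-0000",
}
summary = describe_sample(sample)
print(summary)
```

With a real dataset, the same helper could be applied to `librispeech[0]`, assuming the sample has the structure printed earlier.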
### Data Preparation
#### Audio
The audio for all ESB datasets is segmented into sample lengths suitable for training ASR systems. The Hugging Face Datasets library decodes audio files on the fly, reading the segments and converting them to Python arrays. Consequently, no further preparation of the audio is required before using it in training/evaluation scripts.
Note that accessing the audio column as `dataset[0]["audio"]` automatically decodes the audio file and resamples it to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the `"audio"` column: `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`.
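The cost difference between the two access orders can be illustrated with a toy model. The `ToyAudioDataset` class below is a hypothetical stand-in that counts decode calls; it is a sketch of the behaviour described above, not the Datasets library's implementation.

```python
class ToyAudioDataset:
    """Hypothetical dataset whose audio is decoded lazily, on access."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # how many files have been decoded so far

    def _decode(self, path):
        # Stand-in for reading, decoding, and resampling one audio file.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested sample.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: decode every sample in the dataset.
            return [self._decode(path) for path in self.paths]
        raise KeyError(key)


ds = ToyAudioDataset(f"clip_{i}.flac" for i in range(1000))

_ = ds[0]["audio"]  # index first, then column
row_first_calls = ds.decode_calls

ds.decode_calls = 0
_ = ds["audio"][0]  # column first, then index
column_first_calls = ds.decode_calls

print(row_first_calls, column_first_calls)
```

In this toy model, the row-first order decodes a single file, whereas the column-first order decodes all 1,000 files before the first one is selected.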
#### Transcriptions
The transcriptions corresponding to each audio file are provided in their 'error corrected' format. No transcription pre-processing is applied to the text beyond the necessary 'error correction' steps, such as removing junk tokens (_<unk>_) or converting punctuation tokens to their symbols (_<comma>_ to _,_). As such, no further preparation of the transcriptions is required before using them in training/evaluation scripts.
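A minimal sketch of what such an error-correction step might look like is given below. It covers only the two examples named above (`<unk>` and `<comma>`); the `correct_transcription` function and the token tables are assumptions made for illustration, and the full set of rules applied when preparing ESB is not given here.

```python
import re

# Assumed token tables: only the two examples from the text are included.
JUNK_TOKENS = ("<unk>",)
PUNCTUATION_TOKENS = {"<comma>": ","}


def correct_transcription(text):
    """Apply illustrative error-correction steps to one transcription."""
    # Remove junk tokens, leaving a space so neighbouring words stay apart.
    for token in JUNK_TOKENS:
        text = text.replace(token, " ")
    # Replace punctuation tokens with their symbols.
    for token, symbol in PUNCTUATION_TOKENS.items():
        text = text.replace(token, symbol)
    # Attach punctuation symbols to the preceding word.
    symbols = re.escape("".join(PUNCTUATION_TOKENS.values()))
    text = re.sub(rf"\s+([{symbols}])", r"\1", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


corrected = correct_transcription("hello <comma> world <unk> again")
print(corrected)
```

Because the ESB transcriptions are already provided in error-corrected form, a step like this would only be relevant when preparing other data in a matching style.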
Transcriptions are provided for training and validation splits. The transcriptions are **not** provided for the test splits. ESB requires you to generate predictions for the test sets and upload them to https://huggingface.co/spaces/esb/leaderboard for scoring.
### Access
All eight datasets in ESB are openly accessible, and their licensing terms are publicly available. Three of the ESB datasets have specific terms of usage that must be agreed to before using the data. To do so, fill in the access forms on the specific datasets' pages:
* Common Voice: https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0
* GigaSpeech: https://huggingface.co/datasets/speechcolab/gigaspeech
* SPGISpeech: https://huggingface.co/datasets/kensho/spgispeech
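Once the forms are accepted, the gated datasets are loaded with `use_auth_token=True`, as in the per-dataset examples in this card. The sketch below gathers that rule into one place: `esb_load_kwargs` and `load_esb` are hypothetical helpers, and the argument names simply mirror those used in this card. Recent releases of the Datasets library accept `token` in place of `use_auth_token`, so the keyword may need adjusting for the installed version.

```python
ESB_REPO = "esb/datasets"
ESB_CONFIGS = (
    "librispeech", "common_voice", "voxpopuli", "tedlium",
    "gigaspeech", "spgispeech", "earnings22", "ami",
)
# The three datasets whose terms of usage must be accepted first.
GATED_CONFIGS = frozenset({"common_voice", "gigaspeech", "spgispeech"})


def esb_load_kwargs(name, split=None, subconfig=None):
    """Build keyword arguments for load_dataset (hypothetical helper)."""
    if name not in ESB_CONFIGS:
        raise ValueError(f"unknown ESB dataset: {name!r}")
    kwargs = {"path": ESB_REPO, "name": name}
    if split is not None:
        kwargs["split"] = split
    if subconfig is not None:
        kwargs["subconfig"] = subconfig
    if name in GATED_CONFIGS:
        # Gated datasets need an authenticated request.
        kwargs["use_auth_token"] = True
    return kwargs


def load_esb(name, **options):
    """Load one ESB dataset; needs the datasets package and network access."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(**esb_load_kwargs(name, **options))


print(esb_load_kwargs("gigaspeech", split="train", subconfig="xs"))
```

Calling `load_esb("common_voice", split="validation")` would then request the gated dataset with authentication, assuming a valid Hugging Face login is configured locally.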
### Diagnostic Dataset
ESB contains a small, 8h diagnostic dataset of in-domain validation data with newly annotated transcriptions. The audio data is sampled from each of the ESB validation sets, giving a range of different domains and speaking styles. The transcriptions are annotated according to a consistent style guide with two formats: normalised and un-normalised. The dataset is structured in the same way as the ESB dataset, by grouping audio-transcription samples according to the dataset from which they were taken. We encourage participants to use this dataset when evaluating their systems to quickly assess performance on a range of different speech recognition conditions. For more information, visit: [esb/diagnostic-dataset](https://huggingface.co/datasets/esb/diagnostic-dataset).
## Summary of ESB Datasets
| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
|--------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| LibriSpeech | Audiobook | Narrated | 960 | 11 | 11 | Normalised | CC-BY-4.0 |
| Common Voice | Wikipedia | Narrated | 1409 | 27 | 27 | Punctuated & Cased | CC0-1.0 |
| VoxPopuli    | European Parliament         | Oratory               | 523       | 5       | 5        | Punctuated         | CC0             |
| TED-LIUM | TED talks | Oratory | 454 | 2 | 3 | Normalised | CC-BY-NC-ND 3.0 |
| GigaSpeech | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | apache-2.0 |
| SPGISpeech   | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
| Earnings-22  | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
| AMI | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |
## LibriSpeech
The LibriSpeech corpus is a standard large-scale corpus for assessing ASR systems. It consists of approximately 1,000 hours of narrated audiobooks from the [LibriVox](https://librivox.org) project. It is licensed under CC-BY-4.0.
Example usage:
```python
librispeech = load_dataset("esb/datasets", "librispeech")
```
Training/validation splits:
- `train` (combination of `train.clean.100`, `train.clean.360` and `train.other.500`)
- `validation.clean`
- `validation.other`
Test splits:
- `test.clean`
- `test.other`
Also available are subsets of the train split, which can be accessed by setting the `subconfig` argument:
```python
librispeech = load_dataset("esb/datasets", "librispeech", subconfig="clean.100")
```
- `clean.100`: 100 hours of training data from the 'clean' subset
- `clean.360`: 360 hours of training data from the 'clean' subset
- `other.500`: 500 hours of training data from the 'other' subset
## Common Voice
Common Voice is a series of crowd-sourced open-licensed speech datasets where speakers record text from Wikipedia in various languages. The speakers are of various nationalities and native languages, with different accents and recording conditions. We use the English subset of version 9.0 (27-4-2022), with approximately 1,400 hours of audio-transcription data. It is licensed under CC0-1.0.
Example usage:
```python
common_voice = load_dataset("esb/datasets", "common_voice", use_auth_token=True)
```
Training/validation splits:
- `train`
- `validation`
Test splits:
- `test`
## VoxPopuli
VoxPopuli is a large-scale multilingual speech corpus consisting of political data sourced from 2009-2020 European Parliament event recordings. The English subset contains approximately 550 hours of speech largely from non-native English speakers. It is licensed under CC0.
Example usage:
```python
voxpopuli = load_dataset("esb/datasets", "voxpopuli")
```
Training/validation splits:
- `train`
- `validation`
Test splits:
- `test`
## TED-LIUM
TED-LIUM consists of English-language TED Talk conference videos covering a range of different cultural, political, and academic topics. It contains approximately 450 hours of transcribed speech data. It is licensed under CC-BY-NC-ND 3.0.
Example usage:
```python
tedlium = load_dataset("esb/datasets", "tedlium")
```
Training/validation splits:
- `train`
- `validation`
Test splits:
- `test`
## GigaSpeech
GigaSpeech is a multi-domain English speech recognition corpus created from audiobooks, podcasts and YouTube. We provide the large train set (2,500 hours) and the standard validation and test splits. It is licensed under apache-2.0.
Example usage:
```python
gigaspeech = load_dataset("esb/datasets", "gigaspeech", use_auth_token=True)
```
Training/validation splits:
- `train` (`l` subset of training data (2,500 h))
- `validation`
Test splits:
- `test`
Also available are subsets of the train split, which can be accessed by setting the `subconfig` argument:
```python
gigaspeech = load_dataset("esb/datasets", "gigaspeech", subconfig="xs", use_auth_token=True)
```
- `xs`: extra-small subset of training data (10 h)
- `s`: small subset of training data (250 h)
- `m`: medium subset of training data (1,000 h)
- `xl`: extra-large subset of training data (10,000 h)
## SPGISpeech
SPGISpeech consists of company earnings calls that have been manually transcribed by S&P Global, Inc. according to a professional style guide. We provide the large train set (5,000 hours) and the standard validation and test splits. It is licensed under a Kensho user agreement.
Loading the dataset requires authorization.
Example usage:
```python
spgispeech = load_dataset("esb/datasets", "spgispeech", use_auth_token=True)
```
Training/validation splits:
- `train` (`l` subset of training data (~5,000 h))
- `validation`
Test splits:
- `test`
Also available are subsets of the train split, which can be accessed by setting the `subconfig` argument:
```python
spgispeech = load_dataset("esb/datasets", "spgispeech", subconfig="s", use_auth_token=True)
```
- `s`: small subset of training data (~200 h)
- `m`: medium subset of training data (~1,000 h)
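The `subconfig` values listed in the LibriSpeech, GigaSpeech, and SPGISpeech sections can be collected into a small lookup for checking a value before a download is started. This is only a sketch: `SUBCONFIG_HOURS` and `check_subconfig` are hypothetical names, and the hours are the figures quoted in this card (approximate for SPGISpeech).

```python
# Training subsets and their sizes in hours, as listed in this card.
SUBCONFIG_HOURS = {
    "librispeech": {"clean.100": 100, "clean.360": 360, "other.500": 500},
    "gigaspeech": {"xs": 10, "s": 250, "m": 1000, "xl": 10000},
    "spgispeech": {"s": 200, "m": 1000},  # approximate figures
}


def check_subconfig(name, subconfig):
    """Return the subset size in hours, or raise if it is not listed."""
    options = SUBCONFIG_HOURS.get(name, {})
    if subconfig not in options:
        raise ValueError(
            f"{name!r} has no listed subconfig {subconfig!r}; "
            f"listed values: {sorted(options)}"
        )
    return options[subconfig]


hours = check_subconfig("gigaspeech", "xs")
librispeech_total = sum(SUBCONFIG_HOURS["librispeech"].values())
print(hours, librispeech_total)
```

The three LibriSpeech subsets add up to the 960 training hours given in the summary table, which matches the description of `train` as their combination.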
## Earnings-22
Earnings-22 is a 119-hour corpus of English-language earnings calls collected from global companies, with speakers of many different nationalities and accents. It is licensed under CC-BY-SA-4.0.
Example usage:
```python
earnings22 = load_dataset("esb/datasets", "earnings22")
```
Training/validation splits:
- `train`
- `validation`
Test splits:
- `test`
## AMI
The AMI Meeting Corpus consists of 100 hours of meeting recordings from multiple recording devices synced to a common timeline. It is licensed under CC-BY-4.0.
Example usage:
```python
ami = load_dataset("esb/datasets", "ami")
```
Training/validation splits:
- `train`
- `validation`
Test splits:
- `test`
Provider:
open-asr-leaderboard
Summary of Original Information
Dataset Overview
Basic Information
- Name: datasets
- Language: English (en)
- Language creators: crowdsourced and expert-generated
- License: cc-by-4.0, apache-2.0, cc0-1.0, cc-by-nc-3.0, other
- Multilinguality: monolingual
- Size: 100K<n<1M and 1M<n<10M
- Source datasets: original; extended from librispeech_asr and common_voice
- Tags: asr, benchmark, speech, esb
- Task category: automatic-speech-recognition
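Dataset Contents
- Data fields:
  - `dataset`: name of the ESB dataset from which the sample is taken.
  - `audio`: the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  - `text`: the transcription of the audio file.
  - `id`: unique identifier of the data sample.
Data Preparation
- Audio: the audio for all ESB datasets is segmented into sample lengths suitable for training ASR systems. Audio files are decoded on the fly and converted to Python arrays, so they can be used in training/evaluation scripts without further preparation.
- Transcriptions: the transcriptions are provided in 'error corrected' format and can be used in training/evaluation scripts without further pre-processing. Transcriptions are provided for the training and validation splits, but not for the test splits.
Access and Usage
- All eight ESB datasets are freely accessible, but three of them (Common Voice, GigaSpeech, SPGISpeech) have specific terms of usage that must be agreed to by filling in the access form on the dataset's page before use.
Diagnostic Dataset
- ESB includes an 8-hour diagnostic dataset of validation data, with audio sampled from the ESB validation sets across different domains and speaking styles. The transcriptions follow a consistent style guide and are provided in two formats: normalised and un-normalised.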
Dataset Details
| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
|---|---|---|---|---|---|---|---|
| LibriSpeech | Audiobook | Narrated | 960 | 11 | 11 | Normalised | CC-BY-4.0 |
| Common Voice | Wikipedia | Narrated | 1409 | 27 | 27 | Punctuated & Cased | CC0-1.0 |
| VoxPopuli | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 |
| TED-LIUM | TED talks | Oratory | 454 | 2 | 3 | Normalised | CC-BY-NC-ND 3.0 |
| GigaSpeech | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | apache-2.0 |
| SPGISpeech | Financial meetings | Oratory, spontaneous | 4900 | 100 | 100 | Punctuated & Cased | User Agreement |
| Earnings-22 | Financial meetings | Oratory, spontaneous | 105 | 5 | 5 | Punctuated & Cased | CC-BY-SA-4.0 |
| AMI | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |
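Dataset Loading Examples
- Load a dataset with the `load_dataset` function, for example: `librispeech = load_dataset("esb/datasets", "librispeech")`
- Choose a different dataset name and split (train/validation/test) as needed.
Dataset Subsets
- Some datasets, such as `librispeech` and `gigaspeech`, provide subset configurations, which can be accessed by setting the `subconfig` argument.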
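Collected Summary
Background and Challenges
Background Overview
This dataset is the collection of automatic speech recognition (ASR) benchmark datasets from the ESB (End-to-end Speech Benchmark) project. It contains eight sub-datasets (such as LibriSpeech and Common Voice) covering multiple domains and speaking styles, with training sets ranging from 78 hours to roughly 5,000 hours each, released under a variety of open licenses. The data is pre-processed into a format suitable for ASR training and evaluation and can be downloaded conveniently through the Hugging Face Datasets library, although some sub-datasets require additional access authorization.
The content above was collected and summarized by 遇见数据集.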



