
open-asr-leaderboard/datasets

Hugging Face · Updated 2023-08-08 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-asr-leaderboard/datasets
Resource description:
---
annotations_creators:
- expert-generated
- crowdsourced
- machine-generated
language:
- en
language_creators:
- crowdsourced
- expert-generated
license:
- cc-by-4.0
- apache-2.0
- cc0-1.0
- cc-by-nc-3.0
- other
multilinguality:
- monolingual
pretty_name: datasets
size_categories:
- 100K<n<1M
- 1M<n<10M
source_datasets:
- original
- extended|librispeech_asr
- extended|common_voice
tags:
- asr
- benchmark
- speech
- esb
task_categories:
- automatic-speech-recognition
extra_gated_prompt: |-
  Three of the ESB datasets have specific terms of usage that must be agreed to before using the data.
  To do so, fill in the access forms on the specific datasets' pages:
    * Common Voice: https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0
    * GigaSpeech: https://huggingface.co/datasets/speechcolab/gigaspeech
    * SPGISpeech: https://huggingface.co/datasets/kensho/spgispeech
extra_gated_fields:
  I hereby confirm that I have registered on the original Common Voice page and agree to not attempt to determine the identity of speakers in the Common Voice dataset: checkbox
  I hereby confirm that I have accepted the terms of usages on GigaSpeech page: checkbox
  I hereby confirm that I have accepted the terms of usages on SPGISpeech page: checkbox
---

All eight datasets in ESB can be downloaded and prepared in a single line of code through the Hugging Face Datasets library:

```python
from datasets import load_dataset

librispeech = load_dataset("esb/datasets", "librispeech", split="train")
```

- `"esb/datasets"`: the repository namespace. This is fixed for all ESB datasets.
- `"librispeech"`: the dataset name. This can be changed to any one of the eight datasets in ESB to download that dataset.
- `split="train"`: the split. Set this to one of train/validation/test to generate a specific split. Omit the `split` argument to generate all splits for a dataset.
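The three arguments described above can be assembled programmatically. The following is a minimal sketch, not part of the ESB tooling: `esb_load_args` and `load_esb` are hypothetical helper names, and the configuration names are the ones that appear in the usage examples of this card. `split` is passed through unchanged, since some datasets expose dotted split names such as `validation.clean`.

```python
# Configuration names taken from the usage examples in this card.
ESB_CONFIGS = (
    "librispeech", "common_voice", "voxpopuli", "tedlium",
    "gigaspeech", "spgispeech", "earnings22", "ami",
)


def esb_load_args(name, split=None):
    """Return (args, kwargs) for load_dataset. Hypothetical helper."""
    if name not in ESB_CONFIGS:
        raise ValueError(f"unknown ESB dataset {name!r}; expected one of {ESB_CONFIGS}")
    kwargs = {}
    if split is not None:
        # Leaving `split` out makes load_dataset generate every split.
        kwargs["split"] = split
    return ("esb/datasets", name), kwargs


def load_esb(name, split=None):
    """Thin wrapper around load_dataset; downloads data when called."""
    # Imported lazily so the argument helper works without the library installed.
    from datasets import load_dataset

    args, kwargs = esb_load_args(name, split)
    return load_dataset(*args, **kwargs)
```

Calling `load_esb("librispeech", split="train")` would then be equivalent to the one-line example above, while `load_esb("librispeech")` would request all splits.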
The datasets are fully prepared, such that the audio and transcription files can be used directly in training/evaluation scripts.

## Dataset Information

A data point can be accessed by indexing the dataset object loaded through `load_dataset`:

```python
print(librispeech[0])
```

A typical data point comprises the path to the audio file and its transcription. Also included are the name of the dataset from which the sample derives and a unique identifier:

```python
{
  'dataset': 'librispeech',
  'audio': {'path': '/home/sanchit-gandhi/.cache/huggingface/datasets/downloads/extracted/d2da1969fe9e7d06661b5dc370cf2e3c119a14c35950045bcb76243b264e4f01/374-180298-0000.flac',
       'array': array([ 7.01904297e-04,  7.32421875e-04,  7.32421875e-04, ...,
       -2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
       'sampling_rate': 16000},
  'text': 'chapter sixteen i might have told you of the beginning of this liaison in a few lines but i wanted you to see every step by which we came i to agree to whatever marguerite wished',
  'id': '374-180298-0000'
}
```

### Data Fields

- `dataset`: name of the ESB dataset from which the sample is taken.
- `audio`: a dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
- `text`: the transcription of the audio file.
- `id`: unique id of the data sample.

### Data Preparation

#### Audio

The audio for all ESB datasets is segmented into sample lengths suitable for training ASR systems. The Hugging Face datasets library decodes audio files on the fly, reading the segments and converting them to Python arrays. Consequently, no further preparation of the audio is required before it is used in training/evaluation scripts.

Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time.
Thus it is important to query the sample index before the `"audio"` column, i.e. `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`.

#### Transcriptions

The transcriptions corresponding to each audio file are provided in their 'error corrected' format. No transcription pre-processing is applied to the text, only necessary 'error correction' steps such as removing junk tokens (`<unk>`) or converting spelled-out punctuation tokens to their symbolic form (`<comma>` to `,`). As such, no further preparation of the transcriptions is required before they are used in training/evaluation scripts.

Transcriptions are provided for the training and validation splits. The transcriptions are **not** provided for the test splits. ESB requires you to generate predictions for the test sets and upload them to https://huggingface.co/spaces/esb/leaderboard for scoring.

### Access

All eight datasets in ESB are accessible, and their licenses are freely available. Three of the ESB datasets have specific terms of usage that must be agreed to before using the data. To do so, fill in the access forms on the specific datasets' pages:

* Common Voice: https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0
* GigaSpeech: https://huggingface.co/datasets/speechcolab/gigaspeech
* SPGISpeech: https://huggingface.co/datasets/kensho/spgispeech

### Diagnostic Dataset

ESB contains a small, 8h diagnostic dataset of in-domain validation data with newly annotated transcriptions. The audio data is sampled from each of the ESB validation sets, giving a range of different domains and speaking styles. The transcriptions are annotated according to a consistent style guide with two formats: normalised and un-normalised. The dataset is structured in the same way as the ESB dataset, by grouping audio-transcription samples according to the dataset from which they were taken.
We encourage participants to use this dataset when evaluating their systems to quickly assess performance on a range of different speech recognition conditions. For more information, visit: [esb/diagnostic-dataset](https://huggingface.co/datasets/esb/diagnostic-dataset).

## Summary of ESB Datasets

| Dataset      | Domain                      | Speaking Style        | Train (h) | Dev (h) | Test (h) | Transcriptions     | License         |
|--------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| LibriSpeech  | Audiobook                   | Narrated              | 960       | 11      | 11       | Normalised         | CC-BY-4.0       |
| Common Voice | Wikipedia                   | Narrated              | 1409      | 27      | 27       | Punctuated & Cased | CC0-1.0         |
| VoxPopuli    | European Parliament         | Oratory               | 523       | 5       | 5        | Punctuated         | CC0             |
| TED-LIUM     | TED talks                   | Oratory               | 454       | 2       | 3        | Normalised         | CC-BY-NC-ND 3.0 |
| GigaSpeech   | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500      | 12      | 40       | Punctuated         | apache-2.0      |
| SPGISpeech   | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
| Earnings-22  | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
| AMI          | Meetings                    | Spontaneous           | 78        | 9       | 9        | Punctuated & Cased | CC-BY-4.0       |

## LibriSpeech

The LibriSpeech corpus is a standard large-scale corpus for assessing ASR systems. It consists of approximately 1,000 hours of narrated audiobooks from the [LibriVox](https://librivox.org) project. It is licensed under CC-BY-4.0.

Example usage:

```python
librispeech = load_dataset("esb/datasets", "librispeech")
```

Training/validation splits:
- `train` (combination of `train.clean.100`, `train.clean.360` and `train.other.500`)
- `validation.clean`
- `validation.other`

Test splits:
- `test.clean`
- `test.other`

Also available are subsets of the train split, which can be accessed by setting the `subconfig` argument:

```python
librispeech = load_dataset("esb/datasets", "librispeech", subconfig="clean.100")
```

- `clean.100`: 100 hours of training data from the 'clean' subset
- `clean.360`: 360 hours of training data from the 'clean' subset
- `other.500`: 500 hours of training data from the 'other' subset

## Common Voice

Common Voice is a series of crowd-sourced open-licensed speech datasets where speakers record text from Wikipedia in various languages. The speakers are of various nationalities and native languages, with different accents and recording conditions. We use the English subset of version 9.0 (27-4-2022), with approximately 1,400 hours of audio-transcription data. It is licensed under CC0-1.0.

Example usage:

```python
common_voice = load_dataset("esb/datasets", "common_voice", use_auth_token=True)
```

Training/validation splits:
- `train`
- `validation`

Test splits:
- `test`

## VoxPopuli

VoxPopuli is a large-scale multilingual speech corpus consisting of political data sourced from 2009-2020 European Parliament event recordings. The English subset contains approximately 550 hours of speech largely from non-native English speakers. It is licensed under CC0.

Example usage:

```python
voxpopuli = load_dataset("esb/datasets", "voxpopuli")
```

Training/validation splits:
- `train`
- `validation`

Test splits:
- `test`

## TED-LIUM

TED-LIUM consists of English-language TED Talk conference videos covering a range of different cultural, political, and academic topics. It contains approximately 450 hours of transcribed speech data. It is licensed under CC-BY-NC-ND 3.0.

Example usage:

```python
tedlium = load_dataset("esb/datasets", "tedlium")
```

Training/validation splits:
- `train`
- `validation`

Test splits:
- `test`

## GigaSpeech

GigaSpeech is a multi-domain English speech recognition corpus created from audiobooks, podcasts and YouTube. We provide the large train set (2,500 hours) and the standard validation and test splits. It is licensed under apache-2.0.

Example usage:

```python
gigaspeech = load_dataset("esb/datasets", "gigaspeech", use_auth_token=True)
```

Training/validation splits:
- `train` (`l` subset of training data (2,500 h))
- `validation`

Test splits:
- `test`

Also available are subsets of the train split, which can be accessed by setting the `subconfig` argument:

```python
gigaspeech = load_dataset("esb/datasets", "gigaspeech", subconfig="xs", use_auth_token=True)
```

- `xs`: extra-small subset of training data (10 h)
- `s`: small subset of training data (250 h)
- `m`: medium subset of training data (1,000 h)
- `xl`: extra-large subset of training data (10,000 h)

## SPGISpeech

SPGISpeech consists of company earnings calls that have been manually transcribed by S&P Global, Inc according to a professional style guide. We provide the large train set (5,000 hours) and the standard validation and test splits. It is licensed under a Kensho user agreement.

Loading the dataset requires authorization.

Example usage:

```python
spgispeech = load_dataset("esb/datasets", "spgispeech", use_auth_token=True)
```

Training/validation splits:
- `train` (`l` subset of training data (~5,000 h))
- `validation`

Test splits:
- `test`

Also available are subsets of the train split, which can be accessed by setting the `subconfig` argument:

```python
spgispeech = load_dataset("esb/datasets", "spgispeech", subconfig="s", use_auth_token=True)
```

- `s`: small subset of training data (~200 h)
- `m`: medium subset of training data (~1,000 h)

## Earnings-22

Earnings-22 is a 119-hour corpus of English-language earnings calls collected from global companies, with speakers of many different nationalities and accents. It is licensed under CC-BY-SA-4.0.

Example usage:

```python
earnings22 = load_dataset("esb/datasets", "earnings22")
```

Training/validation splits:
- `train`
- `validation`

Test splits:
- `test`

## AMI

The AMI Meeting Corpus consists of 100 hours of meeting recordings from multiple recording devices synced to a common timeline. It is licensed under CC-BY-4.0.

Example usage:

```python
ami = load_dataset("esb/datasets", "ami")
```

Training/validation splits:
- `train`
- `validation`

Test splits:
- `test`
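
The `subconfig` values listed in the LibriSpeech, GigaSpeech and SPGISpeech sections differ per dataset, so it can help to check a requested value before calling `load_dataset`. The sketch below is only an illustration: `SUBCONFIGS` is transcribed from the lists in this card, and `subconfig_kwargs` is a hypothetical helper, not something the ESB loader provides.

```python
# Training subsets per dataset, transcribed from the sections above.
SUBCONFIGS = {
    "librispeech": ("clean.100", "clean.360", "other.500"),
    "gigaspeech": ("xs", "s", "m", "xl"),
    "spgispeech": ("s", "m"),
}


def subconfig_kwargs(name, subconfig=None):
    """Return the extra keyword arguments for a training subset, or {} for the default."""
    if subconfig is None:
        return {}
    allowed = SUBCONFIGS.get(name, ())
    if subconfig not in allowed:
        raise ValueError(
            f"{name!r} has no subconfig {subconfig!r}; available: {allowed or 'none'}"
        )
    return {"subconfig": subconfig}


print(subconfig_kwargs("gigaspeech", "xs"))  # {'subconfig': 'xs'}
```

A check like this catches mismatches early, for example requesting `xs` from `spgispeech`, which lists only `s` and `m`.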
Provider:
open-asr-leaderboard
Summary of original information

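Dataset overview

Basic information

  • Name: datasets
  • Language: English (en)
  • Language creators: crowdsourced and expert-generated
  • Licenses: cc-by-4.0, apache-2.0, cc0-1.0, cc-by-nc-3.0, other
  • Multilinguality: monolingual
  • Size: 100K<n<1M and 1M<n<10M
  • Source datasets: original; extended from librispeech_asr and common_voice
  • Tags: asr, benchmark, speech, esb
  • Task category: automatic speech recognition

Dataset contents

  • Data fields:
    • dataset: name of the ESB dataset from which the sample is taken.
    • audio: the path to the downloaded audio file, the decoded audio array, and the sampling rate.
    • text: the transcription of the audio file.
    • id: unique identifier of the data sample.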
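
As a small illustration of how these fields fit together, the duration of a sample follows from the length of the decoded array divided by the sampling rate. The sample below is synthetic (a plain list stands in for the decoded array, and the path, text and id are placeholders), and `audio_duration_seconds` is a hypothetical helper.

```python
def audio_duration_seconds(sample):
    """Duration of one sample: number of audio frames divided by the sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for a decoded data point: 2 seconds of silence at 16 kHz.
sample = {
    "dataset": "librispeech",
    "audio": {"path": "example.flac", "array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "placeholder transcription",
    "id": "example-0000",
}

print(audio_duration_seconds(sample))  # 2.0
```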

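Data preparation

  • Audio: the audio for all ESB datasets is segmented into sample lengths suitable for training ASR systems. Audio files are decoded on the fly and converted to Python arrays, so no further preparation is needed before use in training/evaluation scripts.
  • Transcriptions: the transcriptions are provided in 'error corrected' format and need no further pre-processing before use in training/evaluation scripts. Transcriptions are provided for the training and validation splits but not for the test splits.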
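
Because audio is decoded on access, the order of indexing matters: the card advises `dataset[0]["audio"]` over `dataset["audio"][0]`. The toy class below is not the Hugging Face implementation; it is a simplified model, with hypothetical names, that counts decode calls to show why row-first access is cheaper.

```python
class LazyAudioTable:
    """Toy stand-in for a dataset whose audio column is decoded on access."""

    def __init__(self, paths):
        self._paths = list(paths)
        self.decode_calls = 0

    def _decode(self, path):
        # Placeholder for reading and resampling an audio file.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested sample.
            return {"audio": self._decode(self._paths[key])}
        if key == "audio":
            # Column access: decode every sample before any indexing happens.
            return [self._decode(p) for p in self._paths]
        raise KeyError(key)


row_first = LazyAudioTable(f"clip-{i}.flac" for i in range(1000))
row_first[0]["audio"]

column_first = LazyAudioTable(f"clip-{i}.flac" for i in range(1000))
column_first["audio"][0]

print(row_first.decode_calls, column_first.decode_calls)  # 1 1000
```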

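Access and usage

  • All eight ESB datasets are freely accessible, but three of them (Common Voice, GigaSpeech, SPGISpeech) have specific terms of usage that must be accepted by filling in the access form on the dataset's page before the data can be used.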
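
The gating rule can be written down as a small lookup. This is a sketch under stated assumptions: the URLs and the `use_auth_token=True` argument are taken from the card's own examples, while `GATED_DATASETS`, `access_form` and `auth_kwargs` are hypothetical names. Newer releases of the `datasets` library accept `token=` in place of `use_auth_token=`, so the keyword may need adjusting for the installed version.

```python
# Access-form pages for the three gated datasets, as listed in the card.
GATED_DATASETS = {
    "common_voice": "https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0",
    "gigaspeech": "https://huggingface.co/datasets/speechcolab/gigaspeech",
    "spgispeech": "https://huggingface.co/datasets/kensho/spgispeech",
}


def access_form(name):
    """Return the page where the terms must be accepted, or None if not gated."""
    return GATED_DATASETS.get(name)


def auth_kwargs(name):
    """Extra load_dataset keyword arguments for gated datasets (card's convention)."""
    return {"use_auth_token": True} if name in GATED_DATASETS else {}


print(auth_kwargs("gigaspeech"))   # {'use_auth_token': True}
print(auth_kwargs("librispeech"))  # {}
```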

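Diagnostic dataset

  • ESB includes an 8-hour diagnostic dataset of in-domain validation data, with audio sampled from the ESB validation sets across a range of domains and speaking styles. The transcriptions follow a consistent style guide and come in two formats: normalised and un-normalised.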

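Dataset details

| Dataset      | Domain                      | Speaking Style        | Train (h) | Dev (h) | Test (h) | Transcriptions     | License         |
|--------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| LibriSpeech  | Audiobook                   | Narrated              | 960       | 11      | 11       | Normalised         | CC-BY-4.0       |
| Common Voice | Wikipedia                   | Narrated              | 1409      | 27      | 27       | Punctuated & Cased | CC0-1.0         |
| VoxPopuli    | European Parliament         | Oratory               | 523       | 5       | 5        | Punctuated         | CC0             |
| TED-LIUM     | TED talks                   | Oratory               | 454       | 2       | 3        | Normalised         | CC-BY-NC-ND 3.0 |
| GigaSpeech   | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500      | 12      | 40       | Punctuated         | apache-2.0      |
| SPGISpeech   | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
| Earnings-22  | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
| AMI          | Meetings                    | Spontaneous           | 78        | 9       | 9        | Punctuated & Cased | CC-BY-4.0       |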
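
For a sense of overall scale, the hour columns of the table can be summed. The figures below are copied from the table; `HOURS` is just an illustrative name.

```python
# (train, dev, test) hours per dataset, copied from the table above.
HOURS = {
    "LibriSpeech": (960, 11, 11),
    "Common Voice": (1409, 27, 27),
    "VoxPopuli": (523, 5, 5),
    "TED-LIUM": (454, 2, 3),
    "GigaSpeech": (2500, 12, 40),
    "SPGISpeech": (4900, 100, 100),
    "Earnings-22": (105, 5, 5),
    "AMI": (78, 9, 9),
}

# Column-wise sums across all eight datasets.
train_total, dev_total, test_total = (sum(column) for column in zip(*HOURS.values()))
print(train_total, dev_total, test_total)  # 10929 171 200
```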

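Loading example

  • Load a dataset with the load_dataset function, for example: `librispeech = load_dataset("esb/datasets", "librispeech")`

  • The dataset name and split (train/validation/test) can be chosen as needed.

Dataset subsets

  • Some datasets, such as librispeech and gigaspeech, provide subset configurations, which can be accessed by setting the subconfig argument.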
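Collected summary
Background and challenges
Background overview
This dataset is the collection of automatic speech recognition (ASR) benchmark datasets from the ESB (End-to-end Speech Benchmark) project. It comprises eight sub-datasets (such as LibriSpeech and Common Voice) covering multiple domains and speaking styles, with training sets ranging from 78 to about 5,000 hours, released under a variety of licenses. The data is pre-processed into a format suitable for ASR training and evaluation and can be downloaded through the Hugging Face Datasets library, although some sub-datasets require additional access authorization.
The content above was collected and summarized by 遇见数据集.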