hf-audio/esb-datasets-test-only

Name: hf-audio/esb-datasets-test-only
Creator: hf-audio
Published: 2023-08-29 12:45:54
License: No description available

Hugging Face, updated 2023-08-29, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/hf-audio/esb-datasets-test-only

Resource overview:

The ESB dataset is a multi-dataset collection for automatic speech recognition (ASR) that includes sub-datasets such as LibriSpeech, Common Voice, and VoxPopuli. These sub-datasets cover diverse domains and speaking styles, including audiobooks, Wikipedia articles, and European Parliament speeches. Each one provides training, validation, and test splits together with the corresponding transcriptions. The data is loaded and prepared through the Hugging Face Datasets library, and some of the sub-datasets require accepting a specific usage agreement.

Provided by:

hf-audio

Summary of original information

Dataset overview

Basic information

Name: datasets
Language: English (en)
Language creators: crowdsourced and expert-generated
License: cc-by-4.0, apache-2.0, cc0-1.0, cc-by-nc-3.0, other
Multilinguality: monolingual
Size: 100K<n<1M and 1M<n<10M
Source datasets: original, extended from librispeech_asr and common_voice
Tags: asr, benchmark, speech, esb
Task category: automatic speech recognition

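Dataset contents

Data point structure:
- dataset: name of the source dataset
- audio: the audio file path, the decoded audio array, and the sampling rate
- text: transcription of the audio file
- id: unique ID of the data sample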
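
The four fields listed above can be inspected with a few lines of Python. The following is a minimal sketch, assuming the audio entry is a dictionary with the keys "path", "array", and "sampling_rate" (the usual layout of audio columns in the Hugging Face Datasets library). The helper name describe_sample and the synthetic sample are hypothetical and serve only as illustration.

```python
def describe_sample(sample):
    """Summarise one data point: source dataset, ID, duration, word count."""
    audio = sample["audio"]
    # Duration in seconds = number of decoded samples / samples per second.
    duration = len(audio["array"]) / audio["sampling_rate"]
    return {
        "dataset": sample["dataset"],
        "id": sample["id"],
        "duration_seconds": duration,
        "num_words": len(sample["text"].split()),
    }


# Synthetic stand-in for a real data point (hypothetical values).
example = {
    "dataset": "librispeech",
    "audio": {"path": "example.flac", "array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "hello world",
    "id": "example-0001",
}

summary = describe_sample(example)
print(summary)
```

Because the helper only relies on len() and dictionary access, it should work the same way whether the decoded array is a Python list or a NumPy array.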

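Data preparation

Audio: The audio is already segmented into sample lengths suitable for training ASR systems; no further preparation is needed.
Transcriptions: The transcriptions have already been error-corrected; no further processing is needed.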

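Access and use

Access: All eight datasets are freely accessible, but three of them (Common Voice, GigaSpeech, SPGISpeech) have specific terms of use that must be accepted before use.
Use: The datasets are fully prepared and can be passed directly to training/evaluation scripts. Test-set transcriptions are not provided; predictions must be generated and uploaded to the scoring platform.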
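
Since test-set transcriptions are withheld, evaluation comes down to pairing each sample ID with a model prediction. The sketch below shows one way to collect such pairs. The tab-separated layout, the write_predictions helper, and the fake_transcribe stand-in are all assumptions made for illustration; the submission format actually required by the scoring platform is not stated here and should be checked before uploading.

```python
import csv
import io


def write_predictions(samples, transcribe, out_file):
    """Write one 'id<TAB>prediction' row per sample and return the row count.

    `transcribe` is any callable that maps an audio entry to a text string.
    """
    writer = csv.writer(out_file, delimiter="\t")
    writer.writerow(["id", "prediction"])
    count = 0
    for sample in samples:
        writer.writerow([sample["id"], transcribe(sample["audio"])])
        count += 1
    return count


# Hypothetical stand-ins for a real model and real test samples.
def fake_transcribe(audio):
    return audio["path"].replace(".flac", "")


test_samples = [
    {"id": "a", "audio": {"path": "hello.flac"}},
    {"id": "b", "audio": {"path": "world.flac"}},
]

buffer = io.StringIO()
rows_written = write_predictions(test_samples, fake_transcribe, buffer)
print(buffer.getvalue())
```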

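Diagnostic dataset

Description: A small 8-hour diagnostic dataset for evaluating system performance under different speech recognition conditions.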

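Dataset details

Dataset	Domain	Speaking style	Train hours	Dev hours	Test hours	Transcription format	License
LibriSpeech	Audiobooks	Narrated	960	11	11	Normalised	CC-BY-4.0
Common Voice	Wikipedia	Narrated	1409	27	27	Punctuated and cased	CC0-1.0
VoxPopuli	European Parliament	Oratory	523	5	5	Punctuated	CC0
TED-LIUM	TED talks	Oratory	454	2	3	Normalised	CC-BY-NC-ND 3.0
GigaSpeech	Audiobooks, podcasts, YouTube	Narrated, spontaneous	2500	12	40	Punctuated	apache-2.0
SPGISpeech	Financial meetings	Oratory, spontaneous	4900	100	100	Punctuated and cased	User agreement
Earnings-22	Financial meetings	Oratory, spontaneous	105	5	5	Punctuated and cased	CC-BY-SA-4.0
AMI	Meetings	Spontaneous	78	9	9	Punctuated and cased	CC-BY-4.0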
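
As a quick sanity check on the table, the per-split hours can be summed to see how large the benchmark is and how unevenly the training data is distributed. This is a small worked example; the numbers are copied from the table above, and the helper names are illustrative only.

```python
# Hours per split as listed in the table: (train, dev, test).
ESB_HOURS = {
    "LibriSpeech": (960, 11, 11),
    "Common Voice": (1409, 27, 27),
    "VoxPopuli": (523, 5, 5),
    "TED-LIUM": (454, 2, 3),
    "GigaSpeech": (2500, 12, 40),
    "SPGISpeech": (4900, 100, 100),
    "Earnings-22": (105, 5, 5),
    "AMI": (78, 9, 9),
}


def total_hours(hours_by_dataset):
    """Sum the train, dev, and test hours over all datasets."""
    train = sum(h[0] for h in hours_by_dataset.values())
    dev = sum(h[1] for h in hours_by_dataset.values())
    test = sum(h[2] for h in hours_by_dataset.values())
    return {"train": train, "dev": dev, "test": test}


def train_share(hours_by_dataset):
    """Fraction of all training hours contributed by each dataset."""
    total = sum(h[0] for h in hours_by_dataset.values())
    return {name: h[0] / total for name, h in hours_by_dataset.items()}


totals = total_hours(ESB_HOURS)
shares = train_share(ESB_HOURS)
print(totals)  # {'train': 10929, 'dev': 171, 'test': 200}
print(f"SPGISpeech share of training hours: {shares['SPGISpeech']:.1%}")
```

By these figures SPGISpeech alone contributes about 45% of the training hours, while AMI and Earnings-22 together contribute under 2%.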

Dataset loading examples

Each example assumes the import: from datasets import load_dataset

LibriSpeech: librispeech = load_dataset("esb/datasets", "librispeech")
Common Voice: common_voice = load_dataset("esb/datasets", "common_voice", use_auth_token=True)
VoxPopuli: voxpopuli = load_dataset("esb/datasets", "voxpopuli")
TED-LIUM: tedlium = load_dataset("esb/datasets", "tedlium")
GigaSpeech: gigaspeech = load_dataset("esb/datasets", "gigaspeech", use_auth_token=True)
SPGISpeech: spgispeech = load_dataset("esb/datasets", "spgispeech", use_auth_token=True)
Earnings-22: earnings22 = load_dataset("esb/datasets", "earnings22")
AMI: ami = load_dataset("esb/datasets", "ami")
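
The eight calls above differ only in the configuration name and in whether an access token is passed, so they can be folded into one helper. The following is a sketch under stated assumptions: the names ESB_CONFIGS, GATED_CONFIGS, build_load_kwargs, and load_esb are hypothetical, and the repository ID "esb/datasets" is taken from the examples above. Recent releases of the datasets library may expect token=True in place of use_auth_token=True; check which one your installed version accepts.

```python
# Configuration names as used in the loading examples.
ESB_CONFIGS = [
    "librispeech", "common_voice", "voxpopuli", "tedlium",
    "gigaspeech", "spgispeech", "earnings22", "ami",
]

# Configurations whose terms of use must be accepted, hence need a token.
GATED_CONFIGS = {"common_voice", "gigaspeech", "spgispeech"}


def build_load_kwargs(config):
    """Return the keyword arguments for load_dataset for one ESB config."""
    if config not in ESB_CONFIGS:
        raise ValueError(f"Unknown ESB config: {config!r}")
    kwargs = {"path": "esb/datasets", "name": config}
    if config in GATED_CONFIGS:
        # Assumption: older library versions take use_auth_token,
        # newer ones may require token=True instead.
        kwargs["use_auth_token"] = True
    return kwargs


def load_esb(config):
    """Load one ESB config; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily on purpose
    return load_dataset(**build_load_kwargs(config))


for name in ESB_CONFIGS:
    print(name, build_load_kwargs(name))
```

Keeping the argument construction separate from the network call makes it possible to check which configurations require authentication without downloading anything.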
