hf-audio/esb-datasets-test-only-sorted

Name: hf-audio/esb-datasets-test-only-sorted
Creator: hf-audio
Published: 2025-10-30 11:20:24
License: not specified

Hugging Face. Updated 2025-10-30. Indexed 2024-07-06.

Download link:

https://hf-mirror.com/datasets/hf-audio/esb-datasets-test-only-sorted

Summary:

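The ESB test dataset is a speech recognition dataset made up of multiple subsets that cover different domains and speaking styles. The subsets are LibriSpeech, Common Voice, VoxPopuli, TED-LIUM, GigaSpeech, SPGISpeech, Earnings-22, and AMI. Each subset contains audio files with their corresponding transcripts, and the audio is sampled at 16000 Hz. The data is sorted by audio length from longest to shortest, and the format has been converted from a custom loading script to parquet for safety. The dataset is loaded through the Hugging Face Datasets library. Some subsets (Common Voice, GigaSpeech, SPGISpeech) require users to agree to specific terms of use before access.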

This dataset is derived from the [open-asr-leaderboard/datasets-test-only](hf.co/datasets/open-asr-leaderboard/datasets-test-only) data, sorted by audio length. The format has been changed from a custom loading script (unsafe remote code) to parquet (safe). The dataset includes multiple configurations, each with its own features and splits. The main features are the audio, dataset name, text, unique identifier, and audio length. Every split is a test set, listed with its size in bytes and its number of examples, and the download size and dataset size are given for each configuration. Specific terms and conditions apply to accessing and using the Common Voice, GigaSpeech, and SPGISpeech data.
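Since the data is stored as parquet rather than behind a loading script, it can in principle be loaded with the Hugging Face Datasets library without enabling remote code. The following is a minimal sketch, not a confirmed recipe: the lowercase configuration names in `ASSUMED_CONFIGS` and the helper `load_esb_split` are assumptions introduced here for illustration, and the exact configuration names should be checked on the dataset page.

```python
REPO_ID = "hf-audio/esb-datasets-test-only-sorted"

# Assumed configuration names (not confirmed by this page); verify on the Hub.
ASSUMED_CONFIGS = (
    "ami", "common_voice", "earnings22", "gigaspeech",
    "librispeech", "spgispeech", "tedlium", "voxpopuli",
)


def load_esb_split(config, split="test", streaming=True, token=None):
    """Load one split of one configuration (hypothetical helper).

    streaming=True avoids downloading a whole configuration up front.
    token can be a Hugging Face access token; it may be needed for the
    subsets whose terms of use must be accepted first.
    """
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID, config, split=split, streaming=streaming, token=token
    )
```

With network access and the `datasets` package installed, a call such as `load_esb_split("librispeech", split="test.clean")` would be expected to return an iterable dataset whose items expose the `audio`, `dataset`, `text`, `id`, and `audio_length_s` fields.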

Provider:

hf-audio

Original information summary

ESB Test Sets: Parquet & Sorted

Dataset overview

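This dataset contains multiple speech recognition test sets. The audio in each set is sampled at 16000 Hz, and the examples are sorted by audio length. The data format has been converted from a custom loading script to Parquet.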
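The sort order can be checked on the `audio_length_s` values alone. The sketch below assumes the longest-to-shortest direction given in the summary, exposes the direction as a parameter in case a split is ordered the other way, and uses made-up durations rather than real data. The function name `is_sorted_by_length` is hypothetical.

```python
from typing import Iterable


def is_sorted_by_length(lengths: Iterable[float], descending: bool = True) -> bool:
    """Return True if consecutive lengths never break the chosen order."""
    values = list(lengths)
    pairs = zip(values, values[1:])
    if descending:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)


# Hypothetical durations in seconds, standing in for the audio_length_s column.
sample_lengths = [34.9, 20.0, 20.0, 7.25, 1.1]

print(is_sorted_by_length(sample_lengths))                    # True
print(is_sorted_by_length(sample_lengths, descending=False))  # False
```

Ties are accepted in either direction, since several clips may share the same duration.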

Dataset configurations

AMI

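Features:
- audio: audio data, sampling rate 16000 Hz
- dataset: dataset name
- text: transcription text
- id: unique identifier
- audio_length_s: audio length in seconds
Splits:
- test: 12643 examples, 7313111878.091001 bytes
Download size: 1300234949 bytes
Dataset size: 7313111878.091001 bytes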
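The five features are the same in every configuration, so one record type can describe an example from any of them. The sketch below is an illustration under stated assumptions: `EsbExample` and `duration_matches` are hypothetical names, the audio is represented as a plain sequence of mono samples plus a sampling rate (the actual decoded representation depends on the library version), and the example values are invented.

```python
from dataclasses import dataclass
from typing import Sequence

EXPECTED_SAMPLING_RATE = 16_000  # Hz, as stated for every configuration


@dataclass
class EsbExample:
    samples: Sequence[float]  # decoded waveform, assumed mono
    sampling_rate: int        # from the audio feature
    dataset: str              # name of the source dataset
    text: str                 # transcription
    id: str                   # unique identifier
    audio_length_s: float     # audio length in seconds


def duration_matches(example: EsbExample, tolerance_s: float = 0.01) -> bool:
    """Check the sampling rate and compare audio_length_s to the waveform length."""
    if example.sampling_rate != EXPECTED_SAMPLING_RATE:
        return False
    computed = len(example.samples) / example.sampling_rate
    return abs(computed - example.audio_length_s) <= tolerance_s


# Invented example: 32000 samples at 16000 Hz is 2 seconds of audio.
example = EsbExample(
    samples=[0.0] * 32_000,
    sampling_rate=16_000,
    dataset="ami",
    text="hypothetical transcript",
    id="hypothetical-id-0001",
    audio_length_s=2.0,
)
print(duration_matches(example))  # True
```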

Common Voice

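Features:
- audio: audio data, sampling rate 16000 Hz
- dataset: dataset name
- text: transcription text
- id: unique identifier
- audio_length_s: audio length in seconds
Splits:
- test: 16334 examples, 1312573669.596 bytes
Download size: 720365151 bytes
Dataset size: 1312573669.596 bytes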

Earnings22

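Features:
- audio: audio data, sampling rate 16000 Hz
- dataset: dataset name
- text: transcription text
- id: unique identifier
- audio_length_s: audio length in seconds
Splits:
- test: 2741 examples, 2066334357.212 bytes
Download size: 1103990916 bytes
Dataset size: 2066334357.212 bytes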

GigaSpeech

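Features:
- audio: audio data, sampling rate 16000 Hz
- dataset: dataset name
- text: transcription text
- id: unique identifier
- audio_length_s: audio length in seconds
Splits:
- test: 19931 examples, 9091854759.2 bytes
Download size: 4034348699 bytes
Dataset size: 9091854759.2 bytes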

LibriSpeech

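Features:
- audio: audio data, sampling rate 16000 Hz
- dataset: dataset name
- text: transcription text
- id: unique identifier
- audio_length_s: audio length in seconds
Splits:
- test.clean: 2620 examples, 367597326.0 bytes
- test.other: 2939 examples, 352273450.594 bytes
Download size: 683412729 bytes
Dataset size: 719870776.594 bytes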

SPGISpeech

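Features:
- audio: audio data, sampling rate 16000 Hz
- dataset: dataset name
- text: transcription text
- id: unique identifier
- audio_length_s: audio length in seconds
Splits:
- test: 39341 examples, 18550272806.201 bytes
Download size: 11377636910 bytes
Dataset size: 18550272806.201 bytes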

TEDLIUM

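Features:
- audio: audio data, sampling rate 16000 Hz
- dataset: dataset name
- text: transcription text
- id: unique identifier
- audio_length_s: audio length in seconds
Splits:
- test: 1155 examples, 301767478.0 bytes
Download size: 301630209 bytes
Dataset size: 301767478.0 bytes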

VoxPopuli

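Features:
- audio: audio data, sampling rate 16000 Hz
- dataset: dataset name
- text: transcription text
- id: unique identifier
- audio_length_s: audio length in seconds
Splits:
- test: 1842 examples, 1612296642.268 bytes
Download size: 944084987 bytes
Dataset size: 1612296642.268 bytes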
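The per-configuration figures above can be tallied to plan download and storage needs. The sketch below copies the example counts and byte sizes listed on this page into a dictionary and derives totals and a download-to-dataset size ratio. The helper names and the dictionary layout are illustrative choices, not part of the dataset.

```python
# Figures copied from the configuration listings above.
CONFIGS = {
    "AMI": {"splits": {"test": 12643},
            "download_bytes": 1300234949, "dataset_bytes": 7313111878.091001},
    "Common Voice": {"splits": {"test": 16334},
                     "download_bytes": 720365151, "dataset_bytes": 1312573669.596},
    "Earnings22": {"splits": {"test": 2741},
                   "download_bytes": 1103990916, "dataset_bytes": 2066334357.212},
    "GigaSpeech": {"splits": {"test": 19931},
                   "download_bytes": 4034348699, "dataset_bytes": 9091854759.2},
    "LibriSpeech": {"splits": {"test.clean": 2620, "test.other": 2939},
                    "download_bytes": 683412729, "dataset_bytes": 719870776.594},
    "SPGISpeech": {"splits": {"test": 39341},
                   "download_bytes": 11377636910, "dataset_bytes": 18550272806.201},
    "TEDLIUM": {"splits": {"test": 1155},
                "download_bytes": 301630209, "dataset_bytes": 301767478.0},
    "VoxPopuli": {"splits": {"test": 1842},
                  "download_bytes": 944084987, "dataset_bytes": 1612296642.268},
}


def total_examples(configs: dict) -> int:
    """Sum the example counts over every split of every configuration."""
    return sum(sum(cfg["splits"].values()) for cfg in configs.values())


def size_ratio(cfg: dict) -> float:
    """Download size divided by dataset size for one configuration."""
    return cfg["download_bytes"] / cfg["dataset_bytes"]


print("total examples:", total_examples(CONFIGS))  # total examples: 99546
for name, cfg in CONFIGS.items():
    gib = cfg["download_bytes"] / 2**30
    print(f"{name:<13} {sum(cfg['splits'].values()):>6} examples  "
          f"{gib:6.2f} GiB download  ratio {size_ratio(cfg):.2f}")
```

A ratio close to 1 (as for TEDLIUM) means the downloaded files are about as large as the prepared dataset, while a lower ratio (as for AMI) means the download is much smaller than the prepared data.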