maxseats/aihub-464-preprocessed-680GB-set-10-discard
Source: Hugging Face | Updated: 2024-07-01 | Indexed: 2024-07-06
Download link:
https://hf-mirror.com/datasets/maxseats/aihub-464-preprocessed-680GB-set-10-discard
Resource summary:
The dataset provides four features: audio, transcripts, input features, and labels. The audio has a sampling rate of 16000 Hz, transcripts are strings, and input features and labels are sequences of float32 and int64 values respectively. The data is divided into train, test, and valid splits containing 44064, 5508, and 5508 samples respectively. The download size is 22014330857 bytes and the total dataset size is 63554709015.0 bytes. However, this dataset was discarded because the order of the audio and the labels does not match, and an error occurred during preprocessing.
Provider:
maxseats
Original information summary
Dataset overview
Dataset information
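- Features:
  - audio: audio data, sampled at 16000 Hz.
  - transcripts: text data, string type.
  - input_features: input features, a sequence of floating-point numbers.
  - labels: labels, a sequence of integers.
- Dataset splits:
  - train: training set, 44064 samples, 50843767212.0 bytes.
  - test: test set, 5508 samples, 6355470901.5 bytes.
  - valid: validation set, 5508 samples, 6355470901.5 bytes.
- Dataset size:
  - Download size: 22014330857 bytes.
  - Total dataset size: 63554709015.0 bytes.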
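The split figures above can be cross-checked with a little arithmetic. The following is a minimal sketch that only uses the numbers quoted in this listing; the constant and function names are illustrative and are not part of the dataset.

```python
# Figures copied from the listing above; names are illustrative only.
SPLIT_SAMPLES = {"train": 44064, "test": 5508, "valid": 5508}
SPLIT_BYTES = {"train": 50843767212.0, "test": 6355470901.5, "valid": 6355470901.5}
TOTAL_BYTES = 63554709015.0
DOWNLOAD_BYTES = 22014330857


def split_fractions(samples):
    """Return each split's share of the total sample count."""
    total = sum(samples.values())
    return {name: count / total for name, count in samples.items()}


fractions = split_fractions(SPLIT_SAMPLES)       # share of samples per split
bytes_sum = sum(SPLIT_BYTES.values())            # should equal TOTAL_BYTES
download_ratio = DOWNLOAD_BYTES / TOTAL_BYTES    # download size vs. total size

if __name__ == "__main__":
    for name, frac in fractions.items():
        print(f"{name}: {frac:.2%} of samples")
    print("split bytes sum to total:", bytes_sum == TOTAL_BYTES)
    print(f"download / total size: {download_ratio:.3f}")
```

The per-split byte counts follow the same proportions as the sample counts, which would be consistent with sizes estimated from the total rather than measured per split; the listing itself does not say which.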
Configuration
- Config name: default
- Data file paths:
  - train: data/train-*
  - test: data/test-*
  - valid: data/valid-*
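As a rough illustration of how the glob patterns in this configuration map files to splits, here is a hedged sketch. The helper names `split_for` and `load_splits` are hypothetical, and the shard filename in the usage note is made up for the example. `load_splits` assumes the third-party Hugging Face `datasets` package is installed and the repository is still reachable; given the status noted in this listing, the loaded labels should not be trusted.

```python
from fnmatch import fnmatch

# Repository id and patterns as given in this listing.
REPO_ID = "maxseats/aihub-464-preprocessed-680GB-set-10-discard"
DATA_FILES = {
    "train": "data/train-*",
    "test": "data/test-*",
    "valid": "data/valid-*",
}


def split_for(path):
    """Return the split whose glob pattern matches `path`, or None."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_splits():
    """Hypothetical loader: needs network access and the `datasets` package."""
    from datasets import load_dataset  # imported lazily; third-party dependency
    return load_dataset(REPO_ID, data_files=DATA_FILES)
```

For instance, `split_for("data/train-00000-of-00010.parquet")` resolves to `"train"` (the filename is an assumed example), while a path outside `data/` resolves to `None`.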
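Dataset status
- This dataset was discarded because the order of the audio and the labels does not match.
- An error occurred during preprocessing; at present, labels and data match for set_28 and earlier versions.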
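One way to spot the kind of audio/label misalignment described above is to decode each row's labels and compare the result with its transcript. The sketch below is only an illustration under stated assumptions: the listing does not say which tokenizer produced `labels`, so the decoder is passed in as a parameter, and `mismatch_rate`, `toy_decode`, and the toy vocabulary are all hypothetical stand-ins.

```python
def mismatch_rate(rows, decode):
    """Fraction of rows whose decoded labels differ from the transcript.

    `rows` is an iterable of dicts with "transcripts" and "labels" keys;
    `decode` maps a sequence of label ids to text (tokenizer not specified
    by the listing, so it is supplied by the caller).
    """
    rows = list(rows)
    if not rows:
        return 0.0
    bad = sum(
        1 for row in rows
        if decode(row["labels"]).strip() != row["transcripts"].strip()
    )
    return bad / len(rows)


# Toy stand-in for a real tokenizer, for demonstration only.
TOY_VOCAB = {0: "hello", 1: "world", 2: "good", 3: "morning"}


def toy_decode(ids):
    return " ".join(TOY_VOCAB[i] for i in ids)


aligned = [
    {"transcripts": "hello world", "labels": [0, 1]},
    {"transcripts": "good morning", "labels": [2, 3]},
]
# Same rows with the label column shifted by one position.
shifted = [
    {"transcripts": "hello world", "labels": [2, 3]},
    {"transcripts": "good morning", "labels": [0, 1]},
]
```

A rate near zero on a sample of rows would suggest the labels line up with the transcripts; a rate near one, as with the `shifted` toy rows, would suggest an ordering problem like the one reported for this set.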



