KasuleTrevor/subset-common-voice-sw_cleaned
Source: Hugging Face. Updated 2024-07-17; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/KasuleTrevor/subset-common-voice-sw_cleaned
Resource overview:
The dataset provides the following features: client ID, audio path, audio data (sampled at 16,000 Hz), sentence text, up and down votes, speaker demographics (age, gender, accent, locale), segment, variant, a flag marking whether the sentence contains abnormal characters, and the abnormal characters themselves. It is split into training, test, and validation sets of 46,309, 12,084, and 12,105 samples respectively, with a total size of about 2.42 GB.
Provider:
KasuleTrevor
Original information summary
Dataset overview
Features
- client_id: string
- path: string
- audio: audio, sampling rate 16,000 Hz
- sentence: string
- up_votes: 64-bit integer
- down_votes: 64-bit integer
- age: string
- gender: string
- accent: string
- locale: string
- segment: string
- variant: string
- has_abnormal_chars: boolean
- abnormal_chars: string
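The feature list above can be turned into a simple row check. The following is a minimal sketch, not the dataset author's tooling: the `SCHEMA` mapping and the `schema_errors` helper are hypothetical names, the example row contains made-up values, the audio column is left out because its decoded form depends on the loading library, and the check assumes that missing values may appear as `None`.

```python
from typing import Any

# Column name -> expected Python type, following the feature list.
# The audio column is omitted: its decoded form depends on the loader.
SCHEMA: dict[str, type] = {
    "client_id": str,
    "path": str,
    "sentence": str,
    "up_votes": int,
    "down_votes": int,
    "age": str,
    "gender": str,
    "accent": str,
    "locale": str,
    "segment": str,
    "variant": str,
    "has_abnormal_chars": bool,
    "abnormal_chars": str,
}


def schema_errors(row: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one row; empty means it conforms."""
    errors = []
    for name, expected in SCHEMA.items():
        if name not in row:
            errors.append(f"missing column: {name}")
            continue
        value = row[name]
        if value is None:
            # Assumption: nulls are tolerated in any column.
            continue
        # bool is a subclass of int in Python, so reject it for integer columns.
        is_bool_for_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_for_int:
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


# Hypothetical row; none of these values come from the dataset.
example = {
    "client_id": "abc123",
    "path": "clip_0001.mp3",
    "sentence": "Habari ya asubuhi",
    "up_votes": 2,
    "down_votes": 0,
    "age": "",
    "gender": "",
    "accent": "",
    "locale": "sw",
    "segment": "",
    "variant": None,
    "has_abnormal_chars": False,
    "abnormal_chars": "",
}
print(schema_errors(example))
```

A row whose `up_votes` arrives as text rather than an integer would be reported as `up_votes: expected int, got str`.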
Dataset splits
- train: 46309 samples, 1604980438.0476892 bytes
- test: 12084 samples, 430244665.8041084 bytes
- validation: 12105 samples, 384989188.310178 bytes
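To see how the samples are distributed, the split counts listed above can be summed and converted to percentages. This is only a small arithmetic sketch; the variable names are illustrative.

```python
# Sample counts per split, as listed for this dataset.
SPLIT_SAMPLES = {"train": 46309, "test": 12084, "validation": 12105}

total_samples = sum(SPLIT_SAMPLES.values())

# Share of each split as a percentage, rounded to one decimal place.
shares = {
    name: round(100 * count / total_samples, 1)
    for name, count in SPLIT_SAMPLES.items()
}

print(total_samples)  # 70498
print(shares)         # {'train': 65.7, 'test': 17.1, 'validation': 17.2}
```

The test and validation splits are nearly the same size, each a little over a sixth of the data.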
Dataset size
- Download size: 2282507476 bytes
- Total size: 2420214292.1619754 bytes
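The byte counts are easier to read after conversion, and the per-split sizes can be checked against the reported total. The sketch below assumes decimal gigabytes (10^9 bytes) for "GB" and shows binary gibibytes (2^30 bytes) for comparison; the helper names are illustrative.

```python
import math

# Byte counts as reported for this dataset (fractional values kept as given).
SPLIT_BYTES = {
    "train": 1604980438.0476892,
    "test": 430244665.8041084,
    "validation": 384989188.310178,
}
DOWNLOAD_BYTES = 2282507476
TOTAL_BYTES = 2420214292.1619754


def to_gb(n_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 10**9


def to_gib(n_bytes: float) -> float:
    """Convert bytes to binary gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# The three split sizes add up to the reported total, within float precision.
splits_match_total = math.isclose(
    sum(SPLIT_BYTES.values()), TOTAL_BYTES, rel_tol=1e-9
)

print(splits_match_total)
print(f"total:    {to_gb(TOTAL_BYTES):.2f} GB, {to_gib(TOTAL_BYTES):.2f} GiB")
print(f"download: {to_gb(DOWNLOAD_BYTES):.2f} GB")
```

The total comes out at about 2.42 GB in decimal units, while the download is somewhat smaller at about 2.28 GB.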
Configuration
- config_name: default
- data_files:
  - train: data/train-*
  - test: data/test-*
  - validation: data/validation-*
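The configuration maps each split to a glob pattern over files under `data/`. The sketch below illustrates that mapping with the standard-library `fnmatch` module and shows one way to load a split with the Hugging Face `datasets` library. It rests on several assumptions: `split_for` and `load_split` are hypothetical helper names, the shard file names in the usage note are invented for illustration, and `load_split` needs the `datasets` package, network access, and audio decoding support to run.

```python
from fnmatch import fnmatch
from typing import Optional

REPO_ID = "KasuleTrevor/subset-common-voice-sw_cleaned"

# Split name -> glob pattern, as given in the default configuration.
DATA_FILES = {
    "train": "data/train-*",
    "test": "data/test-*",
    "validation": "data/validation-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches a repository file path, if any."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_split(split: str):
    """Load one split from the Hub using the default configuration."""
    # Imported lazily so the pattern helper works without the package installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

For example, `split_for("data/train-00000-of-00004.parquet")` would return `"train"` (the shard name here is made up), and `load_split("validation")` would download and return the validation split.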



