Subhadeep/common_voice_11_0_hi_pseudo_labelled

Name: Subhadeep/common_voice_11_0_hi_pseudo_labelled
Creator: Subhadeep
Published: 2023-11-22 10:31:21
License: No description available

Source: Hugging Face. Updated 2023-11-22. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Subhadeep/common_voice_11_0_hi_pseudo_labelled

Resource overview:

---
dataset_info:
  config_name: hi
  features:
  - name: client_id
    dtype: string
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: sentence
    dtype: string
  - name: up_votes
    dtype: int64
  - name: down_votes
    dtype: int64
  - name: age
    dtype: string
  - name: gender
    dtype: string
  - name: accent
    dtype: string
  - name: locale
    dtype: string
  - name: segment
    dtype: string
  - name: whisper_transcript
    sequence: int64
  splits:
  - name: train
    num_bytes: 131053542.138
    num_examples: 4361
  - name: validation
    num_bytes: 64148344.509
    num_examples: 2179
  - name: test
    num_bytes: 100961651.174
    num_examples: 2894
  download_size: 260542039
  dataset_size: 296163537.821
configs:
- config_name: hi
  data_files:
  - split: train
    path: hi/train-*
  - split: validation
    path: hi/validation-*
  - split: test
    path: hi/test-*
---

Dataset information (configuration name: hi).

Features: client_id (client ID), string; path (file path), string; audio (audio data), audio type with a sampling rate of 16,000 Hz; sentence (sentence text), string; up_votes (number of up-votes), 64-bit integer; down_votes (number of down-votes), 64-bit integer; age, string; gender, string; accent, string; locale (language locale), string; segment (speech segment), string; whisper_transcript (Whisper transcription sequence), sequence of 64-bit integers.

Splits: train (training set), 131053542.138 bytes, 4361 examples; validation (validation set), 64148344.509 bytes, 2179 examples; test (test set), 100961651.174 bytes, 2894 examples.

Download size: 260542039 bytes. Total dataset size: 296163537.821 bytes.

Configurations: hi, with data files for the train split at hi/train-*, the validation split at hi/validation-*, and the test split at hi/test-*.

Provider:

Subhadeep

Summary of original information

Dataset overview

Configuration name

config_name: hi

Feature information

client_id: string
path: string
audio: audio, sampling rate 16000 Hz
sentence: string
up_votes: 64-bit integer
down_votes: 64-bit integer
age: string
gender: string
accent: string
locale: string
segment: string
whisper_transcript: sequence of 64-bit integers
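
The feature list above can be turned into a small row checker. The following is a minimal sketch, not part of the dataset itself: the names `EXPECTED_TYPES`, `schema_problems` and `EXAMPLE_ROW` are hypothetical, the example row is invented for illustration, and the check assumes that no column holds a null value. The audio column is skipped because its decoded representation depends on the library used to read the data.

```python
from typing import Any, Mapping

# Expected Python type per scalar column, transcribed from the feature list.
# "audio" is deliberately left out (its decoded form is library-dependent).
EXPECTED_TYPES: dict[str, type] = {
    "client_id": str, "path": str, "sentence": str,
    "up_votes": int, "down_votes": int,
    "age": str, "gender": str, "accent": str,
    "locale": str, "segment": str,
}


def schema_problems(row: Mapping[str, Any]) -> list[str]:
    """Return a list of schema violations for one row (empty if none)."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    # whisper_transcript is declared as a sequence of int64 values.
    transcript = row.get("whisper_transcript")
    if not isinstance(transcript, (list, tuple)):
        problems.append("whisper_transcript: expected a sequence")
    elif not all(isinstance(value, int) for value in transcript):
        problems.append("whisper_transcript: expected integer elements")
    return problems


# Invented row, for illustration only; these values are not from the dataset.
EXAMPLE_ROW = {
    "client_id": "abc123", "path": "clip_0001.mp3", "sentence": "example",
    "up_votes": 2, "down_votes": 0, "age": "", "gender": "", "accent": "",
    "locale": "hi", "segment": "", "whisper_transcript": [1, 2, 3],
}

print(schema_problems(EXAMPLE_ROW))
```

Since `whisper_transcript` is stored as integers rather than text, it presumably holds token IDs produced by a Whisper tokenizer; the listing does not say which tokenizer, so that remains an assumption.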

Data splits

train:
- Bytes: 131053542.138
- Examples: 4361
validation:
- Bytes: 64148344.509
- Examples: 2179
test:
- Bytes: 100961651.174
- Examples: 2894
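
To see how the examples are distributed across the splits, the counts above can be converted to fractions. This is a small sketch using only the numbers listed; the names `SPLIT_EXAMPLES` and `split_fractions` are hypothetical.

```python
# Example counts per split, copied from the listing above.
SPLIT_EXAMPLES = {"train": 4361, "validation": 2179, "test": 2894}


def split_fractions(counts: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_examples = sum(SPLIT_EXAMPLES.values())
fractions = split_fractions(SPLIT_EXAMPLES)

for name, frac in fractions.items():
    print(f"{name:<10} {SPLIT_EXAMPLES[name]:>5} examples  {frac:6.1%}")
print(f"{'total':<10} {total_examples:>5} examples")
```

The three splits add up to 9,434 examples, of which roughly 46% are in train, 23% in validation and 31% in test.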

Data size

Download size: 260542039 bytes
Dataset size: 296163537.821 bytes
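
The byte counts are easier to read in MiB, and the per-split sizes can be checked against the reported dataset size. The sketch below uses only the figures from this listing; the names `SPLIT_BYTES`, `to_mib`, `splits_total` and `download_ratio` are hypothetical.

```python
# Sizes in bytes, copied from the listing.
SPLIT_BYTES = {
    "train": 131053542.138,
    "validation": 64148344.509,
    "test": 100961651.174,
}
DOWNLOAD_SIZE = 260_542_039
DATASET_SIZE = 296_163_537.821


def to_mib(num_bytes: float) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / 2**20


splits_total = sum(SPLIT_BYTES.values())
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"download size: {to_mib(DOWNLOAD_SIZE):.1f} MiB")
print(f"dataset size:  {to_mib(DATASET_SIZE):.1f} MiB")
print(f"sum of splits: {to_mib(splits_total):.1f} MiB")
print(f"download / dataset ratio: {download_ratio:.2f}")
```

The three split sizes sum to the reported dataset size, and the download is about 248 MiB against about 282 MiB for the prepared dataset, a ratio of roughly 0.88.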

Data file paths by configuration

config_name: hi
- train: hi/train-*
- validation: hi/validation-*
- test: hi/test-*
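
The path patterns above are shell-style globs that assign files to splits. The following sketch shows how a file path could be matched to its split, and how one split might be loaded with the Hugging Face `datasets` library. The helper names `split_for_file` and `load_split` are hypothetical, the file names in the usage lines are invented examples, and `load_split` assumes that `datasets` is installed and that the repository is reachable.

```python
from fnmatch import fnmatchcase
from typing import Optional

REPO_ID = "Subhadeep/common_voice_11_0_hi_pseudo_labelled"
CONFIG_NAME = "hi"

# Glob pattern per split, copied from the configuration above.
SPLIT_PATTERNS = {
    "train": "hi/train-*",
    "validation": "hi/validation-*",
    "test": "hi/test-*",
}


def split_for_file(path: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in SPLIT_PATTERNS.items():
        if fnmatchcase(path, pattern):
            return split
    return None


def load_split(split: str = "train"):
    """Load one split from the Hub (needs `datasets` and network access)."""
    if split not in SPLIT_PATTERNS:
        raise ValueError(f"unknown split: {split!r}")
    from datasets import load_dataset  # imported lazily; optional dependency
    return load_dataset(REPO_ID, CONFIG_NAME, split=split)


# Invented file names, used only to exercise the patterns.
print(split_for_file("hi/train-00000-of-00001.parquet"))
print(split_for_file("README.md"))
```

Calling `load_split("validation")` would download the data on first use, so it is left out of the example run.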
