eustlb/common_voice_17_0_es_pseudo_labelled

Name: eustlb/common_voice_17_0_es_pseudo_labelled
Creator: eustlb
Published: 2024-07-17 08:45:02
License: No description available

Hugging Face; updated 2024-07-17; indexed 2024-06-29

Download link:

https://hf-mirror.com/datasets/eustlb/common_voice_17_0_es_pseudo_labelled

Resource overview:

The dataset has five features: a client ID (client_id), audio data (audio, containing a sample array and a sampling rate), the sentence text (sentence), a condition-on-previous sequence (condition_on_prev), and a Whisper transcript (whisper_transcript). It has a single training split of 64,504 samples, with a total size of 223,838,055,556 bytes and a download size of 168,603,268,559 bytes.

Provider:

eustlb

Summary of original information

Dataset overview

Dataset information

Features

client_id: string
audio: struct
- array: sequence of floats
- sampling_rate: integer
sentence: string
condition_on_prev: sequence of integers
whisper_transcript: string
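
The feature list above can be checked against a single record. The following is a minimal sketch, assuming each record arrives as a plain Python dict whose audio entry holds array and sampling_rate keys. validate_example is a hypothetical helper written for illustration, not part of any library.

```python
from numbers import Integral, Real

# Expected top-level fields, taken from the feature list above.
EXPECTED_FIELDS = (
    "client_id",
    "audio",
    "sentence",
    "condition_on_prev",
    "whisper_transcript",
)


def validate_example(example):
    """Return a list of schema problems found in one record (empty if none)."""
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in example]
    if problems:
        return problems

    # The three text fields are listed as strings.
    for name in ("client_id", "sentence", "whisper_transcript"):
        if not isinstance(example[name], str):
            problems.append(f"{name} should be a string")

    # audio is a struct with a float sequence and an integer sampling rate.
    audio = example["audio"]
    if not isinstance(audio, dict) or "array" not in audio or "sampling_rate" not in audio:
        problems.append("audio should contain 'array' and 'sampling_rate'")
    else:
        if not all(isinstance(x, Real) for x in audio["array"]):
            problems.append("audio.array should be a sequence of floats")
        if not isinstance(audio["sampling_rate"], Integral):
            problems.append("audio.sampling_rate should be an integer")

    # condition_on_prev is listed as a sequence of integers.
    prev = example["condition_on_prev"]
    if prev is None or not all(isinstance(t, Integral) for t in prev):
        problems.append("condition_on_prev should be a sequence of integers")
    return problems
```

An empty return value means the record matches the listed types. The check only mirrors the feature list and says nothing about what the values mean.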

Data splits

train:
- Number of samples: 64504
- Data size: 223838055556 bytes

Dataset size

Download size: 168603268559 bytes
Total dataset size: 223838055556 bytes
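
To put the byte counts in more familiar units, the sketch below converts them to decimal gigabytes (1 GB = 10^9 bytes, an assumption about the unit convention) and derives the average size per sample. With the figures listed above this gives about 224 GB in total, about 169 GB to download, and roughly 3.5 MB per sample.

```python
# Figures copied from the dataset information above.
NUM_SAMPLES = 64504
DATASET_BYTES = 223838055556
DOWNLOAD_BYTES = 168603268559


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9


def summarize():
    """Derive a few convenience figures from the raw byte counts."""
    return {
        "dataset_gb": to_gb(DATASET_BYTES),
        "download_gb": to_gb(DOWNLOAD_BYTES),
        "avg_mb_per_sample": DATASET_BYTES / NUM_SAMPLES / 1e6,
        "download_ratio": DOWNLOAD_BYTES / DATASET_BYTES,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value:.2f}")
```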

Configuration

default:
- Data file path: data/train-*
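
Given the download size, streaming the train split is likely more practical than fetching every file first. The following is a minimal sketch, assuming the Hugging Face datasets library is installed and the repository is reachable. stream_train_split and first_transcripts are hypothetical helper names, and decoding the audio column may need extra audio dependencies.

```python
from itertools import islice

REPO_ID = "eustlb/common_voice_17_0_es_pseudo_labelled"


def stream_train_split(repo_id=REPO_ID):
    """Return a streaming iterator over the train split (needs network access)."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)


def first_transcripts(examples, n=3):
    """Collect (sentence, whisper_transcript) pairs from the first n records."""
    return [(ex["sentence"], ex["whisper_transcript"]) for ex in islice(examples, n)]


if __name__ == "__main__":
    # Streams records lazily instead of downloading the whole dataset up front.
    for sentence, transcript in first_transcripts(stream_train_split()):
        print(sentence, "|", transcript)
```

first_transcripts accepts any iterable of dicts, so it can be tried on a small hand-made list before touching the real data.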

Collected summary

Dataset introduction

Background and challenges

Background overview

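This dataset is a pseudo-labelled Spanish version of the Common Voice project. It contains 64,504 training samples stored in parquet format, with a download size of about 169 GB (about 224 GB in total). Its columns include the audio, the original sentence, and pseudo-label transcripts generated by a Whisper model, which makes it suitable for training or evaluating speech recognition models.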
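
One common way to compare an original sentence with its Whisper pseudo-label is the word error rate (WER). The sketch below is an illustration only: the source does not say that any such comparison or filtering was applied to this dataset, and word_error_rate is a hypothetical helper that lowercases and splits on whitespace without any further text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen for this sketch: empty reference scores 0 or 1.
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Calling word_error_rate(example["sentence"], example["whisper_transcript"]) on a record gives a rough measure of how far the pseudo-label is from the original text.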

The above content was collected and summarized by 遇见数据集.
