alvanlii/cantonese-youtube-transcription-chunked
Source: Hugging Face; updated 2024-04-29; indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/alvanlii/cantonese-youtube-transcription-chunked
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: labels_1
    dtype: string
  - name: labels_2
    dtype: string
  - name: channel
    dtype: string
  - name: title
    dtype: string
  splits:
  - name: train
    num_bytes: 189408828505.48
    num_examples: 516022
  download_size: 185225339025
  dataset_size: 189408828505.48
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
## Cantonese YouTube Pseudo-Transcription Dataset
- This dataset contains 2000+ hours of Cantonese audio from YouTube.
- Audio is transcribed using `simonl0909/whisper-large-v2-cantonese` (`labels_1`) and `Scrya/whisper-large-v2-cantonese` (`labels_2`), with speculative decoding (`alvanlii/whisper-small-cantonese`).
- All audio files are truncated to a maximum of 30 seconds.
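
The figures in the card metadata allow a rough sanity check of these claims. The sketch below is back-of-the-envelope arithmetic only. It treats the stated "2000+ hours" as a lower bound (the exact total is not given), so the derived average clip length is itself only a lower-bound estimate.

```python
# Values copied from the dataset card metadata.
NUM_EXAMPLES = 516_022
DATASET_SIZE_BYTES = 189_408_828_505.48
SAMPLING_RATE_HZ = 16_000
MAX_CLIP_SECONDS = 30

# Lower bound stated in the card ("2000+ hours"); the true total is not given.
STATED_MIN_HOURS = 2_000

# Longest possible clip, in samples, at the stated sampling rate.
max_samples_per_clip = SAMPLING_RATE_HZ * MAX_CLIP_SECONDS

# Total duration if every clip were exactly 30 seconds long (upper bound).
upper_bound_hours = NUM_EXAMPLES * MAX_CLIP_SECONDS / 3600

# Average clip length implied by the stated minimum of 2000 hours.
min_avg_clip_seconds = STATED_MIN_HOURS * 3600 / NUM_EXAMPLES

# Average stored size of one example, in bytes.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

print(f"max samples per clip:   {max_samples_per_clip}")
print(f"upper bound on hours:   {upper_bound_hours:.1f}")
print(f"min average clip (s):   {min_avg_clip_seconds:.2f}")
print(f"avg bytes per example:  {avg_bytes_per_example:.0f}")
```

With 516022 clips capped at 30 seconds, the total cannot exceed about 4300 hours, and a total of at least 2000 hours implies an average clip of roughly 14 seconds or more.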
#### TO-DOs
- [ ] Split audio based on speakers
- [ ] More data
This dataset contains over 2000 hours of Cantonese audio from YouTube, transcribed using the `simonl0909/whisper-large-v2-cantonese` and `Scrya/whisper-large-v2-cantonese` models with speculative decoding. All audio files are truncated to a maximum of 30 seconds. The features are `audio`, `labels_1`, `labels_2`, `channel`, and `title`, and the dataset provides a single `train` split.
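
Because each clip carries two independent pseudo-transcriptions, one possible use is to compare them and flag clips where the two models disagree. The following is a minimal sketch under stated assumptions, not a method described by the dataset author: `edit_distance`, `label_disagreement`, and `stream_examples` are hypothetical helper names, and normalising by the longer label is an illustrative choice. `stream_examples` relies on the third-party `datasets` library in streaming mode, so that the roughly 185 GB download is not fetched up front.

```python
from itertools import islice
from typing import Dict, Iterator


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def label_disagreement(example: Dict[str, str]) -> float:
    """Edit distance between labels_1 and labels_2, divided by the longer label's length."""
    first, second = example["labels_1"], example["labels_2"]
    longest = max(len(first), len(second))
    if longest == 0:
        return 0.0  # both labels empty: treat as full agreement
    return edit_distance(first, second) / longest


def stream_examples(limit: int) -> Iterator[dict]:
    """Yield the first `limit` training examples without downloading the whole dataset."""
    from datasets import load_dataset  # third-party; imported lazily

    dataset = load_dataset(
        "alvanlii/cantonese-youtube-transcription-chunked",
        split="train",
        streaming=True,
    )
    yield from islice(dataset, limit)
```

A caller could, for instance, iterate over `stream_examples(100)` and keep only the examples whose `label_disagreement` falls below a threshold of their own choosing; any such threshold is an assumption to be tuned, not a value taken from the dataset card.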
Provided by:
alvanlii
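Summary of original information
Cantonese YouTube Pseudo-Transcription Dataset
Dataset information
- Features:
  - audio: audio file, sampling rate 16000 Hz.
  - labels_1: string; transcription produced with simonl0909/whisper-large-v2-cantonese.
  - labels_2: string; transcription produced with Scrya/whisper-large-v2-cantonese.
  - channel: string.
  - title: string.
- Splits:
  - train: training set with 516022 examples, total size 189408828505.48 bytes.
- Download size: 185225339025 bytes.
- Dataset size: 189408828505.48 bytes.
- Configs:
  - default: data files at data/train-*.
Dataset description
- The dataset contains over 2000 hours of Cantonese audio from YouTube.
- Audio is transcribed using simonl0909/whisper-large-v2-cantonese and Scrya/whisper-large-v2-cantonese, with speculative decoding using alvanlii/whisper-small-cantonese.
- All audio files are truncated to a maximum of 30 seconds.
To-dos
- [ ] Split audio based on speakers.
- [ ] Add more data.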



