alvanlii/cantonese-youtube-transcription-chunked

Name: alvanlii/cantonese-youtube-transcription-chunked
Creator: alvanlii
Published: 2024-04-29 04:31:49
License: no description available

Source: Hugging Face. Updated 2024-04-29; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/alvanlii/cantonese-youtube-transcription-chunked


Resource summary:

---
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: labels_1
    dtype: string
  - name: labels_2
    dtype: string
  - name: channel
    dtype: string
  - name: title
    dtype: string
  splits:
  - name: train
    num_bytes: 189408828505.48
    num_examples: 516022
  download_size: 185225339025
  dataset_size: 189408828505.48
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

## Cantonese Youtube Pseudo-Transcription Dataset

- This dataset contains 2000+ hours of Cantonese audio from YouTube.
- Audio is transcribed using `simonl0909/whisper-large-v2-cantonese` (labels_1) and `Scrya/whisper-large-v2-cantonese` (labels_2) with speculative-decoding (`alvanlii/whisper-small-cantonese`).
- All audio files are truncated to a maximum of 30 seconds

#### TO-DOs

- [ ] Split audio based on speakers
- [ ] More data

This dataset contains over 2000 hours of Cantonese audio from YouTube, transcribed with the `simonl0909/whisper-large-v2-cantonese` and `Scrya/whisper-large-v2-cantonese` models using speculative decoding. All audio files are truncated to a maximum of 30 seconds. The features are the audio, two transcript labels (`labels_1` and `labels_2`), the channel, and the title; the dataset provides a single train split.
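Since the download is roughly 185 GB, streaming a few examples is one way to inspect the data before committing to a full download. The following is a minimal sketch, not an official loading recipe: the helper names `summarize_example` and `stream_summaries` are hypothetical, and it assumes the decoded audio column behaves like a mapping with `array` and `sampling_rate` keys, which may differ between releases of the `datasets` library.

```python
from itertools import islice


def summarize_example(example):
    """Return the text fields of one example plus its clip duration in seconds."""
    audio = example["audio"]
    # Assumption: the decoded audio exposes "array" and "sampling_rate" keys.
    duration_s = len(audio["array"]) / audio["sampling_rate"]
    return {
        "channel": example["channel"],
        "title": example["title"],
        "labels_1": example["labels_1"],
        "labels_2": example["labels_2"],
        "duration_s": duration_s,
    }


def stream_summaries(n=3):
    """Summarize the first n training examples without downloading the full set."""
    # Imported lazily so summarize_example works without the library installed.
    from datasets import load_dataset

    ds = load_dataset(
        "alvanlii/cantonese-youtube-transcription-chunked",
        split="train",
        streaming=True,  # iterate over the shards instead of fetching them all
    )
    return [summarize_example(example) for example in islice(ds, n)]
```

Calling `stream_summaries(3)` requires network access and the `datasets` package; `summarize_example` can be tried on any dictionary with the same keys.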

Provider:

alvanlii

Summary of original information

Cantonese YouTube Pseudo-Transcription Dataset

Dataset information

Features:
- audio: audio file, sampling rate 16000 Hz.
- labels_1: string, transcription produced by simonl0909/whisper-large-v2-cantonese.
- labels_2: string, transcription produced by Scrya/whisper-large-v2-cantonese.
- channel: string.
- title: string.
Splits:
- train: training split with 516022 examples, total size 189408828505.48 bytes.
Download size: 185225339025 bytes.
Dataset size: 189408828505.48 bytes.
Configs:
- default: data file path is data/train-*.
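The figures above allow a quick sanity check against the 30-second cap and the "2000+ hours" claim stated in the dataset description. The sketch below is plain arithmetic on the published numbers; the variable names are illustrative and nothing here is measured from the audio itself.

```python
NUM_EXAMPLES = 516_022            # train split size from the metadata
DATASET_BYTES = 189_408_828_505.48
MAX_CLIP_SECONDS = 30             # stated truncation limit
STATED_MIN_HOURS = 2000           # "2000+ hours" from the description

# Average storage per example, in bytes.
mean_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

# Upper bound on total audio if every clip were exactly 30 seconds long.
max_total_hours = NUM_EXAMPLES * MAX_CLIP_SECONDS / 3600

# Lower bound on the mean clip length implied by at least 2000 hours.
min_mean_clip_seconds = STATED_MIN_HOURS * 3600 / NUM_EXAMPLES

print(f"mean bytes per example: {mean_bytes_per_example:,.0f}")
print(f"upper bound on hours:   {max_total_hours:,.1f}")
print(f"implied mean clip (s):  {min_mean_clip_seconds:.2f}")
```

The numbers are mutually consistent: the cap allows at most about 4300 hours, so 2000+ hours implies an average clip of at least about 14 seconds, at roughly 367 kB per example.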

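Dataset description

The dataset contains over 2000 hours of Cantonese audio from YouTube.
The audio is transcribed with simonl0909/whisper-large-v2-cantonese and Scrya/whisper-large-v2-cantonese, using alvanlii/whisper-small-cantonese for speculative decoding.
All audio files are truncated to a maximum of 30 seconds.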
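Because each clip carries two machine-generated transcripts, one possible use is to keep only the clips on which the two models largely agree. The dataset card does not prescribe this; the sketch below is an assumption-laden illustration that scores agreement with the standard library's `difflib.SequenceMatcher`. The function names and the 0.9 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher


def label_agreement(labels_1, labels_2):
    """Character-level similarity in [0, 1] between two transcripts."""
    # Drop whitespace so spacing differences do not count as disagreement.
    a = "".join(labels_1.split())
    b = "".join(labels_2.split())
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def keep_agreeing(examples, threshold=0.9):
    """Keep examples whose two pseudo-labels are at least `threshold` similar."""
    # The threshold is an arbitrary illustrative value, not a recommendation.
    return [
        example
        for example in examples
        if label_agreement(example["labels_1"], example["labels_2"]) >= threshold
    ]
```

A stricter metric, such as character error rate, could be swapped in; the similarity ratio is used here only because it needs no extra dependencies.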

To-dos

[ ] Split audio by speaker.
[ ] Add more data.
