cantonese-youtube

Name: cantonese-youtube
Creator: maas
Published: 2025-11-06 06:13:20
License: No description available

ModelScope community. Updated 2025-11-06; indexed 2025-03-15.

Download link:

https://modelscope.cn/datasets/pengzhendong/cantonese-youtube


Description:

## Cantonese Youtube Pseudo-Transcription Dataset

- Contains approximately 10k hours of audio sourced from YouTube
  - Videos are chosen at random, and scraped on a channel basis
  - Includes news, vlogs, entertainment, stories, health
- Columns
  - `transcript_whisper`: Transcribed using `Scrya/whisper-large-v2-cantonese` with `alvanlii/whisper-small-cantonese` for speculative decoding
  - `transcript_sensevoice`: Transcribed using `FunAudioLLM/SenseVoiceSmall`
    - used [OpenCC](https://github.com/BYVoid/OpenCC) to convert to Traditional Chinese
    - isolated event tags to `event_sensevoice`
    - isolated emotion tags to `emotion_sensevoice`
  - `snr`: Signal-to-noise ratio, extracted from `ylacombe/brouhaha-best`
  - `c50`: Speech clarity, extracted from `ylacombe/brouhaha-best`
  - `emotion`: Emotion, extracted from `emotion2vec/emotion2vec_plus_large`
  - Note that `id` does not reflect the ordering of the audio within the same video
- Processing
  - The full audio is split using [WhisperX](https://github.com/m-bain/whisperX), using `Scrya/whisper-large-v2-cantonese`
    - it is split in <30s chunks and according to speakers
  - Preliminary filtering includes filtering out phrases like:
    - `like/subscribe to YouTube channel`
    - `subtitles by [xxxx]`
  - Additional filtering is recommended for your own use
- Note: An earlier version of the dataset has duplicated data. I recommend re-downloading it if you downloaded it before Nov-7-2024
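Since the description recommends additional filtering, here is a minimal sketch of what such a pass might look like. It assumes each row is available as a plain mapping keyed by the column names listed above (`transcript_whisper`, `transcript_sensevoice`, `snr`, `c50`). The regular expressions and the `MIN_SNR` / `MIN_C50` thresholds are illustrative assumptions, not values used or recommended by the dataset author, and the helper names `is_clean` and `filter_rows` are hypothetical.

```python
import re
from typing import Any, Iterable, Iterator, Mapping

# Hypothetical patterns for leftover "subscribe" / "subtitles by" boilerplate.
# These are illustrative only; extend or replace them for your own data.
BOILERPLATE_PATTERNS = [
    re.compile(r"訂閱|subscribe", re.IGNORECASE),
    re.compile(r"字幕由|subtitles?\s+by", re.IGNORECASE),
]

# Assumed quality thresholds (not taken from the dataset card).
MIN_SNR = 20.0
MIN_C50 = 20.0


def is_clean(
    row: Mapping[str, Any],
    min_snr: float = MIN_SNR,
    min_c50: float = MIN_C50,
) -> bool:
    """Return True if a row passes the text and audio-quality checks."""
    texts = [
        row.get("transcript_whisper") or "",
        row.get("transcript_sensevoice") or "",
    ]
    # Drop rows where both pseudo-transcripts are empty.
    if not any(text.strip() for text in texts):
        return False
    # Drop rows where either transcript matches a boilerplate pattern.
    if any(p.search(text) for p in BOILERPLATE_PATTERNS for text in texts):
        return False
    # Drop rows with missing or low quality estimates.
    snr, c50 = row.get("snr"), row.get("c50")
    if snr is None or c50 is None:
        return False
    return snr >= min_snr and c50 >= min_c50


def filter_rows(rows: Iterable[Mapping[str, Any]]) -> Iterator[Mapping[str, Any]]:
    """Lazily yield only the rows that pass is_clean."""
    return (row for row in rows if is_clean(row))
```

If the data happens to be loaded as a Hugging Face `datasets.Dataset`, the same predicate could be passed to `Dataset.filter`. Suitable thresholds depend on the downstream task, so it is worth inspecting the `snr` and `c50` distributions before fixing them.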


Provider:

maas

Created:

2025-03-12
