cantonese-youtube
Source: ModelScope community; updated 2025-11-06; indexed 2025-03-15
Download link:
https://modelscope.cn/datasets/pengzhendong/cantonese-youtube
Description:
## Cantonese Youtube Pseudo-Transcription Dataset
- Contains approximately 10k hours of audio sourced from YouTube
- Videos are chosen at random and scraped channel by channel
- Covers news, vlogs, entertainment, stories, and health content
- Columns
  - `transcript_whisper`: Transcribed using `Scrya/whisper-large-v2-cantonese` with `alvanlii/whisper-small-cantonese` for speculative decoding
  - `transcript_sensevoice`: Transcribed using `FunAudioLLM/SenseVoiceSmall`
    - Converted to Traditional Chinese using [OpenCC](https://github.com/BYVoid/OpenCC)
    - Event tags are isolated into `event_sensevoice`
    - Emotion tags are isolated into `emotion_sensevoice`
  - `snr`: Signal-to-noise ratio, extracted from `ylacombe/brouhaha-best`
  - `c50`: Speech clarity, extracted from `ylacombe/brouhaha-best`
  - `emotion`: Emotion, extracted from `emotion2vec/emotion2vec_plus_large`
  - Note that `id` does not reflect the ordering of the audio within the same video
- Processing
  - The full audio is split using [WhisperX](https://github.com/m-bain/whisperX) with `Scrya/whisper-large-v2-cantonese`
    - The audio is split into chunks shorter than 30 seconds and by speaker
  - Preliminary filtering removes phrases such as:
    - `like/subscribe to YouTube channel`
    - `subtitles by [xxxx]`
  - Additional filtering is recommended for your own use
- Note: An earlier version of the dataset contained duplicated data. I recommend re-downloading it if you downloaded it before November 7, 2024
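
Since additional filtering is recommended, the following is a minimal sketch of what such a pass could look like, not the author's own pipeline. It uses the column names listed above (`transcript_whisper`, `transcript_sensevoice`, `snr`, `c50`). The thresholds, the regular expressions, and the helper names `is_boilerplate`, `keep_row`, and `filter_rows` are all assumptions made for illustration, and rows are assumed to be plain dictionaries.

```python
import re

# Hypothetical thresholds: the dataset card does not prescribe any values.
# They are compared directly against the `snr` and `c50` columns.
MIN_SNR = 15.0
MIN_C50 = 20.0

# Hypothetical patterns in the spirit of the card's two examples
# ("like/subscribe to YouTube channel", "subtitles by [xxxx]").
BOILERPLATE_PATTERNS = [
    re.compile(r"訂閱|订阅|subscribe", re.IGNORECASE),
    re.compile(r"字幕由|subtitles? by", re.IGNORECASE),
]


def is_boilerplate(text: str) -> bool:
    """Return True if the text matches any of the boilerplate patterns."""
    return any(pattern.search(text) for pattern in BOILERPLATE_PATTERNS)


def keep_row(row: dict) -> bool:
    """Keep a row only if it passes the quality and boilerplate checks."""
    if row["snr"] < MIN_SNR or row["c50"] < MIN_C50:
        return False
    transcripts = (row["transcript_whisper"], row["transcript_sensevoice"])
    # Drop rows where either transcript is empty or looks like boilerplate.
    if any(not t.strip() or is_boilerplate(t) for t in transcripts):
        return False
    return True


def filter_rows(rows):
    """Return the rows that pass keep_row, preserving their order."""
    return [row for row in rows if keep_row(row)]
```

The thresholds and patterns are worth tuning on a sample of the data before applying them to the full set, since overly broad patterns will also discard legitimate speech.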
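
The `event_sensevoice` and `emotion_sensevoice` columns come from tags that were separated out of the SenseVoice output. The sketch below shows one way such a split could be done; it is not the author's actual code. It assumes the raw output carries tags of the form `<|TAG|>` before the text, and the two label sets are assumptions that should be checked against the SenseVoice version in use. The function name `split_sensevoice_output` and the list-valued return fields are hypothetical.

```python
import re

# Assumed label sets; verify them against the SenseVoice release you use.
EMOTION_TAGS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}
EVENT_TAGS = {"Speech", "BGM", "Applause", "Laughter", "Cry", "Sneeze", "Breath", "Cough"}

# Matches one tag of the assumed form <|TAG|> and captures TAG.
TAG_RE = re.compile(r"<\|([^|<>]+)\|>")


def split_sensevoice_output(raw: str) -> dict:
    """Separate event and emotion tags from the transcribed text."""
    tags = TAG_RE.findall(raw)
    text = TAG_RE.sub("", raw).strip()
    return {
        "transcript_sensevoice": text,
        "event_sensevoice": [t for t in tags if t in EVENT_TAGS],
        "emotion_sensevoice": [t for t in tags if t in EMOTION_TAGS],
    }
```

Tags that belong to neither set, such as a language tag, are simply removed from the text in this sketch.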
Provider:
maas
Created:
2025-03-12



