cantonese-radio
Source: ModelScope community (魔搭社区), updated 2025-11-06, indexed 2025-03-15
Download link:
https://modelscope.cn/datasets/pengzhendong/cantonese-radio
Resource overview:
## Cantonese Radio Pseudo-Transcription Dataset
- Contains 14k hours of audio sourced from Archive.org
- Columns
  - `order_index`: Position of the segment among the segments cut from the same `filename`
  - `link`: Link to the original full audio
  - `transcript_whisper`: Transcribed using `Scrya/whisper-large-v2-cantonese`, with `alvanlii/whisper-small-cantonese` for speculative decoding
  - `transcript_sensevoice`: Transcribed using `FunAudioLLM/SenseVoiceSmall`
    - Converted to Traditional Chinese with [OpenCC](https://github.com/BYVoid/OpenCC)
    - Event tags isolated into `event_sensevoice`
    - Emotion tags isolated into `emotion_sensevoice`
  - `snr`: Signal-to-noise ratio, extracted with `ylacombe/brouhaha-best`
  - `c50`: Speech clarity, extracted with `ylacombe/brouhaha-best`
  - `emotion`: Emotion, extracted with `emotion2vec/emotion2vec_plus_large`
  - Note that `id` does not reflect the ordering of the audio within the same video
- Processing
  - The full audio is split using [WhisperX](https://github.com/m-bain/whisperX) with `Scrya/whisper-large-v2-cantonese`
    - It is split into chunks shorter than 30 s, and by speaker
  - No filtering or additional audio processing was done for this dataset
    - Filtering is recommended for your own use
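
Since the card recommends filtering but does not prescribe a method, the following is only a minimal sketch of one possible approach, not the dataset author's procedure. It assumes each row has already been loaded as a plain Python dict with the columns listed above. The thresholds (`MIN_SNR`, `MIN_C50`, `MAX_DISAGREEMENT`) and the helper names are hypothetical and should be tuned on your own sample. The idea is to keep segments whose acoustic scores are high enough and whose two pseudo-transcripts roughly agree, then restore per-recording order with `order_index` rather than `id`.

```python
from typing import Iterable

# Hypothetical thresholds (assumptions, not from the dataset card).
MIN_SNR = 15.0
MIN_C50 = 20.0
MAX_DISAGREEMENT = 0.3


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def disagreement(a: str, b: str) -> float:
    """Edit distance normalised by the longer transcript (0.0 = identical)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def keep(row: dict) -> bool:
    """Keep a segment that scores well acoustically and whose transcripts agree."""
    return (
        row["snr"] >= MIN_SNR
        and row["c50"] >= MIN_C50
        and disagreement(row["transcript_whisper"], row["transcript_sensevoice"])
        <= MAX_DISAGREEMENT
    )


def filter_and_order(rows: Iterable[dict]) -> list[dict]:
    """Filter segments, then sort by source file and position within it."""
    kept = [r for r in rows if keep(r)]
    return sorted(kept, key=lambda r: (r["filename"], r["order_index"]))
```

Agreement between two independent ASR systems is a common heuristic for pseudo-label quality, but it will also discard segments where both models are plausible yet differ in wording, so the disagreement threshold is worth checking by hand on a few examples.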
Provider:
maas
Created:
2025-03-12
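Summary
Background overview
This is a Cantonese radio pseudo-transcription dataset containing about 14,000 hours of audio obtained from Archive.org, with a total size of about 589.10 GB. It provides transcripts generated with Whisper and SenseVoice models, along with auxiliary information such as signal-to-noise ratio, speech clarity, and emotion labels. The audio has been split by speaker into segments shorter than 30 seconds.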
The summary above was collected and generated by 遇见数据集.



