cantonese-radio

Name: cantonese-radio
Creator: maas
Published: 2025-11-06 09:18:38
License: No description available

ModelScope community; updated 2025-11-06; indexed 2025-03-15

Download link:

https://modelscope.cn/datasets/pengzhendong/cantonese-radio

Resource description:

## Cantonese Radio Pseudo-Transcription Dataset

- Contains 14k hours of audio sourced from Archive.org
- Columns
  - `order_index`: Represents the order of the audio compared to those from the same `filename`
  - `link`: Link of the original full audio
  - `transcript_whisper`: Transcribed using `Scrya/whisper-large-v2-cantonese` with `alvanlii/whisper-small-cantonese` for speculative decoding
  - `transcript_sensevoice`: Transcribed using `FunAudioLLM/SenseVoiceSmall`
    - used [OpenCC](https://github.com/BYVoid/OpenCC) to convert to Traditional Chinese
    - isolated event tags to `event_sensevoice`
    - isolated emotion tags to `emotion_sensevoice`
  - `snr`: Signal-to-noise ratio, extracted from `ylacombe/brouhaha-best`
  - `c50`: Speech clarity, extracted from `ylacombe/brouhaha-best`
  - `emotion`: Emotion, extracted from `emotion2vec/emotion2vec_plus_large`
  - Note that `id` does not reflect the ordering of the audio within the same video
- Processing
  - The full audio is split using [WhisperX](https://github.com/m-bain/whisperX), using `Scrya/whisper-large-v2-cantonese`
    - it is split in <30s chunks and according to speakers
  - No filtering or additional audio processing was done for this dataset
  - Filtering is recommended for your own use
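Since the card recommends filtering but does not say how, the following is only a minimal sketch of one possible approach, not the dataset authors' method. It assumes each row is available as a plain Python dict keyed by the column names listed above, that `filename` is itself a column, and that `snr` and `c50` are numeric. The thresholds `MIN_SNR` and `MIN_C50` and the helper names `keep_row` and `group_in_order` are hypothetical and should be tuned or replaced for your own use.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical cut-offs, not taken from the dataset card; tune on your own sample.
MIN_SNR = 15.0
MIN_C50 = 20.0


def keep_row(row: dict, min_snr: float = MIN_SNR, min_c50: float = MIN_C50) -> bool:
    """Keep a chunk only if it has a non-empty Whisper transcript
    and its `snr` and `c50` values both reach the given thresholds."""
    snr = row.get("snr")
    c50 = row.get("c50")
    if snr is None or c50 is None:
        # Missing quality scores: drop the chunk rather than guess.
        return False
    text = (row.get("transcript_whisper") or "").strip()
    return bool(text) and snr >= min_snr and c50 >= min_c50


def group_in_order(rows: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group the kept chunks by `filename` and sort each group by `order_index`,
    since `id` does not reflect the ordering within a recording."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        if keep_row(row):
            grouped[row["filename"]].append(row)
    for chunks in grouped.values():
        chunks.sort(key=lambda r: r["order_index"])
    return dict(grouped)
```

Calling `group_in_order(rows)` on any iterable of row dicts returns a mapping from each `filename` to its surviving chunks in playback order. Comparing `transcript_whisper` with `transcript_sensevoice` would be another plausible filter, but it is not sketched here.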

Provider:

maas

Created:

2025-03-12

Collection summary

Dataset introduction