mixed_cantonese_and_english_speech

Name: mixed_cantonese_and_english_speech
Creator: maas
Published: 2025-11-05 13:45:53
License: No description available

ModelScope community; updated 2025-11-05; indexed 2025-03-08

Download link:

https://modelscope.cn/datasets/pengzhendong/mixed_cantonese_and_english_speech

Resource overview:

The Mixed Cantonese and English (MCE) dataset covers 18 topics related to daily life and comprises 34.8 hours of audio in total. The corresponding annotated text consists of 307,540 Chinese characters and 70,132 English words. Among the topics, "Food" has the highest frequency of English words, with a Chinese-character-to-English-word ratio of approximately 3:1, while "Tech News" has the lowest, with a ratio of approximately 8:1. All audio files were randomly sampled and divided into training and testing sets in a 9:1 ratio. The resulting training set contains 31.3 hours of speech, and the distribution of topics in the training and testing sets is relatively consistent. Most audio files contain only one segment of speech. File durations are concentrated in the 5-12 second range, and the longest file is 28 seconds, so most large-scale speech recognition models should not need additional audio segmentation. During recording, all volunteers reproduced their habitual speaking speed, intonation, and other speaking habits from daily life. Volunteers with both fast and slow speech rates were selected; faster speech may be harder to recognize accurately because of increased assimilation or pronunciation inaccuracies. Source: https://github.com/Shelton1013/Whisper_MCE
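
The 9:1 random split described above can be sketched as follows. This is only an illustration under stated assumptions, not the dataset authors' actual procedure: the file names, the seed, and the helper `split_train_test` are hypothetical, and the split here is by file count, whereas the description reports its sizes in hours.

```python
import random


def split_train_test(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names; the real naming scheme is not given in this listing.
files = [f"utt_{i:05d}.wav" for i in range(1000)]
train, test = split_train_test(files)

# Sanity check against the figures quoted in the description:
# 31.3 training hours out of 34.8 total hours.
total_hours, train_hours = 34.8, 31.3
train_share = train_hours / total_hours

print(len(train), len(test))
print(f"training share of audio hours: {train_share:.3f}")
```

The reported 31.3 of 34.8 hours is close to a 0.9 share, which is consistent with a 9:1 split by file when durations are similar across files.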

Provider:

maas

Created:

2025-03-05

Collection summary

Dataset introduction