
mixed_cantonese_and_english_speech

ModelScope (魔搭社区) · Updated 2025-11-05 · Indexed 2025-03-08
Download link:
https://modelscope.cn/datasets/pengzhendong/mixed_cantonese_and_english_speech
Description:
The Mixed Cantonese and English (MCE) dataset covers 18 topics related to daily life and comprises 34.8 hours of audio. The corresponding annotated text consists of 307,540 Chinese characters and 70,132 English words. Among the topics, "Food" has the highest frequency of English words, with a Chinese-character-to-English-word ratio of approximately 3:1, while "Tech News" has the lowest, with a ratio of approximately 8:1. All audio files were randomly sampled and divided into training and testing sets in a 9:1 ratio. The resulting training set contains 31.3 hours of speech, and the distribution of topics is relatively consistent across the two sets. Most audio files contain only one segment of speech. Durations are concentrated in the 5-12 second range, and the longest file is 28 seconds, so most large-scale speech recognition models need no additional audio segmentation. During recording, all volunteers kept their habitual speaking speed, intonation, and other everyday speaking habits. Volunteers with both fast and slow speech rates were selected; faster speech may be harder to recognize accurately because of increased assimilation or pronunciation inaccuracies. Source: https://github.com/Shelton1013/Whisper_MCE
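The random 9:1 division described above can be illustrated with a minimal Python sketch. It is not the dataset authors' actual procedure: the function name `split_train_test`, the fixed seed, and the `clip_XXXX.wav` file names are all hypothetical, chosen only to show a seeded shuffle followed by a 90/10 cut.

```python
import random


def split_train_test(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of `items` and hold out `test_fraction` of it for testing.

    Hypothetical helper for illustration; the dataset's real split
    procedure is not documented beyond "random, 9:1".
    """
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for the real audio files.
files = [f"clip_{i:04d}.wav" for i in range(1000)]
train, test = split_train_test(files)
print(len(train), len(test))  # 900 100
```

Note that a split by file count only approximates a split by duration: 90% of 34.8 hours is about 31.3 hours, which matches the reported training-set size because clip lengths are fairly uniform.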

Provider:
maas
Created:
2025-03-05
Collected summary
Dataset introduction
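Background and challenges
Background overview
This dataset contains mixed Cantonese and English speech covering 18 daily-life topics, with a total duration of 34.8 hours and annotated text comprising roughly 307,500 Chinese characters and 70,100 English words. Audio durations are concentrated in the 5-12 second range, the training and test sets are split 9:1, and the recordings reflect the speakers' natural speaking habits.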
The above content was collected and summarized by 遇见数据集.