
WenetSpeech-Yue

ModelScope community · updated 2026-05-15 · indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/pengzhendong/WenetSpeech-Yue
Resource description:
# WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation

<p align="center">
Longhao Li<sup>1</sup>*, Zhao Guo<sup>1</sup>*, Hongjie Chen<sup>2</sup>, Yuhang Dai<sup>1</sup>, Ziyu Zhang<sup>1</sup>, Hongfei Xue<sup>1</sup>, Tianlun Zuo<sup>1</sup>, Chengyou Wang<sup>1</sup>, Shuiyuan Wang<sup>1</sup>, Xin Xu<sup>3</sup>, Hui Bu<sup>3</sup>, Jie Li<sup>2</sup>, Jian Kang<sup>2</sup>, Binbin Zhang<sup>4</sup>, Ruibin Yuan<sup>5</sup>, Ziya Zhou<sup>5</sup>, Wei Xue<sup>5</sup>, Lei Xie<sup>1</sup>
</p>

<p align="center">
<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University <br>
<sup>2</sup> Institute of Artificial Intelligence (TeleAI), China Telecom <br>
<sup>3</sup> Beijing AISHELL Technology Co., Ltd. <br>
<sup>4</sup> WeNet Open Source Community <br>
<sup>5</sup> Hong Kong University of Science and Technology
</p>

<p align="center">
📑 <a href="https://arxiv.org/abs/2509.03959">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp;
🐙 <a href="https://github.com/ASLP-lab/WenetSpeech-Yue">GitHub</a> &nbsp;&nbsp; | &nbsp;&nbsp;
🤗 <a href="https://huggingface.co/collections/ASLP-lab/wenetspeech-yue-68b690d287cde88389e5dba1">HuggingFace</a>
<br>
🖥️ <a href="https://huggingface.co/spaces/ASLP-lab/WenetSpeech-Yue">HuggingFace Space</a> &nbsp;&nbsp; | &nbsp;&nbsp;
🎤 <a href="https://aslp-lab.github.io/WenetSpeech-Yue/">Demo Page</a> &nbsp;&nbsp; | &nbsp;&nbsp;
💬 <a href="https://github.com/ASLP-lab/WenetSpeech-Yue?tab=readme-ov-file#contact">Contact Us</a>
</p>

<div align="center">
<img width="800px" src="https://github.com/ASLP-lab/WenetSpeech-Yue/raw/main/figs/wenetspeech_yue.svg" />
</div>

## Dataset

### WenetSpeech-Yue Overview

* Contains 21,800 hours of Cantonese speech with rich annotations, making it the largest open-source resource for Cantonese speech research.
* Stores metadata in a single JSON file, including audio path, duration, text confidence, speaker identity, SNR, DNSMOS, age, gender, and character-level timestamps. Additional metadata tags may be added in the future.
* Covers ten domains: Storytelling, Entertainment, Drama, Culture, Vlog, Commentary, Education, Podcast, News, and Others.

<div align="center">
<img width="800px" src="https://github.com/ASLP-lab/WenetSpeech-Yue/raw/main/figs/data_distribution.png" />
</div>

### Metadata Format

We store all audio metadata in a standardized JSON format with the following fields:

* Core fields: `utt_id` (unique identifier for each audio segment), `rover_result` (ROVER result of three ASR transcriptions), `confidence` (confidence score of the text transcription), `jyutping_confidence` (confidence score of the Cantonese pinyin transcription), and `duration` (audio duration).
* Speaker attributes: `speaker_id`, `gender`, and `age`.
* Audio quality metrics: `sample_rate`, `DNSMOS`, and `SNR`.
* Timestamp information: `timestamp`, which records segment boundaries with `start` and `end`.
* Extended metadata under the `meta_info` field: `program` (program name), `region` (geographical information), `link` (original content link), and `domain` (domain classification).

Note that the example record below spells several of these names differently: it uses `key` for the identifier, `spk_id`, `sampling_rate`, `timestamps`, and a `time_stamp` entry under `meta_info`. When parsing, rely on the names that actually appear in the metadata file.

JSON example:

```json
{
  "key": "xg0054364_9798410_9801030",
  "rover_result": "人多一齐食咁样先至知味",
  "confidence": 0.879,
  "jyutping_confidence": 0.909,
  "duration": 2.816,
  "meta_info": {
    "region": "Hong Kong",
    "program": "Cantonese radio drama \"I'll Send You Flowers Next Year\" featuring Kathy Chow, Jacob Tsui, and Law Wai-kit. A 2002 production by Radio Television Hong Kong (RTHK).",
    "time_stamp": "9798.410_9801.030",
    "link": "<link>",
    "domain": "Drama"
  },
  "speaker_attributes": {
    "spk_id": "xg0054364_SPEAKER_08",
    "gender": "Male",
    "age": "YOUTH"
  },
  "speech_quality": {
    "sampling_rate": 16000,
    "DNSMOS": 3.2549686431884766,
    "SNR": 25.29012680053711
  },
  "timestamps": [
    [["<eps>", [0.0, 0.26]], ["人", [0.26, 0.48]], ["多", [0.48, 0.64]], ["一", [0.64, 0.74]], ["齐", [0.74, 0.92]]],
    [["食", [0.93, 1.15]], ["<eps>", [1.15, 1.39]], ["咁", [1.39, 1.53]], ["样", [1.52, 1.6]], ["先", [1.6, 1.75]]],
    [["至", [1.75, 1.83]], ["知", [1.83, 2.04]], ["味", [2.04, 2.4]], ["<eps>", [2.4, 2.78]]]
  ]
}
```

### WenetSpeech-Yue Usage

You can obtain the original video source through the `link` field in the metadata file (`wenetspeech_yue_meta.json`). Segment the audio according to the `cut_point` field to extract the corresponding segment. For pre-processed audio data, please contact us using the information provided below.

## Contact

If you have any questions or would like to collaborate, feel free to reach out to our research team via email: lhli@mail.nwpu.edu.cn or gzhao@mail.nwpu.edu.cn

You’re also welcome to join our WeChat group for technical discussions, updates, and, as mentioned above, access to pre-processed audio data.

<p align="center">
<img src="https://github.com/ASLP-lab/WenetSpeech-Yue/raw/main/figs/wechat.jpg" width="300" alt="WeChat Group QR Code"/>
<em>Scan to join our WeChat discussion group</em>
</p>

<p align="center">
<img src="https://github.com/ASLP-lab/WenetSpeech-Yue/raw/main/figs/npu@aslp.jpeg" width="300" alt="Official Account QR Code"/>
</p>
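As a rough illustration of the Metadata Format and Usage sections, the sketch below parses one record with Python's standard `json` module and derives three things from it: the segment boundaries in the source recording, a flat character-level alignment, and a simple quality filter. Several points are assumptions rather than documented behavior. The README refers to a `cut_point` field, but the example record only contains `meta_info.time_stamp`, so the sketch assumes that `time_stamp` holds `start_end` offsets in seconds. The helper names (`segment_bounds`, `char_alignment`, `passes_quality`) and the filter thresholds are hypothetical, and the `program` and `speaker_attributes` entries are left out of the embedded record for brevity.

```python
import json

# One record taken from the README example. The "program" and
# "speaker_attributes" entries are omitted here to keep the sketch short.
RECORD_JSON = r'''
{
  "key": "xg0054364_9798410_9801030",
  "rover_result": "人多一齐食咁样先至知味",
  "confidence": 0.879,
  "jyutping_confidence": 0.909,
  "duration": 2.816,
  "meta_info": {
    "region": "Hong Kong",
    "time_stamp": "9798.410_9801.030",
    "link": "<link>",
    "domain": "Drama"
  },
  "speech_quality": {
    "sampling_rate": 16000,
    "DNSMOS": 3.2549686431884766,
    "SNR": 25.29012680053711
  },
  "timestamps": [
    [["<eps>", [0.0, 0.26]], ["人", [0.26, 0.48]], ["多", [0.48, 0.64]], ["一", [0.64, 0.74]], ["齐", [0.74, 0.92]]],
    [["食", [0.93, 1.15]], ["<eps>", [1.15, 1.39]], ["咁", [1.39, 1.53]], ["样", [1.52, 1.6]], ["先", [1.6, 1.75]]],
    [["至", [1.75, 1.83]], ["知", [1.83, 2.04]], ["味", [2.04, 2.4]], ["<eps>", [2.4, 2.78]]]
  ]
}
'''


def segment_bounds(record):
    """Return (start, end) of the segment within the source recording.

    Assumption: meta_info.time_stamp is "<start>_<end>" in seconds.
    """
    start, end = record["meta_info"]["time_stamp"].split("_")
    return float(start), float(end)


def char_alignment(record, skip="<eps>"):
    """Flatten the nested timestamp groups into (char, start, end) tuples.

    Entries labelled "<eps>" are dropped; they appear to mark non-speech gaps.
    """
    return [
        (token, start, end)
        for group in record["timestamps"]
        for token, (start, end) in group
        if token != skip
    ]


def passes_quality(record, min_conf=0.8, min_dnsmos=3.0, min_snr=20.0):
    """Keep a record only if it clears all three (hypothetical) thresholds."""
    quality = record["speech_quality"]
    return (
        record["confidence"] >= min_conf
        and quality["DNSMOS"] >= min_dnsmos
        and quality["SNR"] >= min_snr
    )


record = json.loads(RECORD_JSON)
start, end = segment_bounds(record)
chars = char_alignment(record)

print(f"{record['key']}: source offsets {start:.3f}s to {end:.3f}s")
print("aligned text:", "".join(token for token, _, _ in chars))
print("passes quality filter:", passes_quality(record))
```

In this record the span implied by `time_stamp` (about 2.62 s) does not equal `duration` (2.816 s), so the parsed offsets are best treated as approximate until the exact meaning of the field is confirmed with the dataset authors. The same functions could be applied to every entry of `wenetspeech_yue_meta.json` once its top-level layout (a list of records, or a mapping keyed by utterance) is known.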

Provider:
maas
Created:
2025-10-23
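Background overview
WenetSpeech-Yue is a large-scale corpus containing 21,800 hours of Cantonese speech. It provides multi-dimensional annotations, covers 10 domains, and is currently the largest open-source resource for Cantonese speech research.
The summary above was collected and generated by 遇见数据集.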