WenetSpeech-Chuan

Name: WenetSpeech-Chuan
Creator: maas
Published: 2026-05-21 12:50:59
License: No description available

ModelScope Community · Updated 2026-05-21 · Indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/pengzhendong/WenetSpeech-Chuan


Resource description:

<h1 align="center"> WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus With Rich Annotation For Dialectal Speech Processing </h1>

Yuhang Dai<sup>1,*</sup>, Ziyu Zhang<sup>1,*</sup>, Shuai Wang<sup>4,5</sup>, Longhao Li<sup>1</sup>, Zhao Guo<sup>1</sup>, Tianlun Zuo<sup>1</sup>, Shuiyuan Wang<sup>1</sup>, Hongfei Xue<sup>1</sup>, Chengyou Wang<sup>1</sup>, Qing Wang<sup>3</sup>, Xin Xu<sup>2</sup>, Hui Bu<sup>2</sup>, Jie Li<sup>3</sup>, Jian Kang<sup>3</sup>, Binbin Zhang<sup>5</sup>, Lei Xie<sup>1,╀</sup>

<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University <br>
<sup>2</sup> Beijing AISHELL Technology Co., Ltd. <br>
<sup>3</sup> Institute of Artificial Intelligence (TeleAI), China Telecom <br>
<sup>4</sup> School of Intelligence Science and Technology, Nanjing University <br>
<sup>5</sup> WeNet Open Source Community

📑 <a href="https://arxiv.org/abs/2509.18004">Paper</a> &nbsp;&nbsp;|&nbsp;&nbsp; 🐙 <a href="https://github.com/ASLP-lab/WenetSpeech-Chuan">GitHub</a> &nbsp;&nbsp;|&nbsp;&nbsp; 🤗 <a href="https://huggingface.co/collections/ASLP-lab/wenetspeech-chuan-68bade9d02bcb1faece65bda">HuggingFace</a> <br>
🎤 <a href="https://aslp-lab.github.io/WenetSpeech-Chuan/">Demo Page</a> &nbsp;&nbsp;|&nbsp;&nbsp; 💬 <a href="https://github.com/ASLP-lab/WenetSpeech-Chuan?tab=readme-ov-file#contact">Contact Us</a>

<div align="center"> <img width="800px" src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/logo/WenetSpeech-Chuan-Logo.png?raw=true" /> </div>

## Dataset

### WenetSpeech-Chuan Overview

* Contains 10,000 hours of richly annotated Chuan-Yu dialect speech, the largest open-source resource for Chuan-Yu dialect speech research.
* Stores metadata in a single JSON file, including audio path, duration, text confidence, speaker identity, SNR, DNSMOS, age, gender, and character-level timestamps. Additional metadata tags may be added in the future.
* Covers ten domains: Short videos, Entertainment, Live streams, Documentary, Audiobook, Drama, Interview, News, and others.

<div align="center"> <img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/figs/domain.png?raw=true" width="300" style="display:inline-block; margin-right:10px;" /> <img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/figs/quality_distribution.jpg?raw=true" width="300" style="display:inline-block;" /> </div>

### Metadata Format

We store all audio metadata in a standardized JSON format. The fields are grouped as follows:

* Core fields: `utt_id` (unique identifier for each audio segment), `rover_result` (ROVER result of three ASR transcriptions), `confidence` (confidence score of the text transcription), `jyutping_confidence` (confidence score of Cantonese pinyin transcriptions), and `duration` (audio duration).
* Speaker attributes: `speaker_id`, `gender`, and `age`.
* Audio quality assessment metrics: `sample_rate`, `DNSMOS`, and `SNR`.
* Timestamp information: `timestamp`, which records segment boundaries with `start` and `end`.
* Extended metadata under the `meta_info` field: `program` (program name), `region` (geographical information), `link` (original content link), and `domain` (domain classification).

#### 📂 Content Tree

```
WenetSpeech-Chuan
├── metadata.jsonl
│
├── audio_labels/
│   ├── wav_utt_id.jsonl
│   ├── wav_utt_id.jsonl
│   ├── ...
│   └── wav_utt_id.jsonl
│
├── .gitattributes
└── README.md
```

#### Data sample

###### metadata.jsonl

```
{
  "utt_id": Original long audio ID,
  "wav_utt_id": Converted long audio ID after transforming to WAV format,
  "source_audio_path": Path to the original long audio file,
  "audio_labels": Path to the label file of short audio segments cut from the converted long audio,
  "url": Download link for the original long audio
}
```

###### audio_labels/wav_utt_id.jsonl

```
{
  "wav_utt_id_timestamp": Short audio segment ID, composed of the converted long audio ID + timestamp information (type: str),
  "wav_utt_id_timestamp_path": Path to the short audio data (type: str),
  "audio_clip_id": Sequence number of this short segment within the long audio,
  "timestamp": Timestamp information,
  "wvmos_score": WVMOS score, measuring the quality of the audio segment (type: float),
  "text": Transcript of the audio segment corresponding to the timestamp (type: str),
  "text_punc": Transcript with punctuation (type: str),
  "spk_num": Number of speakers in the audio segment, single/multi (type: str),
  "confidence": Confidence score of the transcript (type: float),
  "emotion": Speaker's emotion label (type: str, e.g., anger),
  "age": Speaker's age label (type: int range, e.g., middle-aged (36–59)),
  "gender": Speaker's gender label (type: str, e.g., male/female)
}
```

### WenetSpeech-Chuan Usage

You can obtain the original video source through the `link` field in the metadata file (`metadata.json`). Segment the audio according to the `timestamps` field to extract the corresponding segment. For pre-processed audio data, please contact us using the information provided below.

## Contact

If you have any questions or would like to collaborate, feel free to reach out to our research team via email: yhdai@mail.nwpu.edu.cn or ziyu_zhang@mail.nwpu.edu.cn.

You're also welcome to join our WeChat group for technical discussions, updates, and, as mentioned above, access to pre-processed audio data.

<img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/raw/main/src/figs/wechat_2.png" width="300" alt="WeChat Group QR Code"/>

Scan to join our WeChat discussion group

<img src="https://github.com/ASLP-lab/WenetSpeech-Yue/raw/main/figs/npu@aslp.jpeg" width="300" alt="Official Account QR Code"/>

Scan to follow our official account
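As a rough illustration of the usage note (segmenting long audio by timestamp), the sketch below reads label records in the `audio_labels/wav_utt_id.jsonl` layout, keeps single-speaker segments above a quality threshold, and converts each segment's boundaries to sample indices. It rests on several assumptions that the README does not confirm: that `timestamp` is an object with `start` and `end` given in seconds, that the audio is sampled at 16 kHz, and that the thresholds shown are sensible. The helper names `iter_labels`, `keep_segment`, and `sample_range` are hypothetical and not part of any released tooling.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_labels(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_segment(record: dict,
                 min_confidence: float = 0.9,
                 min_wvmos: float = 3.0) -> bool:
    """Illustrative quality filter; the thresholds are assumptions,
    not values recommended by the dataset authors."""
    return (
        record.get("spk_num") == "single"
        and float(record.get("confidence", 0.0)) >= min_confidence
        and float(record.get("wvmos_score", 0.0)) >= min_wvmos
    )


def sample_range(start_s: float, end_s: float,
                 sample_rate: int) -> Tuple[int, int]:
    """Convert a [start, end) span in seconds to sample indices.
    Assumes timestamps are expressed in seconds."""
    if end_s <= start_s:
        raise ValueError("end must be greater than start")
    return round(start_s * sample_rate), round(end_s * sample_rate)


# Two made-up records in the documented field layout (not real data).
example_lines = [
    '{"wav_utt_id_timestamp": "demo_0", "timestamp": {"start": 1.5, "end": 4.0},'
    ' "confidence": 0.95, "wvmos_score": 3.4, "spk_num": "single"}',
    '{"wav_utt_id_timestamp": "demo_1", "timestamp": {"start": 4.0, "end": 6.0},'
    ' "confidence": 0.62, "wvmos_score": 2.1, "spk_num": "multi"}',
]

kept = [r for r in iter_labels(example_lines) if keep_segment(r)]
for r in kept:
    ts = r["timestamp"]
    # Slice the decoded long-audio array with these indices to cut the clip.
    print(r["wav_utt_id_timestamp"], sample_range(ts["start"], ts["end"], 16000))
```

With a real label file, `iter_labels(open(path, encoding="utf-8"))` would take the place of the in-memory list. If the actual `timestamp` values use another unit or layout, only `sample_range` and the two lines that read `ts` would need to change.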


Provider:

maas

Created:

2025-10-23

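Dataset introduction

Background overview

WenetSpeech-Chuan is a large-scale Sichuanese (Chuan-Yu dialect) corpus for dialectal speech processing. It contains 10,000 hours of speech and is currently the largest open-source Chuan-Yu dialect resource. The dataset provides rich annotations, including metadata such as audio path, duration, text confidence, speaker identity, audio quality metrics, and character-level timestamps. It covers ten domains, among them short videos, entertainment, and live streams, and supports a wide range of speech research.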

The content above was collected and summarized by 遇见数据集.
