WenetSpeech-Chuan
ModelScope community. Updated 2026-05-21; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/pengzhendong/WenetSpeech-Chuan
Description:
<h1 align="center">
WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus With Rich Annotation For Dialectal Speech Processing
</h1>
<p align="center">
Yuhang Dai<sup>1</sup><sup>,*</sup>, Ziyu Zhang<sup>1</sup><sup>,*</sup>, Shuai Wang<sup>4</sup><sup>,5</sup>,
Longhao Li<sup>1</sup>, Zhao Guo<sup>1</sup>, Tianlun Zuo<sup>1</sup>,
Shuiyuan Wang<sup>1</sup>, Hongfei Xue<sup>1</sup>, Chengyou Wang<sup>1</sup>,
Qing Wang<sup>3</sup>, Xin Xu<sup>2</sup>, Hui Bu<sup>2</sup>, Jie Li<sup>3</sup>,
Jian Kang<sup>3</sup>, Binbin Zhang<sup>5</sup>, Lei Xie<sup>1</sup><sup>,†</sup>
</p>
<p align="center">
<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University <br>
<sup>2</sup> Beijing AISHELL Technology Co., Ltd. <br>
<sup>3</sup> Institute of Artificial Intelligence (TeleAI), China Telecom <br>
<sup>4</sup> School of Intelligence Science and Technology, Nanjing University <br>
<sup>5</sup> WeNet Open Source Community <br>
</p>
<p align="center">
📑 <a href="https://arxiv.org/abs/2509.18004">Paper</a>    |   
🐙 <a href="https://github.com/ASLP-lab/WenetSpeech-Chuan">GitHub</a>    |   
🤗 <a href="https://huggingface.co/collections/ASLP-lab/wenetspeech-chuan-68bade9d02bcb1faece65bda">HuggingFace</a>
<br>
🎤 <a href="https://aslp-lab.github.io/WenetSpeech-Chuan/">Demo Page</a>    |   
💬 <a href="https://github.com/ASLP-lab/WenetSpeech-Chuan?tab=readme-ov-file#contact">Contact Us</a>
</p>
<div align="center">
<img width="800px" src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/logo/WenetSpeech-Chuan-Logo.png?raw=true" />
</div>
## Dataset
### WenetSpeech-Chuan Overview
* Contains 10,000 hours of Chuan-Yu dialect speech with rich annotations, making it the largest open-source resource for Chuan-Yu dialect speech research.
* Stores metadata in a single JSON file, including audio path, duration, text confidence, speaker identity, SNR, DNSMOS, age, gender, and character-level timestamps. Additional metadata tags may be added in the future.
* Covers ten domains, including short videos, entertainment, live streams, documentaries, audiobooks, dramas, interviews, news, and others.
<div align="center">
<img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/figs/domain.png?raw=true" width="300" style="display:inline-block; margin-right:10px;" />
<img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/figs/quality_distribution.jpg?raw=true" width="300" style="display:inline-block;" />
</div>
### Metadata Format
We store all audio metadata in a standardized JSON format. The fields are grouped as follows:
* Core fields: `utt_id` (unique identifier for each audio segment), `rover_result` (ROVER result of three ASR transcriptions), `confidence` (confidence score of text transcription), `jyutping_confidence` (confidence score of Cantonese pinyin transcriptions), and `duration` (audio duration).
* Speaker attributes: `speaker_id`, `gender`, and `age`.
* Audio quality assessment metrics: `sample_rate`, `DNSMOS`, and `SNR`.
* Timestamp information: `timestamp` (records segment boundaries with `start` and `end`).
* Extended metadata under the `meta_info` field: `program` (program name), `region` (geographical information), `link` (original content link), and `domain` (domain classification).
#### 📂 Content Tree
```
WenetSpeech-Chuan
├── metadata.jsonl
│
├── audio_labels/
│ ├── wav_utt_id.jsonl
│ ├── wav_utt_id.jsonl
│ ├── ...
│ └── wav_utt_id.jsonl
│
├── .gitattributes
└── README.md
```
#### Data sample(EN):
###### metadata.jsonl
```
{
"utt_id": Original long audio ID,
"wav_utt_id": Converted long audio ID after transforming to WAV format,
"source_audio_path": Path to the original long audio file,
"audio_labels": Path to the label file of short audio segments cut from the converted long audio,
"url": Download link for the original long audio
}
```
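As a minimal sketch of how the `metadata.jsonl` records could be used, the snippet below builds a lookup from each `wav_utt_id` to its label file. It assumes the real file holds one valid JSON object per line and that `audio_labels` is a relative path; the function name `index_label_files` and the demo records are hypothetical and not part of the dataset tooling.

```python
import json
from pathlib import PurePosixPath


def index_label_files(metadata_lines):
    """Map each wav_utt_id to the label-file path given in its metadata record."""
    index = {}
    for line in metadata_lines:
        if line.strip():  # skip blank lines
            rec = json.loads(line)
            index[rec["wav_utt_id"]] = PurePosixPath(rec["audio_labels"])
    return index


# Hypothetical records for illustration only (not real corpus entries).
demo_lines = [
    '{"utt_id": "src_a", "wav_utt_id": "wav_a", "audio_labels": "audio_labels/wav_a.jsonl"}',
    '{"utt_id": "src_b", "wav_utt_id": "wav_b", "audio_labels": "audio_labels/wav_b.jsonl"}',
]
index = index_label_files(demo_lines)
print(index["wav_a"].name)
```

With a real download, the same function can be fed an open file handle, for example `open("metadata.jsonl", encoding="utf-8")`.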
###### audio_labels/wav_utt_id.jsonl:
```
{
"wav_utt_id_timestamp": Short audio segment ID, composed of the converted long audio ID + timestamp information (type: str),
"wav_utt_id_timestamp_path": Path to the short audio data (type: str),
"audio_clip_id": Sequence number of this short segment within the long audio,
"timestamp": Timestamp information,
"wvmos_score": WVMOS score, measuring the quality of the audio segment (type: float),
"text": Transcript of the audio segment corresponding to the timestamp (type: str),
"text_punc": Transcript with punctuation (type: str),
"spk_num": Number of speakers in the audio segment, single/multi (type: str),
"confidence": Confidence score of the transcript (type: float),
"emotion": Speaker’s emotion label (type: str, e.g., anger),
"age": Speaker’s age label (type: int range, e.g., middle-aged (36–59)),
"gender": Speaker’s gender label (type: str, e.g., male/female)
}
```
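The sample above describes each field rather than showing literal values. As a hedged sketch, assuming the actual label files contain one valid JSON object per line, the following code reads such lines and keeps clips that pass simple quality thresholds on `confidence`, `wvmos_score`, and `spk_num`. The helper names, the threshold values, and the demo lines are illustrative assumptions, not recommendations from the dataset authors.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def select_clips(records, min_confidence=0.9, min_wvmos=3.0, single_speaker_only=True):
    """Keep clips whose labels pass the given thresholds (values are illustrative)."""
    selected = []
    for rec in records:
        if rec.get("confidence", 0.0) < min_confidence:
            continue
        if rec.get("wvmos_score", 0.0) < min_wvmos:
            continue
        if single_speaker_only and rec.get("spk_num") != "single":
            continue
        selected.append(rec)
    return selected


# Hypothetical label lines for illustration only (not real corpus entries).
sample_lines = [
    '{"wav_utt_id_timestamp": "demo_0001", "confidence": 0.95, "wvmos_score": 3.4, "spk_num": "single"}',
    '{"wav_utt_id_timestamp": "demo_0002", "confidence": 0.60, "wvmos_score": 3.9, "spk_num": "single"}',
    '{"wav_utt_id_timestamp": "demo_0003", "confidence": 0.97, "wvmos_score": 3.1, "spk_num": "multi"}',
]
kept = select_clips(iter_jsonl(sample_lines))
print([rec["wav_utt_id_timestamp"] for rec in kept])  # ['demo_0001']
```

For a real label file, pass an open file handle to `iter_jsonl` instead of the demo list, and tune the thresholds to the task at hand.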
### WenetSpeech-Chuan Usage
You can retrieve the original source audio or video through the link field in the metadata file `metadata.jsonl` (named `url` in the sample above, and `link` under `meta_info` in the field description). Then segment the audio according to the `timestamp` field to extract each clip. For pre-processed audio data, please contact us using the information provided below.
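The segmentation step can be sketched with Python's standard `wave` module. This is an illustration under stated assumptions: the source audio has already been downloaded and converted to PCM WAV, and the segment boundaries are available as `start` and `end` values in seconds (the unit is not specified in this README). The function `slice_wav` is a hypothetical helper, not part of any released tooling.

```python
import io
import wave


def slice_wav(src, dst, start_s: float, end_s: float) -> int:
    """Copy the [start_s, end_s) span of a PCM WAV into dst; return frames written.

    src and dst may be file paths or binary file-like objects.
    Boundaries are assumed to be in seconds and are clamped to the file length.
    """
    with wave.open(src, "rb") as reader:
        rate = reader.getframerate()
        total = reader.getnframes()
        first = max(0, min(total, int(round(start_s * rate))))
        last = max(first, min(total, int(round(end_s * rate))))
        reader.setpos(first)
        frames = reader.readframes(last - first)
        with wave.open(dst, "wb") as writer:
            writer.setnchannels(reader.getnchannels())
            writer.setsampwidth(reader.getsampwidth())
            writer.setframerate(rate)
            writer.writeframes(frames)
    return last - first


# Demo on one second of synthetic silence (16 kHz, mono, 16-bit), held in memory.
source = io.BytesIO()
with wave.open(source, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
source.seek(0)

clip = io.BytesIO()
written = slice_wav(source, clip, start_s=0.25, end_s=0.75)
print(written)  # 8000
```

Clamping the boundaries keeps the sketch from failing on timestamps that run slightly past the end of a file; whether that is the right policy for the corpus is a design choice left to the user.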
## Contact
If you have any questions or would like to collaborate, feel free to reach out to our research team via email: yhdai@mail.nwpu.edu.cn or ziyu_zhang@mail.nwpu.edu.cn.
You’re also welcome to join our WeChat group for technical discussions, updates, and, as mentioned above, access to pre-processed audio data.
<p align="center">
<img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/raw/main/src/figs/wechat_2.png" width="300" alt="WeChat Group QR Code"/>
<em>Scan to join our WeChat discussion group</em>
</p>
<p align="center">
<img src="https://github.com/ASLP-lab/WenetSpeech-Yue/raw/main/figs/npu@aslp.jpeg" width="300" alt="Official Account QR Code"/>
</p>
Provider:
maas
Created:
2025-10-23
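Dataset introduction
Background overview
WenetSpeech-Chuan is a large-scale Sichuanese (Chuan-Yu dialect) corpus for dialectal speech processing. It contains 10,000 hours of speech and is currently the largest open-source Chuan-Yu dialect resource. The dataset provides rich annotations, including metadata such as audio path, duration, text confidence, speaker identity, audio quality metrics, and character-level timestamps. It covers ten domains, such as short videos, entertainment, and live streams, and supports a wide range of speech research.
The summary above was collected and generated by 遇见数据集.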



