
WSC-Train

ModelScope community · Updated 2026-01-06 · Indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/ASLP-lab/WSC-Train
Resource description:
<h1 align="center"> WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus With Rich Annotation For Dialectal Speech Processing </h1>

<p align="center">
Yuhang Dai<sup>1</sup><sup>,*</sup>, Ziyu Zhang<sup>1</sup><sup>,*</sup>, Shuai Wang<sup>4</sup><sup>,5</sup>, Longhao Li<sup>1</sup>, Zhao Guo<sup>1</sup>, Tianlun Zuo<sup>1</sup>, Shuiyuan Wang<sup>1</sup>, Hongfei Xue<sup>1</sup>, Chengyou Wang<sup>1</sup>, Qing Wang<sup>3</sup>, Xin Xu<sup>2</sup>, Hui Bu<sup>2</sup>, Jie Li<sup>3</sup>, Jian Kang<sup>3</sup>, Binbin Zhang<sup>5</sup>, Lei Xie<sup>1</sup><sup>,╀</sup>
</p>

<p align="center">
<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University <br>
<sup>2</sup> Beijing AISHELL Technology Co., Ltd. <br>
<sup>3</sup> Institute of Artificial Intelligence (TeleAI), China Telecom <br>
<sup>4</sup> School of Intelligence Science and Technology, Nanjing University <br>
<sup>5</sup> WeNet Open Source Community <br>
</p>

<p align="center">
📑 <a href="https://arxiv.org/abs/2509.18004">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp;
🐙 <a href="https://github.com/ASLP-lab/WenetSpeech-Chuan">GitHub</a> &nbsp;&nbsp; | &nbsp;&nbsp;
🤗 <a href="https://huggingface.co/collections/ASLP-lab/wenetspeech-chuan-68bade9d02bcb1faece65bda">HuggingFace</a>
<br>
🎤 <a href="https://aslp-lab.github.io/WenetSpeech-Chuan/">Demo Page</a> &nbsp;&nbsp; | &nbsp;&nbsp;
💬 <a href="https://github.com/ASLP-lab/WenetSpeech-Chuan?tab=readme-ov-file#contact">Contact Us</a>
</p>

<div align="center">
<img width="800px" src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/logo/WenetSpeech-Chuan-Logo.png?raw=true" />
</div>

## Dataset

### WenetSpeech-Chuan Overview

* Contains 10,000 hours of richly annotated Chuan-Yu dialect speech, the largest open-source resource for Chuan-Yu dialect speech research.
* Stores metadata in a single JSON file, including audio path, duration, text confidence, speaker identity, SNR, DNSMOS, age, gender, and character-level timestamps. Additional metadata tags may be added in the future.
* Covers ten domains, including Short videos, Entertainment, Live streams, Documentary, Audiobook, Drama, Interview, News, and others.

<div align="center">
<img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/figs/domain.png?raw=true" width="300" style="display:inline-block; margin-right:10px;" />
<img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/blob/main/src/figs/quality_distribution.jpg?raw=true" width="300" style="display:inline-block;" />
</div>

### Metadata Format

We store all audio metadata in a standardized JSON format. The fields are grouped as follows:

* Core fields: `utt_id` (unique identifier for each audio segment), `rover_result` (ROVER result of three ASR transcriptions), `confidence` (confidence score of the text transcription), `jyutping_confidence` (confidence score of Cantonese pinyin transcriptions), and `duration` (audio duration).
* Speaker attributes: `speaker_id`, `gender`, and `age`.
* Audio quality assessment metrics: `sample_rate`, `DNSMOS`, and `SNR`.
* Timestamp information: `timestamp` (records segment boundaries with `start` and `end`).
* Extended metadata under the `meta_info` field: `program` (program name), `region` (geographical information), `link` (original content link), and `domain` (domain classification).

#### 📂 Content Tree

```
WenetSpeech-Chuan
├── metadata.jsonl
├── .gitattributes
└── README.md
```

<!--
WenetSpeech-Chuan
├── metadata.jsonl
│
├── audio_labels/
│   ├── wav_utt_id.jsonl
│   ├── wav_utt_id.jsonl
│   ├── ...
│   └── wav_utt_id.jsonl
│
├── .gitattributes
└── README.md
-->

#### Data sample:

###### metadata.jsonl

{<br>
"utt": audio ID, <br>
"filename": audio file name (type: str), <br>
"text": transcript (type: str), <br>
"domain": reference domain information (type: list[str]), <br>
"gender": speaker gender (type: str), <br>
"age": speaker age label (type: int range, e.g. middle-aged (36~59)), <br>
"wvmos": audio quality score (type: float), <br>
"confidence": transcript confidence (0-1) (type: str), <br>
"emotion": speaker emotion label (type: str, e.g. angry) <br>
} <br>

**example:**

{ <br>
"utt": "013165495633_09mNC_9_5820", <br>
"filename": "013165495633_09mNC_9_5820.wav", <br>
"text": "还是选二手装好了的别墅诚心入如意的直接入住的好好", <br>
"domain": [ <br>
"短视频" <br>
], <br>
"gender": "Male", <br>
"age": "YOUTH", <br>
"wvmos": 2.124380588531494, <br>
"confidence": 0.8333, <br>
"emotion": "angry" <br>
} <br>

<!--
#### Data sample(EN):

###### metadata.jsonl

{ <br>
"utt_id": Original long audio ID, <br>
"wav_utt_id": Converted long audio ID after transforming to WAV format, <br>
"source_audio_path": Path to the original long audio file, <br>
"audio_labels": Path to the label file of short audio segments cut from the converted long audio, <br>
"url": Download link for the original long audio <br>
} <br>

###### audio_labels/wav_utt_id.jsonl:

{ <br>
"wav_utt_id_timestamp": Short audio segment ID, composed of the converted long audio ID + timestamp information (type: str), <br>
"wav_utt_id_timestamp_path": Path to the short audio data (type: str), <br>
"audio_clip_id": Sequence number of this short segment within the long audio, <br>
"timestamp": Timestamp information, <br>
"wvmos_score": WVMOS score, measuring the quality of the audio segment (type: float), <br>
"text": Transcript of the audio segment corresponding to the timestamp (type: str), <br>
"text_punc": Transcript with punctuation (type: str), <br>
"spk_num": Number of speakers in the audio segment, single/multi (type: str), <br>
"confidence": Confidence score of the transcript (type: float), <br>
"emotion": Speaker’s emotion label (type: str, e.g., anger), <br>
"age": Speaker’s age label (type: int range, e.g., middle-aged (36–59)), <br>
"gender": Speaker’s gender label (type: str, e.g., male/female) <br>
} <br>
-->

### WenetSpeech Usage

You can obtain the original video source through the `link` field in the metadata file (`metadata.jsonl`). Segment the audio according to the `timestamp` field to extract the corresponding clip. For pre-processed audio data, please contact us using the information provided below.

## Contact

If you have any questions or would like to collaborate, feel free to reach out to our research team via email: yhdai@mail.nwpu.edu.cn or ziyu_zhang@mail.nwpu.edu.cn.

You’re also welcome to join our WeChat group for technical discussions, updates, and, as mentioned above, access to pre-processed audio data.

<p align="center">
<img src="https://github.com/ASLP-lab/WenetSpeech-Chuan/raw/main/src/figs/wechat.jpg" width="300" alt="WeChat Group QR Code"/>
<em>Scan to join our WeChat discussion group</em>
</p>

<p align="center">
<img src="https://github.com/ASLP-lab/WenetSpeech-Yue/raw/main/figs/npu@aslp.jpeg" width="300" alt="Official Account QR Code"/>
</p>
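As a rough illustration of working with the `metadata.jsonl` records shown in the data sample, the sketch below keeps only utterances whose `confidence` and `wvmos` values clear a threshold. It assumes one JSON object per line with the keys from the sample (`utt`, `text`, `confidence`, `wvmos`). The function name `filter_records` and both default thresholds are hypothetical and not part of the dataset's tooling. Values are passed through `float()` because the field description lists `confidence` as a string while the example shows a number.

```python
import json
from typing import Iterable, Iterator


def filter_records(lines: Iterable[str],
                   min_confidence: float = 0.8,
                   min_wvmos: float = 2.0) -> Iterator[dict]:
    """Yield metadata records that meet both quality thresholds.

    The default thresholds are illustrative assumptions only.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # The values may be stored as strings or numbers; coerce either way.
        confidence = float(record.get("confidence", 0.0))
        wvmos = float(record.get("wvmos", 0.0))
        if confidence >= min_confidence and wvmos >= min_wvmos:
            yield record
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `filter_records(open("metadata.jsonl", encoding="utf-8"))`, where the local path is an assumption.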

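The segmentation step described under Usage can be sketched with Python's standard `wave` module. This assumes the source audio has already been obtained and converted to PCM WAV, and that the `start` and `end` values of `timestamp` are expressed in seconds. The README does not state the unit, so check it against the actual metadata before relying on this. `cut_segment` is a hypothetical helper name.

```python
import wave


def cut_segment(src_path: str, dst_path: str, start: float, end: float) -> int:
    """Copy the [start, end) span of a PCM WAV file into a new WAV file.

    start and end are assumed to be in seconds.
    Returns the number of audio frames written.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert seconds to frame indices and clamp to the file length.
        first = min(int(round(start * rate)), total)
        last = min(int(round(end * rate)), total)
        count = last - first
        src.setpos(first)
        frames = src.readframes(count)
        with wave.open(dst_path, "wb") as dst:
            # Preserve the source format so the clip is directly comparable.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return count
```

Clamping to the file length means a timestamp that runs slightly past the end of the audio yields a shorter clip instead of raising an error. Whether that is the right behavior depends on how the timestamps were produced.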
Provider:
maas
Created:
2025-09-24