ModelScope download link: https://modelscope.cn/datasets/pengzhendong/NCSSD
# NCSSD
## 🎉Introduction
This is the official repository for the NCSSD dataset and the collection pipeline used to process TV shows, as described in [Generative Expressive Conversational Speech Synthesis](https://arxiv.org/pdf/2407.21491) (accepted by MM'2024).
[Rui Liu *](https://ttslr.github.io/), Yifan Hu, [Yi Ren](https://rayeren.github.io/), Xiang Yin, [Haizhou Li](https://colips.org/~eleliha/).
## 📜NCSSD Overview
NCSSD includes two recording subsets (R-ZH, R-EN) and two collection subsets (C-ZH, C-EN).
<div align=center><img width="500" height="340" src="image-1.png"/></div>
## 📣NCSSD Download
⭐ Hugging Face download address: [NCSSD](https://huggingface.co/datasets/walkerhyf/NCSSD).
⭐ Users in China can email ``hyfwalker@163.com`` (📧) to request the Baidu Cloud address. Please include the necessary information, such as your name, organization, and profession.
## 💻Collection Subset Pipeline
<div align=center><img width="800" height="220" src="image.png"/></div>
### 1. Video Selection
#### 1.1 Prepare the TV shows and name each file **TV name-episode number**.
#### 1.2 Extract the audio from the MKV videos (*input_video_path*: input video file; *output_audio_path*: output audio file).
```
python ./step-0.py --input_video_path "xxx.mkv" --output_audio_path "xxx.wav"
```
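The contents of `step-0.py` are not reproduced here. As a rough sketch of what such an extraction step could look like, assuming `ffmpeg` is installed and on `PATH`, the helper below builds an ffmpeg command that drops the video stream and writes PCM WAV audio. The function names, the 16 kHz sample rate, and the mono channel layout are illustrative assumptions, not the repository's actual settings.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(input_video_path, output_audio_path, sample_rate=16000):
    """Return the ffmpeg argument list for extracting audio as mono PCM WAV.

    The 16 kHz mono default is an assumption made for illustration only.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(input_video_path),
        "-vn",                     # ignore the video stream
        "-acodec", "pcm_s16le",    # 16-bit PCM, the usual WAV encoding
        "-ac", "1",                # mix down to one channel
        "-ar", str(sample_rate),   # resample to the target rate
        str(output_audio_path),
    ]


def extract_audio(input_video_path, output_audio_path, sample_rate=16000):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    Path(output_audio_path).parent.mkdir(parents=True, exist_ok=True)
    cmd = build_ffmpeg_command(input_video_path, output_audio_path, sample_rate)
    subprocess.run(cmd, check=True)
```

Keeping the command builder separate from the call that runs it makes the arguments easy to inspect before processing a whole season of episodes.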
<!-- Dialogue Scene Extraction -->
### 2. Dialogue Scene Extraction
#### 2.1 Use VAD to segment the speech audio: split it into two segments wherever the silent interval exceeds 4 seconds, and retain only segments that are longer than 15 seconds and contain more than 30% valid speech.
```
python ./step-1.py --audio_root_path "xxx"
```
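The segmentation rule in 2.1 can be sketched in plain Python. The sketch assumes that some VAD has already produced a list of `(start, end)` speech intervals in seconds; which VAD `step-1.py` uses and how it computes "valid speech" are not stated here, so the function names and the ratio definition (speech time divided by segment length) are assumptions.

```python
def group_speech_intervals(intervals, max_silence=4.0):
    """Group sorted (start, end) speech intervals into segments.

    A silent gap longer than max_silence seconds starts a new segment.
    """
    segments = []
    for start, end in intervals:
        if segments and start - segments[-1][-1][1] <= max_silence:
            segments[-1].append((start, end))
        else:
            segments.append([(start, end)])
    return segments


def select_segments(intervals, max_silence=4.0, min_ratio=0.30, min_duration=15.0):
    """Keep segments longer than min_duration whose speech ratio exceeds min_ratio."""
    kept = []
    for group in group_speech_intervals(sorted(intervals), max_silence):
        seg_start, seg_end = group[0][0], group[-1][1]
        duration = seg_end - seg_start
        speech = sum(end - start for start, end in group)
        # The duration check runs first, so a zero-length segment never divides by zero.
        if duration > min_duration and speech / duration > min_ratio:
            kept.append((seg_start, seg_end))
    return kept
```

The three thresholds are exposed as parameters so that the 4-second, 30%, and 15-second values from the text can be adjusted without touching the grouping logic.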
<!-- Demucs -->
#### 2.2 Use Demucs for vocal and background separation.
##### (1) To install Demucs, follow the official installation instructions at [https://github.com/facebookresearch/demucs](https://github.com/facebookresearch/demucs).
##### (2) Use Demucs to separate vocals from background sounds, and keep the vocal segments with SNR<=4.
```
python ./step-2.py --audio_root_path "xxx"
```
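How `step-2.py` measures SNR is not described here. As a minimal sketch under stated assumptions, the code below treats the separated vocal stem as the signal and the background stem as the noise, and reports their power ratio in decibels. The dB unit, the function names, and the `eps` guard are assumptions; the comparison simply mirrors the "SNR<=4" wording above.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(vocals, background, eps=1e-12):
    """SNR in dB, with vocals as signal and the background stem as noise.

    eps avoids log(0) and division by zero for silent stems.
    """
    ratio = (mean_power(vocals) + eps) / (mean_power(background) + eps)
    return 10.0 * math.log10(ratio)


def meets_threshold(vocals, background, threshold=4.0):
    """Mirror the 'SNR<=4' condition from the text (unit assumed to be dB)."""
    return snr_db(vocals, background) <= threshold
```

In practice the two sample sequences would come from the stems Demucs writes for the same segment, loaded at the same sample rate and length.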
<!-- sepformer -->
#### 2.3 Use SepFormer for voice enhancement.
##### (1) To install SepFormer, follow the instructions at [https://huggingface.co/speechbrain/sepformer-dns4-16k-enhancement](https://huggingface.co/speechbrain/sepformer-dns4-16k-enhancement). (*vocals_16k_path* is the folder generated in the previous step, located in the **one-step** directory.)
```
python ./step-3.py --vocals_16k_path "yyy"
```
<!-- Speaker -->
### 3. Dialogue Segment Extraction
We use [Volcengine](https://console.volcengine.com/speech/app) for speaker recognition to extract the different conversation scenes. Please configure the ASR information (``appid``, ``token``) and the OSS information (``access_key_id``, ``access_key_secret``, ``bucket_name``), which is used to generate the URLs passed to ASR.
```
python ./step-4.py --audio_root_path "xxx"
```
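Where these credentials are set inside `step-4.py` is not shown here. One possible way to keep them out of the source code, sketched below, is to read them from environment variables and fail early when any are missing. The field names follow the text above; the environment variable prefixes (`VOLC_`, `OSS_`) and the helper `load_from_env` are hypothetical.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AsrConfig:
    appid: str
    token: str


@dataclass(frozen=True)
class OssConfig:
    access_key_id: str
    access_key_secret: str
    bucket_name: str


def load_from_env(cls, prefix, env=None):
    """Build a config object from variables named <prefix><FIELD_NAME>.

    The prefix convention is an assumption for this sketch.
    """
    env = os.environ if env is None else env
    values, missing = {}, []
    for field in fields(cls):
        key = f"{prefix}{field.name.upper()}"
        if env.get(key):
            values[field.name] = env[key]
        else:
            missing.append(key)
    if missing:
        # Report every missing key at once instead of failing one by one.
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return cls(**values)
```

A call such as `load_from_env(OssConfig, "OSS_")` would then raise before any audio is uploaded if a key has not been exported.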
### 4. Dialogue Script Recognition
#### Use Aliyun's ASR service for re-recognition and correction.
We use [Aliyun's ASR](https://ai.aliyun.com/nls/filetrans?spm=5176.28508143.nav-v2-dropdown-menu-0.d_main_9_1_1_1.5421154aIHmaWo&scm=20140722.X_data-b7a761a1c730419a6c79._.V_1) for dialogue script recognition. Please configure the ASR information (``accessKeyId``, ``accessKeySecret``) and the OSS information (``access_key_id``, ``access_key_secret``, ``bucket_name``), which is used to generate the URLs passed to ASR.
⚠ ``appkey``: check that it is configured for the right language (Chinese or English).
```
python ./step-5.py --audio_root_path "xxx"
```
### 5. Organizing the Data
Organize the outputs of the steps above into a standard format; *result_path* is the output path.
```
python step-6.py --audio_root_path "xxx" --result_path "yyy"
```
🎉🎉🎉 ***Congratulations! The dataset was created successfully!***
## Citations
```bibtex
@inproceedings{10.1145/3664647.3681697,
author = {Liu, Rui and Hu, Yifan and Ren, Yi and Yin, Xiang and Li, Haizhou},
title = {Generative Expressive Conversational Speech Synthesis},
year = {2024},
isbn = {9798400706868},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3664647.3681697},
doi = {10.1145/3664647.3681697},
abstract = {Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting. Existing CSS methods employ effective multi-modal context modeling techniques to achieve empathy understanding and expression. However, they often need to design complex network architectures and meticulously optimize the modules within them. In addition, due to the limitations of small-scale datasets containing scripted recording styles, they often fail to simulate real natural conversational styles. To address the above issues, we propose a novel generative expressive CSS system, termed GPT-Talker. We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context. Leveraging the power of GPT, we predict the token sequence, that includes both semantic and style knowledge, of response for the agent. After that, the expressive conversational speech is synthesized by the conversation-enriched VITS to deliver feedback to the user. Furthermore, we propose a large-scale Natural CSS Dataset called NCSSD, that includes both naturally recorded conversational speech in improvised styles and dialogues extracted from TV shows. It encompasses both Chinese and English languages, with a total duration of 236 hours. We conducted comprehensive experiments on the reliability of the NCSSD and the effectiveness of our GPT-Talker. Both subjective and objective evaluations demonstrate that our model outperforms other state-of-the-art CSS systems significantly in terms of naturalness and expressiveness. The Code, Dataset, and Pre-trained Model are available at: https://github.com/AI-S2-Lab/GPT-Talker.},
booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
pages = {4187–4196},
numpages = {10},
keywords = {conversational speech synthesis (css), expressiveness, gpt, user-agent conversation},
location = {Melbourne VIC, Australia},
series = {MM '24}
}
```
⚠ The collected TV show clips all come from public resources on the Internet. If any of them infringe your rights, please contact us and we will delete them. (📧: ``hyfwalker@163.com``)
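Provider: maas
Created: 2024-11-27

## Background Overview
NCSSD is a large-scale natural conversational speech synthesis dataset containing 236 hours of Chinese and English recordings. It comprises recording and collection subsets and is intended mainly for research on expressive conversational speech synthesis. The dataset comes with a full description of the data collection and processing pipeline, and the accompanying paper was published at ACM Multimedia 2024.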