Emotional Speech Dataset (ESD)

Name: Emotional Speech Dataset (ESD)
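Creator: Department of Electrical and Computer Engineering, National University of Singapore; Information Systems Technology and Design, Singapore University of Technology and Design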
Published: 2021-02-11 10:30:45
License: No description available

Source: arXiv. Updated: 2021-02-11. Indexed: 2024-06-21.

Download link:

https://github.com/HLTSingapore/Emotional-Speech-Data


Resource overview:

This study releases the Emotional Speech Dataset (ESD), a multilingual, multi-speaker emotional speech dataset designed to support speech synthesis and voice conversion. The dataset contains 350 parallel utterances recorded by 10 native English speakers and 10 native Mandarin speakers, covering five emotional states: happy, sad, neutral, angry, and surprised. The accompanying work uses a pre-trained speech emotion recognition model to characterize emotional style, and uses deep emotional features to disentangle and recombine emotional elements. Applications of ESD include cross-lingual voice conversion and emotional text-to-speech. The dataset aims to address the shortcomings of existing emotional speech datasets by offering richer emotional expression for speech synthesis.

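Provided by:

Department of Electrical and Computer Engineering, National University of Singapore; Information Systems Technology and Design, Singapore University of Technology and Design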

Created:

2020-10-28

Collected Summary

Dataset Introduction

Construction Method

The Emotional Speech Dataset (ESD) is a corpus designed to facilitate research in voice conversion, with a particular focus on emotional style transfer. It was created by recording 350 parallel utterances from 10 native English speakers and 10 native Mandarin speakers, each utterance spoken in five emotions: happy, sad, neutral, angry, and surprised. The audio is sampled at 16 kHz and stored with 16-bit samples. The corpus is balanced across emotions and languages, which lets researchers use it for a range of speech synthesis and voice conversion tasks.
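As a quick sanity check on the figures above, the following minimal sketch verifies that an audio file matches the stated format and computes the corpus size those figures imply. It assumes the audio is stored as WAV files, which this page does not state, and that every speaker records all 350 utterances in all five emotions. The function names are hypothetical.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # 16 kHz, as stated in the description
EXPECTED_SAMPLE_WIDTH = 2     # 16-bit samples occupy 2 bytes


def matches_stated_format(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz with 16-bit samples.

    Assumption: the file is a PCM WAV file readable by the stdlib `wave` module.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_SAMPLE_RATE
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
        )


def implied_utterance_count(utterances: int = 350,
                            speakers: int = 20,
                            emotions: int = 5) -> int:
    """Total recordings if every speaker reads every utterance in every emotion.

    Assumption: the corpus is fully balanced, as the description suggests.
    """
    return utterances * speakers * emotions
```

Under these assumptions the corpus would hold 350 × 20 × 5 = 35,000 recordings. A file that fails `matches_stated_format` may simply have been resampled or re-encoded after download.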

Usage

Work with the ESD dataset typically follows a three-stage process: emotion descriptor training, encoder-decoder training with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN), and run-time conversion. In the first stage, an auxiliary speech emotion recognition (SER) network is trained to extract deep emotional features from input utterances. In the second stage, these features condition the encoder-decoder, which learns to disentangle emotional elements and reconstruct the input features. At run time, the trained model takes a neutral source utterance together with the deep emotional features of the target emotion and generates speech in that emotion. Because this process transfers emotional style without parallel training data, it applies to a wide range of voice conversion tasks.
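The run-time data flow described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the names `condition_frames` and `convert` are hypothetical, conditioning by concatenation is an assumption, and the `toy_*` stand-ins only mimic the interfaces of the trained encoder, SER-based descriptor, and decoder.

```python
from typing import Callable, List, Sequence

Frame = List[float]


def condition_frames(latent: Sequence[Frame], emotion: Frame) -> List[Frame]:
    """Append one utterance-level emotion embedding to every latent frame.

    Assumption: the decoder is conditioned by simple concatenation.
    """
    return [list(frame) + list(emotion) for frame in latent]


def convert(source: Sequence[Frame],
            reference: Sequence[Frame],
            encoder: Callable[[Sequence[Frame]], List[Frame]],
            descriptor: Callable[[Sequence[Frame]], Frame],
            decoder: Callable[[List[Frame]], List[Frame]]) -> List[Frame]:
    """Run-time conversion: encode the source, describe the target emotion, decode."""
    latent = encoder(source)         # content representation of the neutral source
    emotion = descriptor(reference)  # deep emotional features of the target emotion
    return decoder(condition_frames(latent, emotion))


# Toy stand-ins for the trained networks (hypothetical, for illustration only).
def toy_encoder(frames: Sequence[Frame]) -> List[Frame]:
    return [list(f) for f in frames]  # identity; a real encoder learns this mapping


def toy_descriptor(frames: Sequence[Frame]) -> Frame:
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]  # mean over frames


def toy_decoder(conditioned: List[Frame]) -> List[Frame]:
    return conditioned  # a real decoder would map these back to spectral features
```

Calling `convert(source, reference, toy_encoder, toy_descriptor, toy_decoder)` shows only the shape of the data flow. Because the emotion is supplied as a continuous feature vector rather than a fixed label, the same interface accepts a reference utterance in an emotion not seen during training.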

Background and Challenges

Background Overview

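Emotional voice conversion aims to change the emotional prosody of speech while preserving its linguistic content and speaker identity. Prior studies have shown that emotional prosody can be disentangled with an encoder-decoder network conditioned on a discrete representation, such as a one-hot emotion label. Such networks learn to remember a fixed set of emotional styles. The paper proposes a new framework based on a variational autoencoding Wasserstein generative adversarial network (VAW-GAN), which uses a pre-trained speech emotion recognition (SER) model to transfer emotional style during both training and run-time inference. In this way, the network can transfer both seen and unseen emotional styles to a new utterance. The paper also marks the release of a new emotional speech dataset (ESD) for voice conversion, covering multiple speakers and languages.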

Current Challenges

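Challenges associated with this dataset include: 1) the core problem of emotional voice conversion, namely how to convert the emotional prosody of speech effectively while preserving linguistic content and speaker identity; 2) challenges encountered during construction, such as how to design a framework that handles conversion across multiple emotional styles and how to convert to emotional styles not seen in training. The VAW-GAN-based emotional style transfer framework proposed in the paper addresses these problems, and the paper provides the voice conversion research community with a multilingual, multi-speaker emotional speech corpus.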

Common Scenarios

Classic Use Cases

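Emotional voice conversion aims to convert the emotional prosody of speech while preserving linguistic content and speaker identity. The ESD dataset provides multi-speaker, multilingual speech samples, making it a valuable resource for this research. Classic use cases include: 1) emotional text-to-speech (TTS) synthesis, where converting the emotional style makes synthesized speech more vivid and natural; 2) conversational virtual assistants, where converting the emotional style makes the assistant more approachable and human-like.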

Academic Problems Addressed

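The ESD dataset addresses several common problems in emotional voice conversion research: 1) the lack of open-source emotional speech data, a gap that the release of ESD fills; 2) the need for parallel training data, for which ESD provides multi-speaker, multilingual speech samples as a rich data resource; 3) the fact that emotional voice conversion frameworks typically support only one-to-one emotion conversion, whereas ESD supports one-to-many emotion conversion and the generation of unseen emotions.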

Practical Applications

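The ESD dataset has a wide range of practical applications: 1) emotional text-to-speech (TTS) synthesis, where converting the emotional style makes synthesized speech more vivid and natural; 2) conversational virtual assistants, where converting the emotional style makes the assistant more approachable and human-like; 3) voice conversion, where converting the emotional style makes the converted speech more realistic and natural.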

Recent Research on the Dataset