ExpressiveSpeech
ModelScope community (魔搭社区) · Updated 2026-01-06 · Indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/FreedomIntelligence/ExpressiveSpeech
# ExpressiveSpeech Dataset
[**Project Webpage**](https://freedomintelligence.github.io/ExpressiveSpeech/)
[**中文版 (Chinese Version)**](./README_zh.md)
## About The Dataset
**ExpressiveSpeech** is a high-quality, **expressive**, and **bilingual** (Chinese-English) speech dataset created to address the common lack of consistent vocal expressiveness in existing dialogue datasets.
This dataset is meticulously curated from five renowned open-source emotional dialogue datasets: Expresso, NCSSD, M3ED, MultiDialog, and IEMOCAP. Through a rigorous processing and selection pipeline, ExpressiveSpeech ensures that every utterance meets high standards for both acoustic quality and expressive richness. It is designed for tasks in expressive Speech-to-Speech (S2S), Text-to-Speech (TTS), voice conversion, speech emotion recognition, and other fields requiring high-fidelity, emotionally resonant audio.
## Key Features
- **High Expressiveness**: Achieves an average expressiveness score of **80.2** as measured by **DeEAR**, far surpassing the original source datasets.
- **Bilingual Content**: Contains a balanced mix of Chinese and English speech, with a language ratio close to **1:1**.
- **Substantial Scale**: Comprises approximately **14,000 utterances**, totaling **51 hours** of audio.
- **Rich Metadata**: Includes ASR-generated text transcriptions, expressiveness scores, and source information for each utterance.
## Dataset Statistics
| Metric | Value |
| :--- | :--- |
| Total Utterances | ~14,000 |
| Total Duration | ~51 hours |
| Languages | Chinese, English |
| Language Ratio (CN:EN) | Approx. 1:1 |
| Sampling Rate | 16kHz |
| Avg. Expressiveness Score (DeEAR) | 80.2 |
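As a rough sanity check on the figures in the table, the mean utterance length can be estimated from the totals. Both inputs are the approximate values quoted above (~51 hours, ~14,000 utterances), so the result is only an estimate; the function name is illustrative and not part of the dataset tooling.

```python
def average_utterance_seconds(total_hours: float, n_utterances: int) -> float:
    """Estimate mean utterance length in seconds from total duration and count."""
    return total_hours * 3600 / n_utterances


# Approximate totals taken from the statistics table.
avg_seconds = average_utterance_seconds(51, 14_000)
print(f"{avg_seconds:.1f} s per utterance on average")
```

With the rounded totals this comes to roughly 13 seconds per utterance.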
## Our Expressiveness Scoring Tool: DeEAR
The high expressiveness of this dataset was achieved using our screening tool, **DeEAR**. If you need to build larger batches of high-expressiveness data yourself, you are welcome to use this tool. You can find it on our [GitHub](https://github.com/FreedomIntelligence/ExpressiveSpeech).
## Data Format
The dataset is organized as follows:
```
ExpressiveSpeech/
├── audio/
│ ├── M3ED
│ │ ├── audio_00001.wav
│ │ └── ...
│ ├── NCSSD
│ ├── IEMOCAP
│ ├── MultiDialog
│ └── Expresso
└── metadata.jsonl
```
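Assuming the dataset has been downloaded locally with the layout shown above, the following minimal sketch counts the `.wav` files under each source directory. The function name and the `audio_dir` parameter are illustrative assumptions. The tree above names the folder `audio/`, while the paths in the example records start with `audios/`, so the folder name is left configurable.

```python
from collections import Counter
from pathlib import Path


def count_wavs_per_source(dataset_root: str, audio_dir: str = "audio") -> Counter:
    """Count .wav files under each top-level source directory (M3ED, NCSSD, ...)."""
    audio_root = Path(dataset_root) / audio_dir
    counts: Counter = Counter()
    for wav in audio_root.rglob("*.wav"):
        # The first path component below audio_root names the source dataset.
        source = wav.relative_to(audio_root).parts[0]
        counts[source] += 1
    return counts
```

For example, `count_wavs_per_source("ExpressiveSpeech")` returns a `Counter` keyed by source dataset name, assuming the download sits in a local `ExpressiveSpeech/` directory.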
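- **`metadata.jsonl`**: A JSON Lines file containing detailed information for each utterance. The metadata includes:
  - `audio_path`: The relative path to the audio file (spelled `audio-path` in the example records below).
  - `value`: The ASR-generated text transcription.
  - `emotion`: The emotion label from the original dataset.
  - `expressiveness_scores`: The expressiveness scores from the **DeEAR** model (stored in the example records below as `score_arousal`, `score_prosody`, `score_nature`, and `score_expressive`).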
### JSONL File Example
Each JSONL line contains a `conversations` field with an array of utterances.
Example:
```json
{"conversations": [{"No": 9, "from": "user", "value": "Yeah.", "emotion": "happy", "length": 2.027, "score_arousal": 0.9931480884552002, "score_prosody": 0.6800634264945984, "score_nature": 0.9687601923942566, "score_expressive": 0.9892677664756775, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/009_speaker1_53s_55s.wav"}, {"No": 10, "from": "assistant", "value": "What was the reason, what was the, why couldn't I get there, ah I forget.", "emotion": "happy", "length": 3.753, "score_arousal": 0.9555678963661194, "score_prosody": 0.6498672962188721, "score_nature": 1.030701756477356, "score_expressive": 0.9965837001800537, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/010_speaker2_55s_59s.wav"}]}
{"conversations": [{"No": 10, "from": "user", "value": "What was the reason, what was the, why couldn't I get there, ah I forget.", "emotion": "happy", "length": 3.753, "score_arousal": 0.9555678963661194, "score_prosody": 0.6498672962188721, "score_nature": 1.030701756477356, "score_expressive": 0.9965837001800537, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/010_speaker2_55s_59s.wav"}, {"No": 11, "from": "assistant", "value": "Because genie really had to go and and to the bathroom and she couldn't find a place to do it and so she when they put the tent on it it was it was a bad mess and they shouldn't have done that.", "emotion": "happy", "length": 10.649, "score_arousal": 0.976757287979126, "score_prosody": 0.7951533794403076, "score_nature": 0.9789049625396729, "score_expressive": 0.919080913066864, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/011_speaker1_58s_69s.wav"}]}
```
*Note*: Some source datasets applied VAD, which could split a single utterance into multiple segments. To maintain conversational integrity, we applied rules to merge such segments back into complete utterances.
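Records shaped like the example above can be read with the standard `json` module. The sketch below is a minimal illustration, not an official loader: the function names and the `min_score` threshold of 0.9 are assumptions, and the key names (`conversations`, `score_expressive`, `audio-path`) are taken from the example records. Because the same utterance can appear in consecutive records (once as `assistant`, then as `user`), the filter de-duplicates by audio path.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_utterances(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield every utterance dict from JSONL lines that hold a `conversations` array."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield from record.get("conversations", [])


def filter_expressive(utterances: Iterable[Dict], min_score: float = 0.9) -> List[Dict]:
    """Keep utterances with `score_expressive` >= min_score, one per audio path."""
    seen = set()
    kept = []
    for utt in utterances:
        path = utt.get("audio-path")
        if path is not None and path in seen:
            continue  # same clip already kept from an earlier record
        if utt.get("score_expressive", 0.0) >= min_score:
            if path is not None:
                seen.add(path)
            kept.append(utt)
    return kept
```

Since an open text file iterates line by line, a handle from `open("metadata.jsonl", encoding="utf-8")` can be passed directly to `iter_utterances`. Note that the per-utterance scores in the example lie roughly between 0 and 1, whereas the dataset-level average is reported as 80.2, so any threshold should be chosen on the per-utterance scale.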
## License
In line with the non-commercial restrictions of its source datasets, the ExpressiveSpeech dataset is released under the CC BY-NC-SA 4.0 license.
You can view the full license [here](https://creativecommons.org/licenses/by-nc-sa/4.0/).
## Citation
If you use this dataset in your research, please cite our paper:
```bibtex
@article{lin2025decoding,
title={Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment},
author={Lin, Zhiyu and Yang, Jingwen and Zhao, Jiale and Liu, Meng and Li, Sunzhu and Wang, Benyou},
journal={arXiv preprint arXiv:2510.20513},
year={2025}
}
```
Provider:
maas
Created:
2025-09-23



