ExpressiveSpeech

Name: ExpressiveSpeech
Creator: maas
Published: 2026-01-06 16:46:50
License: No description provided

ModelScope community; updated 2026-01-06; indexed 2025-09-27

Download link:

https://modelscope.cn/datasets/FreedomIntelligence/ExpressiveSpeech

Description:

# ExpressiveSpeech Dataset

[**Project Webpage**](https://freedomintelligence.github.io/ExpressiveSpeech/) [**中文版 (Chinese Version)**](./README_zh.md)

## About The Dataset

**ExpressiveSpeech** is a high-quality, **expressive**, and **bilingual** (Chinese-English) speech dataset created to address the common lack of consistent vocal expressiveness in existing dialogue datasets.

This dataset is meticulously curated from five renowned open-source emotional dialogue datasets: Expresso, NCSSD, M3ED, MultiDialog, and IEMOCAP. Through a rigorous processing and selection pipeline, ExpressiveSpeech ensures that every utterance meets high standards for both acoustic quality and expressive richness. It is designed for tasks in expressive Speech-to-Speech (S2S), Text-to-Speech (TTS), voice conversion, speech emotion recognition, and other fields requiring high-fidelity, emotionally resonant audio.

## Key Features

- **High Expressiveness**: Achieves a significantly high average expressiveness score of **80.2** by **DeEAR**, far surpassing the original source datasets.
- **Bilingual Content**: Contains a balanced mix of Chinese and English speech, with a language ratio close to **1:1**.
- **Substantial Scale**: Comprises approximately **14,000 utterances**, totaling **51 hours** of audio.
- **Rich Metadata**: Includes ASR-generated text transcriptions, expressiveness scores, and source information for each utterance.

## Dataset Statistics

| Metric | Value |
| :--- | :--- |
| Total Utterances | ~14,000 |
| Total Duration | ~51 hours |
| Languages | Chinese, English |
| Language Ratio (CN:EN) | Approx. 1:1 |
| Sampling Rate | 16kHz |
| Avg. Expressiveness Score (DeEAR) | 80.2 |

## Our Expressiveness Scoring Tool: DeEAR

The high expressiveness of this dataset was achieved using our screening tool, **DeEAR**. If you need to build larger batches of high-expressiveness data yourself, you are welcome to use this tool. You can find it on our [GitHub](https://github.com/FreedomIntelligence/ExpressiveSpeech).

## Data Format

The dataset is organized as follows:

```
ExpressiveSpeech/
├── audio/
│   ├── M3ED
│   │   ├── audio_00001.wav
│   │   └── ...
│   ├── NCSSD
│   ├── IEMOCAP
│   ├── MultiDialog
│   └── Expresso
└── metadata.jsonl
```

- **`metadata.jsonl`**: A jsonl file containing detailed information for each utterance. The metadata includes:
  - `audio_path`: The relative path to the audio file.
  - `value`: The ASR-generated text transcription.
  - `emotion`: Emotion labels from the original datasets.
  - `expressiveness_scores`: The expressiveness score from the **DeEAR** model.

### JSONL Files Example

Each JSONL line contains a `conversations` field with an array of utterances. Example:

```json
{"conversations": [{"No": 9, "from": "user", "value": "Yeah.", "emotion": "happy", "length": 2.027, "score_arousal": 0.9931480884552002, "score_prosody": 0.6800634264945984, "score_nature": 0.9687601923942566, "score_expressive": 0.9892677664756775, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/009_speaker1_53s_55s.wav"}, {"No": 10, "from": "assistant", "value": "What was the reason, what was the, why couldn't I get there, ah I forget.", "emotion": "happy", "length": 3.753, "score_arousal": 0.9555678963661194, "score_prosody": 0.6498672962188721, "score_nature": 1.030701756477356, "score_expressive": 0.9965837001800537, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/010_speaker2_55s_59s.wav"}]}
{"conversations": [{"No": 10, "from": "user", "value": "What was the reason, what was the, why couldn't I get there, ah I forget.", "emotion": "happy", "length": 3.753, "score_arousal": 0.9555678963661194, "score_prosody": 0.6498672962188721, "score_nature": 1.030701756477356, "score_expressive": 0.9965837001800537, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/010_speaker2_55s_59s.wav"}, {"No": 11, "from": "assistant", "value": "Because genie really had to go and and to the bathroom and she couldn't find a place to do it and so she when they put the tent on it it was it was a bad mess and they shouldn't have done that.", "emotion": "happy", "length": 10.649, "score_arousal": 0.976757287979126, "score_prosody": 0.7951533794403076, "score_nature": 0.9789049625396729, "score_expressive": 0.919080913066864, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/011_speaker1_58s_69s.wav"}]}
```

*Note*: Some source datasets applied VAD, which could split a single utterance into multiple segments. To maintain conversational integrity, we applied rules to merge such segments back into complete utterances.

## License

In line with the non-commercial restrictions of its source datasets, the ExpressiveSpeech dataset is released under the CC BY-NC-SA 4.0 license. You can view the full license [here](https://creativecommons.org/licenses/by-nc-sa/4.0/).

## Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@article{lin2025decoding,
  title={Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment},
  author={Lin, Zhiyu and Yang, Jingwen and Zhao, Jiale and Liu, Meng and Li, Sunzhu and Wang, Benyou},
  journal={arXiv preprint arXiv:2510.20513},
  year={2025}
}
```
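A minimal sketch of reading records shaped like the JSONL example in the Data Format section and keeping only highly expressive utterances. It assumes the key names seen in that example (`conversations`, `score_expressive`, `length`, `audio-path`); because the field list names the path key `audio_path`, the helper accepts either spelling. The function names, the `SAMPLE_LINES` data (abridged from the example) and the `0.95` threshold are illustrative assumptions, not part of the dataset or of DeEAR.

```python
import json
from typing import Iterable, Iterator, List, Optional


def iter_utterances(lines: Iterable[str]) -> Iterator[dict]:
    """Yield every utterance dict found in the `conversations` array of each JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield from record.get("conversations", [])


def audio_path_of(utt: dict) -> Optional[str]:
    """Return the audio path; the example uses "audio-path", the field list "audio_path"."""
    return utt.get("audio-path", utt.get("audio_path"))


def select_expressive(lines: Iterable[str], min_score: float = 0.95) -> List[dict]:
    """Keep utterances with score_expressive >= min_score (hypothetical threshold).

    In the example, consecutive records share a turn (No 10 appears twice),
    so utterances are de-duplicated by audio path.
    """
    seen = set()
    selected = []
    for utt in iter_utterances(lines):
        path = audio_path_of(utt)
        if path is not None:
            if path in seen:
                continue
            seen.add(path)
        if utt.get("score_expressive", 0.0) >= min_score:
            selected.append(utt)
    return selected


# Sample data abridged from the example records (only the fields used here).
PREFIX = "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/"


def _utt(no: int, role: str, score: float, length: float, name: str) -> dict:
    return {"No": no, "from": role, "length": length,
            "score_expressive": score, "audio-path": PREFIX + name}


SAMPLE_LINES = [
    json.dumps({"conversations": [
        _utt(9, "user", 0.9892677664756775, 2.027, "009_speaker1_53s_55s.wav"),
        _utt(10, "assistant", 0.9965837001800537, 3.753, "010_speaker2_55s_59s.wav"),
    ]}),
    json.dumps({"conversations": [
        _utt(10, "user", 0.9965837001800537, 3.753, "010_speaker2_55s_59s.wav"),
        _utt(11, "assistant", 0.919080913066864, 10.649, "011_speaker1_58s_69s.wav"),
    ]}),
]

selected = select_expressive(SAMPLE_LINES, min_score=0.95)
total_seconds = sum(u["length"] for u in selected)
print([u["No"] for u in selected], round(total_seconds, 3))
```

For the real file, the same functions can be fed an open handle, for example `select_expressive(open("metadata.jsonl", encoding="utf-8"))`, since a text file iterates line by line. De-duplicating by audio path is one possible design choice; whether shared turns should be counted once or per conversation pair depends on the downstream task.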

Provider:

maas

Created:

2025-09-23
