
ExpressiveSpeech

ModelScope community, updated 2026-01-06, indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/FreedomIntelligence/ExpressiveSpeech
Resource description:
# ExpressiveSpeech Dataset

[**Project Webpage**](https://freedomintelligence.github.io/ExpressiveSpeech/) [**中文版 (Chinese Version)**](./README_zh.md)

## About The Dataset

**ExpressiveSpeech** is a high-quality, **expressive**, and **bilingual** (Chinese-English) speech dataset created to address the common lack of consistent vocal expressiveness in existing dialogue datasets.

This dataset is meticulously curated from five renowned open-source emotional dialogue datasets: Expresso, NCSSD, M3ED, MultiDialog, and IEMOCAP. Through a rigorous processing and selection pipeline, ExpressiveSpeech ensures that every utterance meets high standards for both acoustic quality and expressive richness. It is designed for tasks in expressive Speech-to-Speech (S2S), Text-to-Speech (TTS), voice conversion, speech emotion recognition, and other fields requiring high-fidelity, emotionally resonant audio.

## Key Features

- **High Expressiveness**: Achieves a significantly high average expressiveness score of **80.2** by **DeEAR**, far surpassing the original source datasets.
- **Bilingual Content**: Contains a balanced mix of Chinese and English speech, with a language ratio close to **1:1**.
- **Substantial Scale**: Comprises approximately **14,000 utterances**, totaling **51 hours** of audio.
- **Rich Metadata**: Includes ASR-generated text transcriptions, expressiveness scores, and source information for each utterance.

## Dataset Statistics

| Metric | Value |
| :--- | :--- |
| Total Utterances | ~14,000 |
| Total Duration | ~51 hours |
| Languages | Chinese, English |
| Language Ratio (CN:EN) | Approx. 1:1 |
| Sampling Rate | 16kHz |
| Avg. Expressiveness Score (DeEAR) | 80.2 |

## Our Expressiveness Scoring Tool: DeEAR

The high expressiveness of this dataset was achieved using our screening tool, **DeEAR**. If you need to build larger batches of high-expressiveness data yourself, you are welcome to use this tool. You can find it on our [GitHub](https://github.com/FreedomIntelligence/ExpressiveSpeech).

## Data Format

The dataset is organized as follows:

```
ExpressiveSpeech/
├── audio/
│   ├── M3ED
│   │   ├── audio_00001.wav
│   │   └── ...
│   ├── NCSSD
│   ├── IEMOCAP
│   ├── MultiDialog
│   └── Expresso
└── metadata.jsonl
```

- **`metadata.jsonl`**: A jsonl file containing detailed information for each utterance. The metadata includes:
  - `audio_path`: The relative path to the audio file.
  - `value`: The ASR-generated text transcription.
  - `emotion`: Emotion labels from the original datasets.
  - `expressiveness_scores`: The expressiveness score from the **DeEAR** model.

### JSONL Files Example

Each JSONL line contains a `conversations` field with an array of utterances. Example:

```json
{"conversations": [{"No": 9, "from": "user", "value": "Yeah.", "emotion": "happy", "length": 2.027, "score_arousal": 0.9931480884552002, "score_prosody": 0.6800634264945984, "score_nature": 0.9687601923942566, "score_expressive": 0.9892677664756775, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/009_speaker1_53s_55s.wav"}, {"No": 10, "from": "assistant", "value": "What was the reason, what was the, why couldn't I get there, ah I forget.", "emotion": "happy", "length": 3.753, "score_arousal": 0.9555678963661194, "score_prosody": 0.6498672962188721, "score_nature": 1.030701756477356, "score_expressive": 0.9965837001800537, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/010_speaker2_55s_59s.wav"}]}
{"conversations": [{"No": 10, "from": "user", "value": "What was the reason, what was the, why couldn't I get there, ah I forget.", "emotion": "happy", "length": 3.753, "score_arousal": 0.9555678963661194, "score_prosody": 0.6498672962188721, "score_nature": 1.030701756477356, "score_expressive": 0.9965837001800537, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/010_speaker2_55s_59s.wav"}, {"No": 11, "from": "assistant", "value": "Because genie really had to go and and to the bathroom and she couldn't find a place to do it and so she when they put the tent on it it was it was a bad mess and they shouldn't have done that.", "emotion": "happy", "length": 10.649, "score_arousal": 0.976757287979126, "score_prosody": 0.7951533794403076, "score_nature": 0.9789049625396729, "score_expressive": 0.919080913066864, "audio-path": "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/011_speaker1_58s_69s.wav"}]}
```

*Note*: Some source datasets applied VAD, which could split a single utterance into multiple segments. To maintain conversational integrity, we applied rules to merge such segments back into complete utterances.

## License

In line with the non-commercial restrictions of its source datasets, the ExpressiveSpeech dataset is released under the CC BY-NC-SA 4.0 license. You can view the full license [here](https://creativecommons.org/licenses/by-nc-sa/4.0/).

## Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@article{lin2025decoding,
  title={Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment},
  author={Lin, Zhiyu and Yang, Jingwen and Zhao, Jiale and Liu, Meng and Li, Sunzhu and Wang, Benyou},
  journal={arXiv preprint arXiv:2510.20513},
  year={2025}
}
```

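As a rough illustration of how the `conversations` records in the JSONL example might be consumed, the sketch below walks the lines, collects each utterance once, and keeps those above an expressiveness threshold. It reads the keys that appear in the example records (`audio-path`, `score_expressive`, `length`) rather than the names in the field list. The helper names `iter_utterances` and `select_expressive` and the `0.95` threshold are assumptions made for illustration; they are not part of the dataset or of DeEAR. The sample records are abridged from the example above, with the text and the other score fields dropped.

```python
import json
from typing import Iterable, Iterator, List


def iter_utterances(lines: Iterable[str]) -> Iterator[dict]:
    """Yield every utterance dict found in the `conversations` array of each JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        yield from record.get("conversations", [])


def select_expressive(lines: Iterable[str], min_score: float = 0.95) -> List[dict]:
    """Keep each utterance once (keyed by `audio-path`) if its `score_expressive` >= min_score.

    The threshold is a hypothetical choice on the 0-1 scale seen in the example records.
    """
    seen = set()
    kept = []
    for utt in iter_utterances(lines):
        path = utt.get("audio-path")
        if path in seen:
            continue  # the same utterance can appear in consecutive lines
        seen.add(path)
        if utt.get("score_expressive", 0.0) >= min_score:
            kept.append(utt)
    return kept


# Sample data abridged from the README example.
PREFIX = "audios/Expresso/splitted_conversation/ex04-ex01/laughing/ex04-ex01_laughing_001/"
u9 = {"No": 9, "from": "user", "length": 2.027,
      "score_expressive": 0.9892677664756775,
      "audio-path": PREFIX + "009_speaker1_53s_55s.wav"}
u10 = {"No": 10, "from": "assistant", "length": 3.753,
       "score_expressive": 0.9965837001800537,
       "audio-path": PREFIX + "010_speaker2_55s_59s.wav"}
u11 = {"No": 11, "from": "assistant", "length": 10.649,
       "score_expressive": 0.919080913066864,
       "audio-path": PREFIX + "011_speaker1_58s_69s.wav"}

sample_lines = [
    json.dumps({"conversations": [u9, u10]}),
    # Utterance 10 reappears as the "user" turn of the next line, as in the example.
    json.dumps({"conversations": [{**u10, "from": "user"}, u11]}),
]

kept = select_expressive(sample_lines, min_score=0.95)
kept_ids = [u["No"] for u in kept]
total_seconds = sum(u["length"] for u in kept)
print(kept_ids, round(total_seconds, 3))
```

With the real file, the same functions could be fed an open file handle, for example `select_expressive(open("metadata.jsonl", encoding="utf-8"))`. Deduplicating by `audio-path` matters because, in the example, utterance 10 appears in two consecutive lines, once as the assistant turn and once as the user turn.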
Provider:
maas
Created:
2025-09-23