Emolia
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/laion/Emolia
Description:
### **Dataset Card for Emolia**
#### **Dataset Description**
This dataset is an enhanced version of the Emilia dataset, enriched with detailed emotion annotations. The annotations were generated using models from the EmoNet suite to provide deeper insight into the emotional content of speech. This work is based on the research and models described in the blog post "Do They See What We See?".
The annotations include 54 scores for each sample, covering a wide range of emotional and paralinguistic attributes, as well as an emotion caption generated by the BUD-E Whisper model. The goal is to enable more nuanced research and development in emotionally intelligent AI.
**Emolia** also provides a **timbre speaker embedding** for each sample, computed with **Speaker-wavLM-tbr** (Orange). The model targets **global timbral characteristics** and is intended to be **less sensitive to transient prosodic/emotional cues** than typical speaker-ID embeddings. This matches recent work on disentangling timbre from prosody to make speaker modeling more robust. ([Hugging Face][1])
---
#### **Dataset Structure & Access**
The dataset includes the original Emilia audio data **plus**:
* **Emotion annotations** (40 categories + 14 attributes + emotion caption).
* **Timbre speaker embeddings** (vector per sample from Speaker-wavLM-tbr). ([Hugging Face][1])
* **Characters per second (CPS)** statistic per sample.
**Format:** WebDataset (sharded `.tar` files) containing audio and JSON/CSV sidecars for annotations, embeddings, and CPS.
> Note: This repository **supersedes** earlier, interim multi-repo distributions. All links to those prior repositories have been removed.
The original `.tar` files for the Emilia dataset are also included. Files belonging to the YODAS subset can be identified by a suffix in their filenames.
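As a rough illustration of the WebDataset layout described above, the following sketch groups the files of one shard by sample key using only the Python standard library. It relies on the general WebDataset convention that files sharing a basename belong to the same sample. The file names and the JSON payload in the demo shard are hypothetical; the actual member names and sidecar fields in this repository are not specified here and should be checked against a real shard.

```python
import io
import tarfile
from collections import defaultdict


def group_shard_members(fileobj):
    """Group the files in one WebDataset-style shard by sample key.

    Convention assumed: the key is the path up to the first dot of the
    file name, and everything after that dot is the extension.
    """
    samples = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            directory, _, filename = member.name.rpartition("/")
            stem, _, extension = filename.partition(".")
            key = f"{directory}/{stem}" if directory else stem
            samples[key][extension] = tar.extractfile(member).read()
    return dict(samples)


def build_demo_shard():
    """Build a tiny in-memory shard. All names and contents are made up."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [
            ("sample_000.mp3", b"fake-audio-bytes"),
            ("sample_000.json", b'{"cps": 14.2}'),
        ]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


samples = group_shard_members(build_demo_shard())
for key, files in samples.items():
    print(key, sorted(files))
```

For a real shard, pass an opened file object (for example `open(path, "rb")`) instead of the demo buffer.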
---
#### **Speaker Embeddings (Timbre-Focused)**
* **Model:** Orange/**Speaker-wavLM-tbr** (WavLM-based).
* **Intent:** Produce embeddings that **globally represent speaker timbre** and **de-emphasize emotion/prosody**, so that they stay more stable when the speaker's emotional state varies. Embeddings can be clustered or compared with cosine similarity, as in automatic speaker verification (ASV), but with a **timbre emphasis**. ([Hugging Face][1])
* **Background:** WavLM is a self-supervised model family designed for robust speech representation and speaker identity preservation; disentangling timbre from prosody has been shown to improve robustness in several tasks. ([arXiv][2])
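The cosine-similarity comparison mentioned above can be sketched as follows. This is a minimal standard-library version that assumes each embedding has already been loaded as a plain list of floats; the storage format and dimensionality of the embeddings in this repository are not assumed here, and the example vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings.
speaker_a = [0.2, 0.9, -0.1]
speaker_b = [0.4, 1.8, -0.2]   # same direction as speaker_a
speaker_c = [0.9, -0.2, 0.0]   # orthogonal to speaker_a

print(cosine_similarity(speaker_a, speaker_b))
print(cosine_similarity(speaker_a, speaker_c))
```

Values close to 1 indicate similar timbre under this embedding. Any decision threshold for "same speaker" would have to be tuned on held-out data.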
---
#### **Derived Metric: Characters per Second (CPS)**
For each utterance we compute **CPS = total\_characters / audio\_duration\_seconds**.
CPS values are stored alongside the other per-sample annotations in this repository.
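The CPS formula above amounts to a single division. The sketch below counts every character of the transcript, including spaces and punctuation. That counting rule is an assumption, since the card does not state how `total_characters` is obtained.

```python
def characters_per_second(transcript, duration_seconds):
    """CPS = total characters / audio duration in seconds.

    Assumption: every character of the transcript counts, including
    spaces and punctuation.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(transcript) / duration_seconds


# An 11-character transcript spoken over 2 seconds.
print(characters_per_second("hello world", 2.0))  # 5.5
```

Because writing systems differ in information per character, CPS values are likely only comparable within one language.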
---
#### **Dataset Statistics**
This combined dataset comprises approximately **215,600 hours** of speech, merging the original Emilia dataset with a large portion of the YODAS dataset. The inclusion of YODAS significantly expands the linguistic diversity and the total volume of data.
The language distribution is broken down as follows:
| Language | Emilia Duration (hours) | Emilia-YODAS Duration (hours) | Total Duration (hours) |
| :-------- | :---------------------- | :---------------------------- | :--------------------- |
| English | 46.8k | 92.2k | 139.0k |
| Chinese | 49.9k | 0.3k | 50.3k |
| German | 1.6k | 5.6k | 7.2k |
| French | 1.4k | 7.4k | 8.8k |
| Japanese | 1.7k | 1.1k | 2.8k |
| Korean | 0.2k | 7.3k | 7.5k |
| **Total** | **101.7k** | **113.9k** | **215.6k** |
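To put the table in proportion, the following sketch computes each language's share of the total from the "Total Duration" column. The figures are copied from the table (in thousands of hours) and are rounded, so the shares are approximate.

```python
# Total duration per language, in thousands of hours, from the table above.
TOTAL_HOURS_K = {
    "English": 139.0,
    "Chinese": 50.3,
    "German": 7.2,
    "French": 8.8,
    "Japanese": 2.8,
    "Korean": 7.5,
}


def language_shares(hours):
    """Return each language's fraction of the summed duration."""
    total = sum(hours.values())
    return {language: value / total for language, value in hours.items()}


shares = language_shares(TOTAL_HOURS_K)
for language, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{language:<9} {share:6.1%}")
```

By these figures, English and Chinese together account for roughly 88% of the audio, which is worth keeping in mind when sampling for multilingual training.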
---
#### **Interpretation of Scores**
The models predict raw scores for 40 emotional categories and 14 attribute dimensions. For the emotional categories, these raw scores are also used to calculate a normalized Softmax probability, indicating the relative likelihood of each emotion.
| Attribute | Range | Description |
| :-------------------- | :------- | :------------------------------------------------------- |
| **Valence** | -3 to +3 | -3: Ext. Negative, +3: Ext. Positive, 0: Neutral |
| **Arousal** | 0 to 4 | 0: Very Calm, 4: Very Excited, 2: Neutral |
| **Dominance** | -3 to +3 | -3: Ext. Submissive, +3: Ext. Dominant, 0: Neutral |
| **Age** | 0 to 6 | 0: Infant/Toddler, 2: Teenager, 4: Adult, 6: Very Old |
| **Gender** | -2 to +2 | -2: Very Masculine, +2: Very Feminine, 0: Neutral/Unsure |
| **Humor** | 0 to 4 | 0: Very Serious, 4: Very Humorous, 2: Neutral |
| **Detachment** | 0 to 4 | 0: Very Vulnerable, 4: Very Detached, 2: Neutral |
| **Confidence** | 0 to 4 | 0: Very Confident, 4: Very Hesitant, 2: Neutral |
| **Warmth** | -2 to +2 | -2: Very Cold, +2: Very Warm, 0: Neutral |
| **Expressiveness** | 0 to 4 | 0: Very Monotone, 4: Very Expressive, 2: Neutral |
| **Pitch** | 0 to 4 | 0: Very High-Pitched, 4: Very Low-Pitched, 2: Neutral |
| **Softness** | -2 to +2 | -2: Very Harsh, +2: Very Soft, 0: Neutral |
| **Authenticity** | 0 to 4 | 0: Very Artificial, 4: Very Genuine, 2: Neutral |
| **Recording Quality** | 0 to 4 | 0: Very Low, 4: Very High, 2: Decent |
| **Background Noise** | 0 to 3 | 0: No Noise, 3: Intense Noise |
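The Softmax normalization mentioned above the table can be sketched as follows: raw category scores are exponentiated and divided by their sum, so that they form a probability distribution. The category labels and score values below are illustrative placeholders, not taken from the dataset, and the exact normalization used to produce the stored probabilities (for example, any temperature) is not specified here.

```python
import math


def softmax(raw_scores):
    """Map raw per-category scores to probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable and does
    # not change the result.
    peak = max(raw_scores.values())
    exponentials = {
        label: math.exp(score - peak) for label, score in raw_scores.items()
    }
    total = sum(exponentials.values())
    return {label: value / total for label, value in exponentials.items()}


# Hypothetical raw scores for three of the 40 emotion categories.
raw = {"category_a": 2.0, "category_b": 0.5, "category_c": -1.0}
probabilities = softmax(raw)
top_label = max(probabilities, key=probabilities.get)
print(top_label, round(probabilities[top_label], 3))
```

Softmax preserves the ranking of the raw scores, so the highest raw score always gets the highest probability. The attribute dimensions in the table are reported on their own fixed ranges and are not normalized this way.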
---
#### **Citation**
If you use this dataset, please cite the original Emilia dataset paper as well as the EmoNet-Voice paper.
```bibtex
@inproceedings{emilialarge,
author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
title={Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation},
booktitle={arXiv:2501.15907},
year={2025}
}
@article{emonet_voice_2025,
author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Friedrich, Felix and Kraus, Maurice and Nadi, Kourosh and Nguyen, Huu and Kersting, Kristian and Auer, Sören},
title={EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection},
journal={arXiv preprint arXiv:2506.09827},
year={2025}
}
```
**Additional References (Speaker Embeddings Background):**
* **Orange/Speaker-wavLM-tbr** model card (timbre-focused speaker embeddings). ([Hugging Face][1])
* **WavLM**: self-supervised pre-training for robust speech & speaker identity preservation. ([arXiv][2])
* **Disentangling timbre and prosody embeddings** (Interspeech 2024). ([isca-archive.org][3])
Provider: maas
Created: 2025-10-02



