lossminimilization/EMID-Emotion-Matching
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/lossminimilization/EMID-Emotion-Matching
Resource description:
---
dataset_name: EMID-Emotion-Matching
annotations_creators:
- expert-generated
language:
- en
license: cc-by-nc-sa-4.0
pretty_name: EMID Music ↔ Image Emotion Matching Pairs
tags:
- audio
- music
- image
- multimodal
- emotion
- contrastive-learning
task_categories:
- audio-classification
- image-classification
- visual-question-answering
---
# EMID-Emotion-Matching
`orrzohar/EMID-Emotion-Matching` is a derived dataset built on top of
the **Emotionally paired Music and Image Dataset (EMID)** from ECNU (`ecnu-aigc/EMID`).
It is designed for *music ↔ image emotion matching* with Qwen-Omni–style models.
Each example contains:
- `audio`: mono waveform stored as `datasets.Audio` (HF Hub preview can play it)
- `sampling_rate`: sampling rate used when decoding (typically 16 kHz)
- `image`: a single image (`datasets.Image`)
- `same`: `bool`, whether the audio and image are labeled with the **same** emotion
- `emotion`: normalized image emotion tag (e.g. `amusement`, `excitement`) for positive pairs; empty string for negatives
- `question`: natural-language question used to prompt the model (several templates are mixed)
- `answer`: canonical supervision text (`yes - {emotion}` for positives, `no` for negatives)
| column | type | description |
| -------------- | ------------------------------- | ----------- |
| `audio` | `datasets.Audio (16k mono)` | decoded waveform; HF UI can play it |
| `sampling_rate`| `int32` | explicit sample rate mirrored beside the `Audio` column |
| `image` | `datasets.Image` | PIL.Image-compatible object |
| `same` | `bool` | `True` if the pair is emotion-aligned |
| `emotion` | `string` | normalized emotion label for positives, `""` otherwise |
| `question` | `string` | user prompt template |
| `answer` | `string` | canonical supervision text (`yes - {emotion}` / `no`) |
Each original EMID row has one music clip and up to **three** tagged images
(`Image1`, `Image2`, `Image3`). For each `(audio, image)` pair we create:
- **1 positive example**: the audio and its own tagged image (`same = True`, `emotion = image_tag`)
- **NEGATIVES_PER_POSITIVE = 1 negative example**: the same audio paired with an image drawn
from a *different* emotion tag (`same = False`, `emotion = ""`)
With `MAX_SOURCE_ROWS = 4000`, this yields ~24,000 examples (positives + negatives),
which we then split into:
- `train`: 19,200 examples
- `test`: 4,800 examples
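The split sizes above follow from simple arithmetic. The sketch below reproduces them, assuming every one of the 4,000 source rows contributes all three images (the card says "up to three", which is presumably why the total is quoted as approximate). `IMAGES_PER_ROW` is a name introduced here for illustration and is not confirmed to exist in the build script.

```python
# Constants named in this card.
MAX_SOURCE_ROWS = 4000
NEGATIVES_PER_POSITIVE = 1
TRAIN_FRACTION = 0.8

# Assumption for illustration: every row has all three tagged images.
IMAGES_PER_ROW = 3

positives = MAX_SOURCE_ROWS * IMAGES_PER_ROW      # one positive per (audio, image)
negatives = positives * NEGATIVES_PER_POSITIVE    # one negative per positive
total = positives + negatives

train_size = round(total * TRAIN_FRACTION)
test_size = total - train_size

print(f"total={total} train={train_size} test={test_size}")
```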
## Source Data (EMID)
The base EMID dataset is described in:
- **Emotionally paired Music and Image Dataset (EMID)**
*Y. Guo, J. Li, et al.*
arXiv:2308.07622 — "Emotionally paired Music and Image Dataset (EMID)"
<https://arxiv.org/abs/2308.07622>
EMID contains 10,738 unique music clips, each paired with three images in the same
emotional category, plus rich annotations:
- `Audio_Filename`: unique filename of the music clip
- `genre`: letter A–M, one of 13 emotional categories
- `feeling`: distribution of free-form feelings reported by listeners (% per feeling)
- `emotion`: ratings on 11 emotional dimensions (1–9)
- `Image{1,2,3}_filename`: matched image filenames
- `Image{1,2,3}_tag`: image emotion category (e.g. `amusement`, `excitement`)
- `Image{1,2,3}_text`: GIT-generated captions
- `is_original_clip`: whether this is an original or expanded clip
For more details, see the EMID README and the paper above.
## How This Derived Dataset Was Built
The script `prepare_emid_pairs.py` performs the following steps offline:
1. Load `ecnu-aigc/EMID` (train split) and decode:
- `Audio_Filename` with `datasets.Audio(decode=True)`
- `Image{1,2,3}_filename` with `datasets.Image(decode=True)`
2. Optionally cap the number of source rows with `MAX_SOURCE_ROWS` (default 4000).
3. Build an **image pool** keyed by normalized emotion tags.
4. For each EMID row and each available image (up to 3 per row):
- Create a positive pair `(audio, image, same=True, emotion=image_tag)`.
- Sample `NEGATIVES_PER_POSITIVE` images from *different* emotion tags to form negatives.
5. Normalize the emotion strings (lowercase, replace spaces and punctuation with `_`).
6. Draw a random question from a small set of Qwen-style templates and attach it as `question`.
7. Store the mono waveform as `datasets.Audio` and the image as `datasets.Image` so
that downstream scripts can call `datasets.load_dataset` without extra decoding logic.
8. Split into train/test with `TRAIN_FRACTION = 0.8`.
This yields a simple, flat structure that is convenient for SFT / contrastive training
with Qwen2.5-Omni (or other multimodal LMs), without re-doing negative sampling or
audio/image decoding inside notebooks.
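Steps 3 to 5 can be sketched in plain Python. This is not the code of `prepare_emid_pairs.py`, which is not reproduced in this card. The helper names (`normalize_tag`, `build_image_pool`, `sample_negative`, `make_pairs`) are hypothetical, the rows are toy dictionaries holding filenames rather than decoded media, and two details are assumptions: runs of punctuation collapse into a single `_`, and negatives are drawn by first picking a different tag uniformly at random. Question templates (step 6) are omitted because the card does not list them.

```python
import random
import re
from collections import defaultdict


def normalize_tag(tag: str) -> str:
    """Lowercase and replace spaces/punctuation with "_" (exact rule assumed)."""
    return re.sub(r"[^a-z0-9]+", "_", tag.strip().lower()).strip("_")


def build_image_pool(rows):
    """Map each normalized emotion tag to the images that carry it."""
    pool = defaultdict(list)
    for row in rows:
        for i in (1, 2, 3):
            tag = row.get(f"Image{i}_tag")
            image = row.get(f"Image{i}_filename")
            if tag and image is not None:
                pool[normalize_tag(tag)].append(image)
    return pool


def sample_negative(pool, positive_tag, rng):
    """Pick one image whose emotion tag differs from positive_tag."""
    other_tags = [t for t in pool if t != positive_tag and pool[t]]
    if not other_tags:
        raise ValueError("need at least two emotion tags to sample a negative")
    return rng.choice(pool[rng.choice(other_tags)])


def make_pairs(rows, negatives_per_positive=1, seed=0):
    """Emit one positive plus N negatives for every (audio, image) pair."""
    rng = random.Random(seed)
    pool = build_image_pool(rows)
    examples = []
    for row in rows:
        for i in (1, 2, 3):
            tag = row.get(f"Image{i}_tag")
            image = row.get(f"Image{i}_filename")
            if not tag or image is None:
                continue  # rows may have fewer than three images
            tag = normalize_tag(tag)
            examples.append({
                "audio": row["Audio_Filename"], "image": image,
                "same": True, "emotion": tag, "answer": f"yes - {tag}",
            })
            for _ in range(negatives_per_positive):
                examples.append({
                    "audio": row["Audio_Filename"],
                    "image": sample_negative(pool, tag, rng),
                    "same": False, "emotion": "", "answer": "no",
                })
    return examples
```

Picking the tag first and the image second gives every other emotion an equal chance regardless of how many images it has. Sampling uniformly over all non-matching images would instead favor the larger categories, and either choice is compatible with the description above.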
## Suggested Usage
```python
from datasets import load_dataset
ds = load_dataset("orrzohar/EMID-Emotion-Matching")
train_ds = ds["train"]
test_ds = ds["test"]
ex = train_ds[0]
audio = ex["audio"] # dict with "array" + "sampling_rate"
sr = ex["sampling_rate"] # int
image = ex["image"] # PIL.Image.Image
same = ex["same"] # bool
emotion = ex["emotion"] # str
question = ex["question"] # str
answer = ex["answer"] # str
```
In the Qwen-Omni demos, we typically:
- Use `question` as the user prompt,
- Provide `audio` and `image` as multimodal inputs, and
- Supervise the model with the provided `answer` (or regenerate your own phrasing from `same`/`emotion`).
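If you prefer your own target phrasing, the supervision text can be rebuilt from `same` and `emotion` alone. The sketch below shows the canonical form described in this card next to one alternative wording. Both function names and the alternative sentence are illustrative assumptions, not part of the dataset or its build script.

```python
def canonical_answer(same: bool, emotion: str) -> str:
    """Rebuild the card's canonical target: "yes - {emotion}" or "no"."""
    return f"yes - {emotion}" if same else "no"


def sentence_answer(same: bool, emotion: str) -> str:
    """Hypothetical alternative phrasing built from the same two fields."""
    if same:
        return f"Yes, both convey {emotion.replace('_', ' ')}."
    return "No, they convey different emotions."


print(canonical_answer(True, "amusement"))
print(sentence_answer(False, ""))
```

Applied with `datasets.Dataset.map`, a function like this would overwrite or add an answer column without touching the audio or image data.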
## License
This derived dataset **inherits the license** from EMID:
- **CC BY-NC-SA 4.0** (Attribution–NonCommercial–ShareAlike 4.0 International)
You **must**:
- Use the data only for **non-commercial** purposes.
- Provide appropriate **attribution** to the EMID authors and this derived dataset.
- Distribute derivative works under the **same license**.
Please refer to the full license text for details:
<https://creativecommons.org/licenses/by-nc-sa/4.0/>
If you use this dataset in academic work, please cite the EMID paper and, if appropriate,
this derived dataset as well.
Provider: lossminimilization



