lossminimilization/EMID-Emotion-Matching

Name: lossminimilization/EMID-Emotion-Matching
Creator: lossminimilization
Published: 2026-03-20 14:27:51
License: CC BY-NC-SA 4.0

Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/lossminimilization/EMID-Emotion-Matching


Description:

---
dataset_name: EMID-Emotion-Matching
annotations_creators:
- expert-generated
language:
- en
license: cc-by-nc-sa-4.0
pretty_name: EMID Music ↔ Image Emotion Matching Pairs
tags:
- audio
- music
- image
- multimodal
- emotion
- contrastive-learning
task_categories:
- audio-classification
- image-classification
- visual-question-answering
---

# EMID-Emotion-Matching

`orrzohar/EMID-Emotion-Matching` is a derived dataset built on top of the **Emotionally paired Music and Image Dataset (EMID)** from ECNU (`ecnu-aigc/EMID`). It is designed for *music ↔ image emotion matching* with Qwen-Omni-style models.

Each example contains:

- `audio`: mono waveform stored as `datasets.Audio` (HF Hub preview can play it)
- `sampling_rate`: sampling rate used when decoding (typically 16 kHz)
- `image`: a single image (`datasets.Image`)
- `same`: `bool`, whether the audio and image are labeled with the **same** emotion
- `emotion`: normalized image emotion tag (e.g. `amusement`, `excitement`) for positive pairs; empty string for negatives
- `question`: natural-language question used to prompt the model (several templates are mixed)
- `answer`: canonical supervision text (`yes - {emotion}` for positives, `no` for negatives)

| column          | type                        | description                                              |
| --------------- | --------------------------- | -------------------------------------------------------- |
| `audio`         | `datasets.Audio (16k mono)` | decoded waveform; HF UI can play it                      |
| `sampling_rate` | `int32`                     | explicit sample rate mirrored beside the `Audio` column  |
| `image`         | `datasets.Image`            | PIL.Image-compatible object                              |
| `same`          | `bool`                      | `True` if the pair is emotion-aligned                    |
| `emotion`       | `string`                    | normalized emotion label for positives, `""` otherwise   |
| `question`      | `string`                    | user prompt template                                     |
| `answer`        | `string`                    | canonical supervision text (`yes - {emotion}` / `no`)    |

The original EMID row has one music clip and up to **three** tagged images (`Image1`, `Image2`, `Image3`).
For each `(audio, image)` pair we create:

- **1 positive example**: the audio and its own tagged image (`same = True`, `emotion = image_tag`)
- **`NEGATIVES_PER_POSITIVE = 1` negative example**: the same audio paired with an image drawn from a *different* emotion tag (`same = False`, `emotion = ""`)

With `MAX_SOURCE_ROWS = 4000`, this yields ~24,000 examples (positives + negatives), which we then split into:

- `train`: 19,200 examples
- `test`: 4,800 examples

## Source Data (EMID)

The base EMID dataset is described in:

- **Emotionally paired Music and Image Dataset (EMID)**
  *Y. Guo, J. Li, et al.*
  arXiv:2308.07622
  <https://arxiv.org/abs/2308.07622>

EMID contains 10,738 unique music clips, each paired with three images in the same emotional category, plus rich annotations:

- `Audio_Filename`: unique filename of the music clip
- `genre`: letter A–M, one of 13 emotional categories
- `feeling`: distribution of free-form feelings reported by listeners (% per feeling)
- `emotion`: ratings on 11 emotional dimensions (1–9)
- `Image{1,2,3}_filename`: matched image filenames
- `Image{1,2,3}_tag`: image emotion category (e.g. `amusement`, `excitement`)
- `Image{1,2,3}_text`: GIT-generated captions
- `is_original_clip`: whether this is an original or expanded clip

For more details, see the EMID README and the paper above.

## How This Derived Dataset Was Built

The script `prepare_emid_pairs.py` performs the following steps offline:

1. Load `ecnu-aigc/EMID` (train split) and decode:
   - `Audio_Filename` with `Audio(decode=True)`
   - `Image{1,2,3}_filename` with `datasets.Image(decode=True)`
2. Optionally cap the number of source rows with `MAX_SOURCE_ROWS` (default 4000).
3. Build an **image pool** keyed by normalized emotion tags.
4. For each EMID row and each available image (up to 3 per row):
   - Create a positive pair `(audio, image, same=True, emotion=image_tag)`.
   - Sample `NEGATIVES_PER_POSITIVE` images from *different* emotion tags to form negatives.
5. Normalize the emotion strings (lowercase, replace spaces and punctuation with `_`).
6. Draw a random question from a small set of Qwen-style templates and attach it as `question`.
7. Store the mono waveform as `datasets.Audio` and the image as `datasets.Image` so that downstream scripts can call `datasets.load_dataset` without extra decoding logic.
8. Split into train/test with `TRAIN_FRACTION = 0.8`.

This yields a simple, flat structure that is convenient for SFT / contrastive training with Qwen2.5-Omni (or other multimodal LMs), without re-doing negative sampling or audio/image decoding inside notebooks.

## Suggested Usage

```python
from datasets import load_dataset

ds = load_dataset("orrzohar/EMID-Emotion-Matching")
train_ds = ds["train"]
test_ds = ds["test"]

ex = train_ds[0]
audio = ex["audio"]           # dict with "array" + "sampling_rate"
sr = ex["sampling_rate"]      # int
image = ex["image"]           # PIL.Image.Image
same = ex["same"]             # bool
emotion = ex["emotion"]       # str
question = ex["question"]     # str
answer = ex["answer"]         # str
```

In the Qwen-Omni demos, we typically:

- Use `question` as the user prompt,
- Provide `audio` and `image` as multimodal inputs, and
- Supervise the model with the provided `answer` (or regenerate your own phrasing from `same`/`emotion`).

## License

This derived dataset **inherits the license** from EMID:

- **CC BY-NC-SA 4.0** (Attribution-NonCommercial-ShareAlike 4.0 International)

You **must**:

- Use the data only for **non-commercial** purposes.
- Provide appropriate **attribution** to the EMID authors and this derived dataset.
- Distribute derivative works under the **same license**.

Please refer to the full license text for details: <https://creativecommons.org/licenses/by-nc-sa/4.0/>

If you use this dataset in academic work, please cite the EMID paper and, if appropriate, this derived dataset as well.
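
As an illustration of the steps listed under "How This Derived Dataset Was Built", the pairing logic can be sketched in plain Python. This is a minimal sketch, not the contents of `prepare_emid_pairs.py`, which is not reproduced in this card. The function names (`normalize_tag`, `build_pairs`, `split_examples`), the toy `rows` structure, the choice to collapse runs of punctuation into a single `_`, and the shuffle before splitting are all assumptions made for the example. Audio and images are stood in for by string IDs so the block runs without downloading anything.

```python
import random
import re


def normalize_tag(tag: str) -> str:
    """Lowercase a tag and replace spaces/punctuation with "_".

    Assumption: runs of non-alphanumeric characters collapse to one "_"
    and leading/trailing underscores are dropped.
    """
    return re.sub(r"[^a-z0-9]+", "_", tag.strip().lower()).strip("_")


def build_pairs(rows, negatives_per_positive=1, seed=0):
    """Build positive and negative (audio, image) examples.

    rows: list of (audio_id, [(image_id, raw_tag), ...]) tuples
          (a hypothetical stand-in for decoded EMID rows).
    """
    rng = random.Random(seed)

    # Image pool keyed by normalized emotion tag.
    pool = {}
    for _, images in rows:
        for image_id, raw_tag in images:
            pool.setdefault(normalize_tag(raw_tag), []).append(image_id)

    examples = []
    for audio_id, images in rows:
        for image_id, raw_tag in images:
            tag = normalize_tag(raw_tag)
            # Positive: the audio with its own tagged image.
            examples.append({
                "audio": audio_id, "image": image_id,
                "same": True, "emotion": tag, "answer": f"yes - {tag}",
            })
            # Negatives: same audio, image drawn from a different tag.
            other_tags = sorted(t for t in pool if t != tag)
            for _ in range(negatives_per_positive):
                if not other_tags:
                    break  # only one emotion tag in the pool
                neg_tag = rng.choice(other_tags)
                examples.append({
                    "audio": audio_id, "image": rng.choice(pool[neg_tag]),
                    "same": False, "emotion": "", "answer": "no",
                })
    return examples


def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it at train_fraction."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy input: two clips with two tagged images each.
rows = [
    ("clip_a", [("img_1", "Amusement"), ("img_2", "Excitement")]),
    ("clip_b", [("img_3", "Awe"), ("img_4", "Amusement")]),
]
examples = build_pairs(rows)
print(len(examples))  # 4 positives + 4 negatives = 8
```

The same arithmetic matches the reported size if every capped source row carries all three images: 4,000 rows × 3 images × (1 positive + 1 negative) = 24,000 examples, and an 0.8 split gives 19,200 / 4,800.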

