lossminimilization/EMID-Emotion-Matching

Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/lossminimilization/EMID-Emotion-Matching
Description:
---
dataset_name: EMID-Emotion-Matching
annotations_creators:
- expert-generated
language:
- en
license: cc-by-nc-sa-4.0
pretty_name: EMID Music ↔ Image Emotion Matching Pairs
tags:
- audio
- music
- image
- multimodal
- emotion
- contrastive-learning
task_categories:
- audio-classification
- image-classification
- visual-question-answering
---

# EMID-Emotion-Matching

`orrzohar/EMID-Emotion-Matching` is a derived dataset built on top of the **Emotionally paired Music and Image Dataset (EMID)** from ECNU (`ecnu-aigc/EMID`). It is designed for *music ↔ image emotion matching* with Qwen-Omni–style models.

Each example contains:

- `audio`: mono waveform stored as `datasets.Audio` (the HF Hub preview can play it)
- `sampling_rate`: sampling rate used when decoding (typically 16 kHz)
- `image`: a single image (`datasets.Image`)
- `same`: `bool`, whether the audio and image are labeled with the **same** emotion
- `emotion`: normalized image emotion tag (e.g. `amusement`, `excitement`) for positive pairs; empty string for negatives
- `question`: natural-language question used to prompt the model (several templates are mixed)
- `answer`: canonical supervision text (`yes - {emotion}` for positives, `no` for negatives)

| column          | type                        | description |
| --------------- | --------------------------- | ----------- |
| `audio`         | `datasets.Audio (16k mono)` | decoded waveform; HF UI can play it |
| `sampling_rate` | `int32`                     | explicit sample rate mirrored beside the `Audio` column |
| `image`         | `datasets.Image`            | PIL.Image-compatible object |
| `same`          | `bool`                      | `True` if the pair is emotion-aligned |
| `emotion`       | `string`                    | normalized emotion label for positives, `""` otherwise |
| `question`      | `string`                    | user prompt template |
| `answer`        | `string`                    | canonical supervision text (`yes - {emotion}` / `no`) |

Each original EMID row has one music clip and up to **three** tagged images (`Image1`, `Image2`, `Image3`). For each `(audio, image)` pair we create:

- **1 positive example**: the audio and its own tagged image (`same = True`, `emotion = image_tag`)
- **NEGATIVES_PER_POSITIVE = 1 negative example**: the same audio paired with an image drawn from a *different* emotion tag (`same = False`, `emotion = ""`)

With `MAX_SOURCE_ROWS = 4000`, this yields ~24,000 examples (4,000 rows × up to 3 images × 2 examples per image, counting positives and negatives), which we then split into:

- `train`: 19,200 examples
- `test`: 4,800 examples

## Source Data (EMID)

The base EMID dataset is described in:

- Y. Guo, J. Li, et al. "Emotionally paired Music and Image Dataset (EMID)." arXiv:2308.07622. <https://arxiv.org/abs/2308.07622>

EMID contains 10,738 unique music clips, each paired with three images in the same emotional category, plus rich annotations:

- `Audio_Filename`: unique filename of the music clip
- `genre`: letter A–M, one of 13 emotional categories
- `feeling`: distribution of free-form feelings reported by listeners (% per feeling)
- `emotion`: ratings on 11 emotional dimensions (1–9)
- `Image{1,2,3}_filename`: matched image filenames
- `Image{1,2,3}_tag`: image emotion category (e.g. `amusement`, `excitement`)
- `Image{1,2,3}_text`: GIT-generated captions
- `is_original_clip`: whether this is an original or expanded clip

For more details, see the EMID README and the paper above.

## How This Derived Dataset Was Built

The script `prepare_emid_pairs.py` performs the following steps offline:

1. Load `ecnu-aigc/EMID` (train split) and decode:
   - `Audio_Filename` with `Audio(decode=True)`
   - `Image{1,2,3}_filename` with `datasets.Image(decode=True)`
2. Optionally cap the number of source rows with `MAX_SOURCE_ROWS` (default 4000).
3. Build an **image pool** keyed by normalized emotion tags.
4. For each EMID row and each available image (up to 3 per row):
   - Create a positive pair `(audio, image, same=True, emotion=image_tag)`.
   - Sample `NEGATIVES_PER_POSITIVE` images from *different* emotion tags to form negatives.
5. Normalize the emotion strings (lowercase, replace spaces and punctuation with `_`).
6. Draw a random question from a small set of Qwen-style templates and attach it as `question`.
7. Store the mono waveform as `datasets.Audio` and the image as `datasets.Image` so that downstream scripts can call `datasets.load_dataset` without extra decoding logic.
8. Split into train/test with `TRAIN_FRACTION = 0.8`.

This yields a simple, flat structure that is convenient for SFT / contrastive training with Qwen2.5-Omni (or other multimodal LMs), without redoing negative sampling or audio/image decoding inside notebooks.

## Suggested Usage

```python
from datasets import load_dataset

ds = load_dataset("orrzohar/EMID-Emotion-Matching")
train_ds = ds["train"]
test_ds = ds["test"]

ex = train_ds[0]
audio = ex["audio"]          # dict with "array" + "sampling_rate"
sr = ex["sampling_rate"]     # int
image = ex["image"]          # PIL.Image.Image
same = ex["same"]            # bool
emotion = ex["emotion"]      # str
question = ex["question"]    # str
answer = ex["answer"]        # str
```

In the Qwen-Omni demos, we typically:

- Use `question` as the user prompt,
- Provide `audio` and `image` as multimodal inputs, and
- Supervise the model with the provided `answer` (or regenerate your own phrasing from `same`/`emotion`).

## License

This derived dataset **inherits the license** from EMID:

- **CC BY-NC-SA 4.0** (Attribution–NonCommercial–ShareAlike 4.0 International)

You **must**:

- Use the data only for **non-commercial** purposes.
- Provide appropriate **attribution** to the EMID authors and this derived dataset.
- Distribute derivative works under the **same license**.

Please refer to the full license text for details: <https://creativecommons.org/licenses/by-nc-sa/4.0/>

If you use this dataset in academic work, please cite the EMID paper and, if appropriate, this derived dataset as well.
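
As an illustration of the pair-construction steps described under "How This Derived Dataset Was Built" (image pool, positives, negatives, emotion normalization), here is a minimal sketch on toy data. It is not the actual `prepare_emid_pairs.py`: the helper names `normalize_emotion` and `build_pairs`, the toy row layout, and the choice to collapse runs of punctuation into a single `_` are all assumptions made for illustration.

```python
import random
import re
from collections import defaultdict

NEGATIVES_PER_POSITIVE = 1  # value stated in the dataset card


def normalize_emotion(tag: str) -> str:
    """Lowercase the tag and replace spaces/punctuation with '_'.

    Assumption: runs of such characters collapse to one '_' and
    leading/trailing underscores are dropped.
    """
    return re.sub(r"[^a-z0-9]+", "_", tag.strip().lower()).strip("_")


def build_pairs(rows, rng):
    """Build positive and negative (audio, image) examples.

    `rows` is a hypothetical simplified layout: each row is a dict with
    an "audio" value and an "images" list of (image, tag) tuples.
    """
    rows = list(rows)

    # Image pool keyed by normalized emotion tag.
    pool = defaultdict(list)
    for row in rows:
        for image, tag in row["images"]:
            pool[normalize_emotion(tag)].append(image)

    pairs = []
    for row in rows:
        for image, tag in row["images"]:
            emotion = normalize_emotion(tag)
            # Positive: the audio with its own tagged image.
            pairs.append({"audio": row["audio"], "image": image, "same": True,
                          "emotion": emotion, "answer": f"yes - {emotion}"})
            # Negatives: same audio, image drawn from a different emotion tag.
            other_tags = sorted(t for t in pool if t != emotion)
            if not other_tags:
                continue  # no other emotion available to sample from
            for _ in range(NEGATIVES_PER_POSITIVE):
                neg_image = rng.choice(pool[rng.choice(other_tags)])
                pairs.append({"audio": row["audio"], "image": neg_image,
                              "same": False, "emotion": "", "answer": "no"})
    return pairs


# Toy stand-ins for decoded audio and images.
toy_rows = [
    {"audio": "clip_a", "images": [("img_1", "Amusement"), ("img_2", "Excitement")]},
    {"audio": "clip_b", "images": [("img_3", "Awe")]},
]
pairs = build_pairs(toy_rows, random.Random(0))
```

With three tagged images in the toy input, the sketch produces three positives and three negatives, mirroring the 1:1 ratio that leads to roughly 24,000 examples from 4,000 source rows.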
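
The "Suggested Usage" section notes that the `answer` text can be regenerated from `same`/`emotion`. A minimal sketch of that round trip follows, which may also help when scoring model outputs. The function names `make_answer` and `parse_answer` are hypothetical, and the parser assumes the model replies in the canonical `yes - {emotion}` / `no` format described in the card.

```python
from typing import Optional, Tuple


def make_answer(same: bool, emotion: str) -> str:
    """Rebuild the canonical supervision text from the `same`/`emotion` columns."""
    return f"yes - {emotion}" if same else "no"


def parse_answer(text: str) -> Optional[Tuple[bool, str]]:
    """Parse 'yes - {emotion}' or 'no' into (same, emotion).

    Returns None when the text matches neither form. Normalized emotion
    labels use '_' rather than '-', so splitting on the first '-' is safe.
    """
    cleaned = text.strip().lower()
    if cleaned == "no":
        return (False, "")
    if cleaned.startswith("yes"):
        _, _, emotion = cleaned.partition("-")
        return (True, emotion.strip())
    return None


positive = make_answer(True, "excitement")
negative = make_answer(False, "")
```

Comparing `parse_answer(model_output)` against `(ex["same"], ex["emotion"])` gives a simple exact-match check; free-form model replies would need more lenient parsing.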

Provider: lossminimilization