Extended datasets from MM-IMDB and Ads-Parallelity dataset with the features from Google Cloud Vision API

Indexed from NIAID Data Ecosystem on 2026-03-14

Download link:

https://zenodo.org/record/7050923

Description:

These datasets extend MM-IMDB [Arevalo+ ICLRW'17] and Ads-Parallelity [Zhang+ BMVC'18] with features from the Google Cloud Vision API. They are stored in JSON Lines (jsonl) format.

Abstract (from our paper): There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems with the mid-fusion model structure arise, such as an increase in the dimensionality of the concatenated multimodal features and missing modalities. To address these problems, we propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM2S2). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b) intra-modality residual attention (IntraMRA) to capture the importance of the elements in a modality, and (c) inter-modality residual attention (InterMRA) to further enhance the importance of elements with modality-level granularity. Our concept exhibits performance that is comparable to or better than the previous set-aware models. Furthermore, we demonstrate that the visualization of the learned InterMRA and IntraMRA weights can provide an interpretation of the prediction results.

Dataset (MM-IMDB and Ads-Parallelity): We extended two multimodal datasets, namely MM-IMDB [Arevalo+ ICLRW'17] and Ads-Parallelity [Zhang+ BMVC'18], for the empirical experiments. The MM-IMDB dataset contains 25,925 movies with multiple labels (genres). We used the original split provided in the dataset and reported the F1 scores (micro, macro, and samples) on the test set.
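The three F1 averages named above differ only in where the averaging happens. The following is a minimal sketch using the standard definitions (the same ones behind the `average` options `micro`, `macro`, and `samples` of scikit-learn's `f1_score`); it is not the authors' evaluation code, and the genre names and predictions are made up for illustration.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true: list[set[str]], y_pred: list[set[str]]) -> dict[str, float]:
    """Micro, macro, and samples-averaged F1 for multi-label predictions."""
    labels = sorted(set().union(*y_true, *y_pred))
    pairs = list(zip(y_true, y_pred))

    # Per-label (tp, fp, fn) counts across all samples.
    counts = [
        (
            sum(1 for t, p in pairs if label in t and label in p),
            sum(1 for t, p in pairs if label not in t and label in p),
            sum(1 for t, p in pairs if label in t and label not in p),
        )
        for label in labels
    ]

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*(sum(col) for col in zip(*counts))) if counts else 0.0
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / len(counts) if counts else 0.0
    # Samples: compute F1 per sample, then take the mean over samples.
    per_sample = [f1(len(t & p), len(p - t), len(t - p)) for t, p in pairs]
    samples = sum(per_sample) / len(per_sample) if per_sample else 0.0

    return {"micro": micro, "macro": macro, "samples": samples}


# Made-up example: three movies, three genres.
y_true = [{"Drama", "Romance"}, {"Comedy"}, {"Drama"}]
y_pred = [{"Drama"}, {"Comedy", "Drama"}, {"Drama"}]
scores = multilabel_f1(y_true, y_pred)
print(scores)
```

Macro averaging weights every genre equally, so a rare genre that is never predicted pulls the score down more than it does under micro averaging.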
The Ads-Parallelity dataset contains 670 images and slogans from persuasive advertisements, and is intended for understanding the implicit relationship (parallel and non-parallel) between these two modalities. A binary classification task is used to predict whether the text and image in the same ad convey the same message.

We transformed the following multimodal information (i.e., visual, textual, and categorical data) into textual tokens and fed these into our proposed model. For the visual features, we used the Google Cloud Vision API to obtain the following four pieces of information as tokens: (1) text from the OCR, (2) category labels from the label detection, (3) object tags from the object detection, and (4) the number of faces from the facial detection. We input the label and object detection results as sequences ordered by confidence, as obtained from the API. We describe the visual, textual, and categorical features of each dataset below.

MM-IMDB: We used the title and plot of movies as the textual features, and the aforementioned API results based on poster images as visual features.

Ads-Parallelity: We used the same API-based visual features as in MM-IMDB. Furthermore, we used textual and categorical features consisting of textual inputs of transcriptions and messages, and categorical inputs of natural and text concrete images.
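As a rough illustration of the token construction described above, the sketch below parses JSON Lines records and builds one token sequence per visual feature, with labels and objects ordered by confidence. The field names (`ocr_text`, `labels`, `objects`, `num_faces`, `description`, `name`, `score`) and the sample record are assumptions made for this example; the actual schema of the released jsonl files may differ and should be checked against the data.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def by_confidence(items: list[dict], name_key: str) -> list[str]:
    """Return item names sorted by descending confidence score."""
    ranked = sorted(items, key=lambda item: item["score"], reverse=True)
    return [item[name_key] for item in ranked]


def visual_token_sequences(record: dict) -> dict[str, list[str]]:
    """Build one token sequence per visual feature (hypothetical field names)."""
    return {
        "ocr": record.get("ocr_text", "").split(),
        "labels": by_confidence(record.get("labels", []), "description"),
        "objects": by_confidence(record.get("objects", []), "name"),
        "faces": [str(record.get("num_faces", 0))],
    }


# Made-up record standing in for one line of a jsonl file.
sample_lines = [
    '{"ocr_text": "COMING SOON", '
    '"labels": [{"description": "Poster", "score": 0.91}, '
    '{"description": "Movie", "score": 0.97}], '
    '"objects": [{"name": "Person", "score": 0.88}], '
    '"num_faces": 2}',
]

sequences = [visual_token_sequences(r) for r in read_jsonl(sample_lines)]
print(sequences[0])
```

To read a real file, pass an open file handle to `read_jsonl` in place of `sample_lines`, since iterating over a text file yields one line at a time.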

Created:

2023-02-24
