VRD (Visual Relationship Detection dataset)

Name: VRD (Visual Relationship Detection dataset)
Creator: OpenDataLab
Published: 2026-05-24 05:30:03
License: 暂无描述

OpenDataLab2026-05-24 更新2024-05-09 收录

下载链接：

https://opendatalab.org.cn/OpenDataLab/VRD

下载链接

链接失效反馈

官方服务：

资源简介：

一个包含 5000 张图像和 3799.3 万个关系的数据集。该数据集包含 100 个对象类别和 70 个将这些对象连接在一起的谓词类别。视觉关系捕捉图像中对象对之间的各种交互（例如“骑自行车的人”和“推自行车的人”）。因此，可能的关系集非常大，很难为所有可能的关系获得足够的训练样本。由于这个限制，以前关于视觉关系检测的工作集中在预测少数关系上。尽管大多数关系并不频繁，但它们的对象（例如“man”和“bicycle”）和谓词（例如“riding”和“push”）独立出现的频率更高。我们提出了一个模型，该模型使用这种洞察力单独训练对象和谓词的视觉模型，然后将它们组合在一起以预测每个图像的多个关系。我们通过利用语义词嵌入中的语言先验来微调预测关系的可能性，从而改进先前的工作。我们的模型可以扩展以从几个示例中预测数千种类型的关系。此外，我们将预测关系中的对象定位为图像中的边界框。我们进一步证明了理解关系可以改进基于内容的图像检索。

This is a dataset containing 5,000 images and 37.993 million relational triples. This dataset encompasses 100 object categories and 70 predicate categories that connect pairs of these objects. Visual relationships capture diverse interactions between object pairs in images, such as "person riding bicycle" and "person pushing bicycle". Consequently, the total set of possible relational triples is extremely large, making it infeasible to collect sufficient training samples for every potential combination. Due to this limitation, prior research on visual relationship detection has focused on predicting only a small subset of relational categories. Although most individual relational triples occur infrequently, their constituent components—namely objects (e.g., "man" and "bicycle") and predicates (e.g., "riding" and "pushing")—appear much more frequently in isolation. We propose a model that leverages this insight by first training separate visual models for objects and predicates, then combining these models to predict multiple relational triples for each input image. We further advance prior work by leveraging linguistic priors encoded in semantic word embeddings to fine-tune the likelihoods of the predicted relational triples. Our model can be scaled to predict thousands of distinct relational types using only a small number of training examples. Additionally, our model localizes the objects involved in each predicted relational triple as bounding boxes within the source image. Finally, we demonstrate that modeling visual relationships can further improve the performance of content-based image retrieval.

提供机构：

OpenDataLab

创建时间：

2022-04-29

搜集汇总

数据集介绍