albertvillanova/visual-spatial-reasoning
Source: Hugging Face. Updated 2022-12-14; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/albertvillanova/visual-spatial-reasoning
---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- machine-generated
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: Visual Spatial Reasoning
size_categories:
- 10K<n<100K
source_datasets:
- original
tags: []
task_categories:
- image-classification
task_ids: []
---
# Dataset Card for Visual Spatial Reasoning
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://ltl.mmll.cam.ac.uk/
- **Repository:** https://github.com/cambridgeltl/visual-spatial-reasoning
- **Paper:** https://arxiv.org/abs/2205.00363
- **Leaderboard:** https://paperswithcode.com/sota/visual-reasoning-on-vsr
- **Point of Contact:** https://ltl.mmll.cam.ac.uk/
### Dataset Summary
The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption correctly describes the image (True) or not (False).
### Supported Tasks and Leaderboards
We test three baselines, all supported in Hugging Face: VisualBERT [(Li et al. 2019)](https://arxiv.org/abs/1908.03557), LXMERT [(Tan and Bansal, 2019)](https://arxiv.org/abs/1908.07490), and ViLT [(Kim et al. 2021)](https://arxiv.org/abs/2102.03334). The leaderboard can be checked at [Papers With Code](https://paperswithcode.com/sota/visual-reasoning-on-vsr).
model | random split | zero-shot
:-------------|:-------------:|:-------------:
*human* | *95.4* | *95.4*
VisualBERT | 57.4 | 54.0
LXMERT | **72.5** | **63.2**
ViLT | 71.0 | 62.4
### Languages
The language in the dataset is English as spoken by the annotators. The BCP-47 code for English is `en`. [`meta_data.jsonl`](https://github.com/cambridgeltl/visual-spatial-reasoning/tree/master/data/data_files/meta_data.jsonl) contains metadata about the annotators.
## Dataset Structure
### Data Instances
Each line is an individual data point. Each `jsonl` file is of the following format:
```json
{"image": "000000050403.jpg", "image_link": "http://images.cocodataset.org/train2017/000000050403.jpg", "caption": "The teddy bear is in front of the person.", "label": 1, "relation": "in front of", "annotator_id": 31, "vote_true_validator_id": [2, 6], "vote_false_validator_id": []}
{"image": "000000401552.jpg", "image_link": "http://images.cocodataset.org/train2017/000000401552.jpg", "caption": "The umbrella is far away from the motorcycle.", "label": 0, "relation": "far away from", "annotator_id": 2, "vote_true_validator_id": [], "vote_false_validator_id": [2, 9, 1]}
```
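As a minimal sketch of reading this format, the snippet below parses JSON Lines text with the standard library only. The `SAMPLE` string is an abridged copy of the two records above (validator fields omitted for brevity), and `parse_jsonl` is a hypothetical helper name, not part of the VSR repository.

```python
import json

# Abridged copies of the two example records shown above.
SAMPLE = """\
{"image": "000000050403.jpg", "caption": "The teddy bear is in front of the person.", "label": 1, "relation": "in front of"}
{"image": "000000401552.jpg", "caption": "The umbrella is far away from the motorcycle.", "label": 0, "relation": "far away from"}
"""

def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_jsonl(SAMPLE)
for record in records:
    # `label` is 1 for True and 0 for False.
    print(record["image"], bool(record["label"]), record["relation"])
```

To read a file instead, the same helper could be applied to the file's contents, for example `parse_jsonl(open(path, encoding="utf-8").read())`, where `path` points to one of the `jsonl` files.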
### Data Fields
`image` denotes the name of the image in COCO, and `image_link` points to the image on the COCO server (so you can also access it directly). `caption` is self-explanatory. A `label` of `0` or `1` corresponds to False or True, respectively. `relation` records the spatial relation used. `annotator_id` identifies the annotator who originally wrote the caption. `vote_true_validator_id` and `vote_false_validator_id` list the annotators who voted True or False in the second-phase validation.
### Data Splits
The VSR corpus, after validation, contains 10,119 data points with high agreement. On top of these, we create two splits: (1) a random split and (2) a zero-shot split. For the random split, we randomly divide all data points into train, development, and test sets. The zero-shot split makes sure that the train, development, and test sets have no overlap of concepts (i.e., if *dog* is in the test set, it is not used for training or development). Below are some basic statistics of the two splits.
split | train | dev | test | total
:------|:--------:|:--------:|:--------:|:--------:
random | 7,083 | 1,012 | 2,024 | 10,119
zero-shot | 5,440 | 259 | 731 | 6,430
Check out [`data/`](https://github.com/cambridgeltl/visual-spatial-reasoning/tree/master/data) for more details.
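The split sizes in the table can be sanity-checked with a few lines of arithmetic. The sketch below confirms that each row sums to its total and computes each subset's share. It also shows one possible way to test the zero-shot property; `concepts_disjoint` and the concept sets passed to it are hypothetical and only illustrate the idea.

```python
# Split sizes copied from the table above.
SPLITS = {
    "random": {"train": 7083, "dev": 1012, "test": 2024, "total": 10119},
    "zero-shot": {"train": 5440, "dev": 259, "test": 731, "total": 6430},
}

def split_fractions(row):
    """Return each subset's share of the split total, rounded to 3 places."""
    parts = {name: n for name, n in row.items() if name != "total"}
    # Sanity check: the subsets must add up to the reported total.
    assert sum(parts.values()) == row["total"]
    return {name: round(n / row["total"], 3) for name, n in parts.items()}

def concepts_disjoint(train, dev, test):
    """True if no concept appears in more than one subset (zero-shot property)."""
    return not (train & dev or train & test or dev & test)

for name, row in SPLITS.items():
    print(name, split_fractions(row))

# Hypothetical concept sets, for illustration only.
print(concepts_disjoint({"cat", "bench"}, {"horse"}, {"dog"}))
```

The random split works out to roughly 70/10/20 for train/dev/test, while the zero-shot split is smaller overall and more heavily weighted toward training data.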
## Dataset Creation
### Curation Rationale
Understanding spatial relations is fundamental to achieving intelligence. Existing vision-language reasoning datasets are valuable, but they combine multiple types of challenges and can thus conflate different sources of error. The VSR corpus focuses specifically on spatial relations so that we can have accurate diagnosis and maximum interpretability.
### Source Data
#### Initial Data Collection and Normalization
**Image pair sampling.** MS COCO 2017 contains 123,287 images and has labelled the segmentation and classes of 886,284 instances (individual objects). Leveraging the segmentation, we first randomly select two concepts, then retrieve all images containing the two concepts in COCO 2017 (train and validation sets). Images that contain multiple instances of either concept are then filtered out to avoid referencing ambiguity. For the single-instance images, we also filter out any image with an instance area size < 30,000, to prevent extremely small instances. After these filtering steps, we randomly sample a pair from the remaining images. We repeat this process to obtain a large number of individual image pairs for caption generation.
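The filtering steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the authors' pipeline: the annotation dicts use simplified, hypothetical keys (`image_id`, `category`, `area`) rather than the real COCO schema, and the area threshold is assumed to apply to both instances of interest.

```python
import random
from collections import defaultdict

MIN_AREA = 30_000  # instance area threshold stated in the text

def candidate_images(annotations, concept_a, concept_b, min_area=MIN_AREA):
    """Return ids of images with exactly one large-enough instance of each concept."""
    per_image = defaultdict(lambda: defaultdict(list))
    for ann in annotations:
        per_image[ann["image_id"]][ann["category"]].append(ann["area"])
    keep = []
    for image_id, by_category in per_image.items():
        areas_a = by_category.get(concept_a, [])
        areas_b = by_category.get(concept_b, [])
        # Both concepts must be present, each as a single instance.
        if len(areas_a) != 1 or len(areas_b) != 1:
            continue
        # Assumption: both instances must meet the area threshold.
        if areas_a[0] < min_area or areas_b[0] < min_area:
            continue
        keep.append(image_id)
    return sorted(keep)

def sample_pair(image_ids, rng):
    """Randomly pick two distinct images, or None if fewer than two remain."""
    return tuple(rng.sample(image_ids, 2)) if len(image_ids) >= 2 else None

# Toy annotations, for illustration only.
TOY = [
    {"image_id": 1, "category": "dog", "area": 40_000},
    {"image_id": 1, "category": "bench", "area": 50_000},
    {"image_id": 2, "category": "dog", "area": 40_000},
    {"image_id": 2, "category": "dog", "area": 45_000},  # two dogs: ambiguous
    {"image_id": 2, "category": "bench", "area": 50_000},
    {"image_id": 3, "category": "dog", "area": 10_000},  # too small
    {"image_id": 3, "category": "bench", "area": 50_000},
    {"image_id": 4, "category": "dog", "area": 35_000},
    {"image_id": 4, "category": "bench", "area": 31_000},
    {"image_id": 5, "category": "dog", "area": 40_000},  # no bench
]

remaining = candidate_images(TOY, "dog", "bench")
pair = sample_pair(remaining, random.Random(0))
print(remaining, pair)
```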
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
**Fill in the blank: template-based caption generation.** Given a pair of images, the annotator needs to come up with a valid caption that correctly describes one image but is incorrect for the other. In this way, the annotator can focus on the key difference between the two images (which should be the spatial relation of the two objects of interest) and come up with a challenging relation that differentiates the two. Similar paradigms are also used in the annotation of previous vision-language reasoning datasets such as NLVR2 (Suhr et al., 2017, 2019) and MaRVL (Liu et al., 2021). To discourage annotators from writing modifiers and from differentiating the image pair by anything other than accurate spatial relations, we opt for a template-based classification task instead of free-form caption writing. In addition, the template-generated dataset can be easily categorised based on relations and their meta-categories.
The caption template has the format of “The `OBJ1` (is) __ the `OBJ2`.”, and the annotators are instructed to select a relation from a fixed set to fill in the slot. The copula “is” can be omitted for grammaticality. For example, for “contains”, “consists of”, and “has as a part”, “is” should be discarded in the template when extracting the final caption.
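A minimal sketch of the template, assuming the copula is dropped only for the three relations named above. The text gives these as examples, so the actual set may be larger. `make_caption` and `NO_COPULA` are hypothetical names.

```python
# Relations for which "is" is dropped; the text lists these three as examples,
# so treating them as the complete set is an assumption.
NO_COPULA = {"contains", "consists of", "has as a part"}

def make_caption(obj1, relation, obj2):
    """Fill the template "The OBJ1 (is) __ the OBJ2." with a relation."""
    copula = "" if relation in NO_COPULA else "is "
    return f"The {obj1} {copula}{relation} the {obj2}."

print(make_caption("teddy bear", "in front of", "person"))
print(make_caption("bowl", "contains", "apple"))
```

The first call reproduces the caption of the first example record in Data Instances. The second uses made-up objects to show the copula being omitted.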
The fixed set of spatial relations enables us to retain full control of the generation process. The full list of relations used is given in the table below. It contains 71 spatial relations and is adapted from the summarised relation table of Fagundes et al. (2021). We made minor changes to filter out clearly unusable relations, made relation names grammatical under our template, and reduced repeated relations. In our final dataset, 65 out of the 71 available relations are actually included (the other 6 are either not selected by annotators or are selected but the captions did not pass the validation phase).
| Category | Spatial Relations |
|-------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
| Adjacency | Adjacent to, alongside, at the side of, at the right side of, at the left side of, attached to, at the back of, ahead of, against, at the edge of |
| Directional | Off, past, toward, down, deep down*, up*, away from, along, around, from*, into, to*, across, across from, through*, down from |
| Orientation | Facing, facing away from, parallel to, perpendicular to |
| Projective | On top of, beneath, beside, behind, left of, right of, under, in front of, below, above, over, in the middle of |
| Proximity | By, close to, near, far from, far away from |
| Topological | Connected to, detached from, has as a part, part of, contains, within, at, on, in, with, surrounding, among, consists of, out of, between, inside, outside, touching |
| Unallocated | Beyond, next to, opposite to, after*, among, enclosed by |
**Second-round Human Validation.** Every annotated data point is reviewed by at least two additional human annotators (validators). In validation, given a data point (consisting of an image and a caption), the validator gives either a True or False label. We exclude data points for which fewer than 2/3 of the validators agree with the original label.
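The agreement rule can be sketched with the vote fields from Data Fields. This assumes the released `vote_true_validator_id` and `vote_false_validator_id` lists contain exactly the validator votes that the rule counts, which the card does not state explicitly. `passes_validation` is a hypothetical helper, and integer arithmetic is used so that exactly 2/3 agreement is kept without floating-point error.

```python
def passes_validation(record):
    """Keep a data point when at least 2/3 of validators agree with its label."""
    true_votes = len(record["vote_true_validator_id"])
    false_votes = len(record["vote_false_validator_id"])
    total = true_votes + false_votes
    if total == 0:
        return False  # no validator votes recorded, so nothing to agree with
    agreeing = true_votes if record["label"] == 1 else false_votes
    # agreeing / total >= 2/3, written without division.
    return 3 * agreeing >= 2 * total

examples = [
    # Vote patterns of the two example records in Data Instances.
    {"label": 1, "vote_true_validator_id": [2, 6], "vote_false_validator_id": []},
    {"label": 0, "vote_true_validator_id": [], "vote_false_validator_id": [2, 9, 1]},
    # Hypothetical record where only 1 of 3 validators agrees.
    {"label": 1, "vote_true_validator_id": [3], "vote_false_validator_id": [4, 5]},
]
kept = [passes_validation(r) for r in examples]
print(kept)
```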
In the guideline, we communicated to the validators that, for relations such as “left”/“right” and “in front of”/“behind”, they should tolerate different reference frames: i.e., if the caption is true from either the object’s or the viewer’s reference frame, it should be given a True label. A False label is assigned only when the caption is incorrect under all reference frames. This adds difficulty for the models, since they cannot naively rely on the relative locations of the objects in the images but also need to correctly identify the orientations of objects to make the best judgement.
#### Who are the annotators?
Annotators are hired from [prolific.co](https://prolific.co). We require that they (1) have at least a bachelor’s degree, (2) are fluent in English or are native speakers, and (3) have a >99% historical approval rate on the platform. All annotators are paid an hourly rate of 12 GBP. Prolific takes an extra 33% service charge, plus 20% VAT on the service charge.
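As a rough worked example of these figures, the sketch below computes the total cost of one annotator hour. It assumes the 33% service charge is calculated on the annotator's pay and the 20% VAT on that charge only. `hourly_cost` is a hypothetical helper, and `Decimal` is used to keep the arithmetic exact.

```python
from decimal import Decimal

def hourly_cost(pay=Decimal("12"), fee_rate=Decimal("0.33"), vat_rate=Decimal("0.20")):
    """Total cost per annotator hour in GBP: pay + service charge + VAT on the charge."""
    fee = pay * fee_rate   # service charge, assumed to be a share of the pay
    vat = fee * vat_rate   # VAT applies to the service charge only
    return pay + fee + vat

total = hourly_cost()
print(total)
```

Under these assumptions, 12 GBP of pay comes with a 3.96 GBP service charge and 0.792 GBP of VAT, for 16.752 GBP per hour in total.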
For caption generation, we release the task in batches of 200 instances, and the annotator is required to finish a batch in 80 minutes. An annotator cannot take more than one batch per day. In this way we have a diverse set of annotators and can also prevent annotators from becoming fatigued. For second-round validation, we group 500 data points in one batch, and an annotator is asked to label each batch in 90 minutes.
In total, 24 annotators participated in caption generation and 26 participated in validation. The annotators have diverse demographic backgrounds: they were born in 13 different countries, live in 13 different countries, and have 14 different nationalities. 57.4% of the annotators identify themselves as female and 42.6% as male.
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
This project is licensed under the [Apache-2.0 License](https://github.com/cambridgeltl/visual-spatial-reasoning/blob/master/LICENSE).
### Citation Information
```bibtex
@article{Liu2022VisualSR,
title={Visual Spatial Reasoning},
author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
journal={ArXiv},
year={2022},
volume={abs/2205.00363}
}
```
### Contributions
Thanks to [@juletx](https://github.com/juletx) for adding this dataset.