juletxara/visual-spatial-reasoning

Name: juletxara/visual-spatial-reasoning
Creator: juletxara
Published: 2022-08-11 20:11:21
License: no description provided

Hugging Face | Updated 2022-08-11 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/juletxara/visual-spatial-reasoning


Resource description:

---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- machine-generated
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: Visual Spatial Reasoning
size_categories:
- 10K<n<100K
source_datasets:
- original
tags: []
task_categories:
- image-classification
task_ids: []
---

# Dataset Card for Visual Spatial Reasoning

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://ltl.mmll.cam.ac.uk/
- **Repository:** https://github.com/cambridgeltl/visual-spatial-reasoning
- **Paper:** https://arxiv.org/abs/2205.00363
- **Leaderboard:** https://paperswithcode.com/sota/visual-reasoning-on-vsr
- **Point of Contact:** https://ltl.mmll.cam.ac.uk/

### Dataset Summary

The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels.
Each caption describes the spatial relation of two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption correctly describes the image (True) or not (False).

### Supported Tasks and Leaderboards

We test three baselines, all supported in Hugging Face. They are VisualBERT [(Li et al. 2019)](https://arxiv.org/abs/1908.03557), LXMERT [(Tan and Bansal, 2019)](https://arxiv.org/abs/1908.07490) and ViLT [(Kim et al. 2021)](https://arxiv.org/abs/2102.03334). The leaderboard can be checked at [Papers With Code](https://paperswithcode.com/sota/visual-reasoning-on-vsr).

model | random split | zero-shot
:-------------|:-------------:|:-------------:
*human* | *95.4* | *95.4*
VisualBERT | 57.4 | 54.0
LXMERT | **72.5** | **63.2**
ViLT | 71.0 | 62.4

### Languages

The language in the dataset is English as spoken by the annotators. The BCP-47 code for English is en. [`meta_data.csv`](https://github.com/cambridgeltl/visual-spatial-reasoning/tree/master/data/data_files/meta_data.jsonl) contains metadata about the annotators.

## Dataset Structure

### Data Instances

Each line is an individual data point. Each `jsonl` file has the following format:

```json
{"image": "000000050403.jpg", "image_link": "http://images.cocodataset.org/train2017/000000050403.jpg", "caption": "The teddy bear is in front of the person.", "label": 1, "relation": "in front of", "annotator_id": 31, "vote_true_validator_id": [2, 6], "vote_false_validator_id": []}
{"image": "000000401552.jpg", "image_link": "http://images.cocodataset.org/train2017/000000401552.jpg", "caption": "The umbrella is far away from the motorcycle.", "label": 0, "relation": "far away from", "annotator_id": 2, "vote_true_validator_id": [], "vote_false_validator_id": [2, 9, 1]}
```

### Data Fields

`image` denotes the name of the image in COCO and `image_link` points to the image on the COCO server (so you can also access it directly). `caption` is self-explanatory.
A `label` of `0` or `1` corresponds to False or True, respectively. `relation` records the spatial relation used. `annotator_id` points to the annotator who originally wrote the caption. `vote_true_validator_id` and `vote_false_validator_id` list the annotators who voted True or False in the second-phase validation.

### Data Splits

The VSR corpus, after validation, contains 10,119 data points with high agreement. On top of these, we create two splits: (1) a random split and (2) a zero-shot split. For the random split, we randomly divide all data points into train, development, and test sets. The zero-shot split makes sure that the train, development, and test sets have no overlap of concepts (i.e., if *dog* is in the test set, it is not used for training or development). Below are some basic statistics of the two splits.

split | train | dev | test | total
:------|:--------:|:--------:|:--------:|:--------:
random | 7,083 | 1,012 | 2,024 | 10,119
zero-shot | 5,440 | 259 | 731 | 6,430

Check out [`data/`](https://github.com/cambridgeltl/visual-spatial-reasoning/tree/master/data) for more details.

## Dataset Creation

### Curation Rationale

Understanding spatial relations is fundamental to achieving intelligence. Existing vision-language reasoning datasets are great, but they combine multiple types of challenges and can thus conflate different sources of error. The VSR corpus focuses specifically on spatial relations so we can have accurate diagnosis and maximum interpretability.

### Source Data

#### Initial Data Collection and Normalization

**Image pair sampling.** MS COCO 2017 contains 123,287 images and has labelled the segmentation and classes of 886,284 instances (individual objects). Leveraging the segmentation, we first randomly select two concepts, then retrieve all images containing the two concepts in COCO 2017 (train and validation sets). Then images that contain multiple instances of either concept are filtered out to avoid referential ambiguity.
For the single-instance images, we also filter out any image with an instance area size < 30,000, to prevent extremely small instances. After these filtering steps, we randomly sample a pair from the remaining images. We repeat this process to obtain a large number of individual image pairs for caption generation.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

**Fill in the blank: template-based caption generation.** Given a pair of images, the annotator needs to come up with a valid caption that correctly describes one image but is incorrect for the other. In this way, the annotator can focus on the key difference between the two images (which should be the spatial relation of the two objects of interest) and come up with a challenging relation that differentiates the two. Similar paradigms are also used in the annotation of previous vision-language reasoning datasets such as NLVR2 (Suhr et al., 2017, 2019) and MaRVL (Liu et al., 2021). To keep annotators from writing modifiers and from differentiating the image pair by anything other than accurate spatial relations, we opt for a template-based classification task instead of free-form caption writing. In addition, the template-generated dataset can be easily categorised by relation and meta-category.

The caption template has the format “The `OBJ1` (is) __ the `OBJ2`.”, and the annotators are instructed to select a relation from a fixed set to fill in the slot. The copula “is” can be omitted for grammaticality. For example, for “contains”, “consists of”, and “has as a part”, “is” should be discarded from the template when extracting the final caption.

The fixed set of spatial relations enables us to keep full control of the generation process. The full list of relations used is given in the table below. It contains 71 spatial relations and is adapted from the summarised relation table of Fagundes et al. (2021).
We made minor changes to filter out clearly unusable relations, made relation names grammatical under our template, and reduced repeated relations. In our final dataset, 65 out of the 71 available relations are actually included (the other 6 were either not selected by annotators or were selected but the captions did not pass the validation phase).

| Category | Spatial Relations |
|-------------|-------------------|
| Adjacency | Adjacent to, alongside, at the side of, at the right side of, at the left side of, attached to, at the back of, ahead of, against, at the edge of |
| Directional | Off, past, toward, down, deep down*, up*, away from, along, around, from*, into, to*, across, across from, through*, down from |
| Orientation | Facing, facing away from, parallel to, perpendicular to |
| Projective | On top of, beneath, beside, behind, left of, right of, under, in front of, below, above, over, in the middle of |
| Proximity | By, close to, near, far from, far away from |
| Topological | Connected to, detached from, has as a part, part of, contains, within, at, on, in, with, surrounding, among, consists of, out of, between, inside, outside, touching |
| Unallocated | Beyond, next to, opposite to, after*, among, enclosed by |

**Second-round Human Validation.** Every annotated data point is reviewed by at least two additional human annotators (validators). In validation, given a data point (consisting of an image and a caption), the validator gives either a True or a False label. We exclude data points for which fewer than 2/3 of the validators agree with the original label.

In the guideline, we communicated to the validators that, for relations such as “left”/“right” and “in front of”/“behind”, they should tolerate different reference frames: i.e., if the caption is true from either the object’s or the viewer’s reference, it should be given a True label.
Only when the caption is incorrect under all reference frames is a False label assigned. This adds difficulty for the models, since they cannot naively rely on the relative locations of the objects in the images but also need to correctly identify the orientations of objects to make the best judgement.

#### Who are the annotators?

Annotators are hired from [prolific.co](https://prolific.co). We require that they (1) have at least a bachelor’s degree, (2) are fluent in English or are native speakers, and (3) have a >99% historical approval rate on the platform. All annotators are paid an hourly rate of 12 GBP. Prolific takes an extra 33% service charge and 20% VAT on the service charge.

For caption generation, we release the task in batches of 200 instances, and the annotator is required to finish a batch in 80 minutes. An annotator cannot take more than one batch per day. In this way we have a diverse set of annotators and can also prevent annotators from becoming fatigued. For second-round validation, we group 500 data points into one batch, and an annotator is asked to label each batch in 90 minutes.

In total, 24 annotators participated in caption generation and 26 participated in validation. The annotators have diverse demographic backgrounds: they were born in 13 different countries, live in 13 different countries, and have 14 different nationalities. 57.4% of the annotators identify themselves as female and 42.6% as male.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

This project is licensed under the [Apache-2.0 License](https://github.com/cambridgeltl/visual-spatial-reasoning/blob/master/LICENSE).

### Citation Information

```bibtex
@article{Liu2022VisualSR,
  title={Visual Spatial Reasoning},
  author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.00363}
}
```

### Contributions

Thanks to [@juletx](https://github.com/juletx) for adding this dataset.

Provided by:

juletxara

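Summary of original information

Dataset overview

Name: Visual Spatial Reasoning (VSR)

Description: The VSR dataset consists of caption-image pairs with true/false labels. Each caption describes the spatial relation between two objects in the image, and a vision-language model must judge whether the caption is correct.

Language: English (en)

License: Apache-2.0

Dataset size: 10,000 to 100,000 data points

Task type: image classification

Dataset structure

Data instances: Each data point contains the image name, image link, caption, label (true/false), spatial relation, annotator ID, and validator vote information.

Data fields:

image: image name
image_link: image link
caption: caption text
label: true/false label (1 is true, 0 is false)
relation: spatial relation
annotator_id: ID of the annotator who wrote the caption
vote_true_validator_id: IDs of validators who voted True
vote_false_validator_id: IDs of validators who voted False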
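
To make the field list concrete, here is a minimal sketch that parses the two example records from the dataset card and reads the fields named above. The helper names `parse_records` and `summarize` are hypothetical and are not part of the dataset or its repository.

```python
import json

# Two records copied from the dataset card's "Data Instances" example.
SAMPLE_JSONL = """\
{"image": "000000050403.jpg", "image_link": "http://images.cocodataset.org/train2017/000000050403.jpg", "caption": "The teddy bear is in front of the person.", "label": 1, "relation": "in front of", "annotator_id": 31, "vote_true_validator_id": [2, 6], "vote_false_validator_id": []}
{"image": "000000401552.jpg", "image_link": "http://images.cocodataset.org/train2017/000000401552.jpg", "caption": "The umbrella is far away from the motorcycle.", "label": 0, "relation": "far away from", "annotator_id": 2, "vote_true_validator_id": [], "vote_false_validator_id": [2, 9, 1]}
"""


def parse_records(jsonl_text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def summarize(record):
    """Return a compact view of one record (hypothetical helper)."""
    return {
        "caption": record["caption"],
        "is_true": bool(record["label"]),  # 1 -> True, 0 -> False
        "relation": record["relation"],
        "true_votes": len(record["vote_true_validator_id"]),
        "false_votes": len(record["vote_false_validator_id"]),
    }


records = parse_records(SAMPLE_JSONL)
summaries = [summarize(r) for r in records]
for s in summaries:
    print(s)
```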

Data splits: The dataset has a random split and a zero-shot split, with the following statistics:

Split	Train	Dev	Test	Total
Random	7,083	1,012	2,024	10,119
Zero-shot	5,440	259	731	6,430
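
As a quick check on the split table, the following sketch recomputes each split's train/dev/test shares from the counts above. Only the counts come from the dataset card; the dictionary layout and the function name `split_shares` are illustrative.

```python
# Counts copied from the split table.
SPLITS = {
    "random": {"train": 7083, "dev": 1012, "test": 2024, "total": 10119},
    "zero-shot": {"train": 5440, "dev": 259, "test": 731, "total": 6430},
}


def split_shares(counts):
    """Return each subset's share of the split, after checking the row sums."""
    parts = {name: n for name, n in counts.items() if name != "total"}
    if sum(parts.values()) != counts["total"]:
        raise ValueError("subset counts do not add up to the stated total")
    return {name: n / counts["total"] for name, n in parts.items()}


shares = {name: split_shares(counts) for name, counts in SPLITS.items()}
for name, share in shares.items():
    print(name, {subset: f"{value:.1%}" for subset, value in share.items()})

# Share of the 10,119 validated data points that the zero-shot split uses.
zero_shot_coverage = SPLITS["zero-shot"]["total"] / SPLITS["random"]["total"]
print(f"zero-shot coverage: {zero_shot_coverage:.1%}")
```

The random split comes out at roughly 70/10/20, while the zero-shot split uses fewer data points overall and has a proportionally smaller development set.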

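Dataset creation

Annotation process: Captions are generated from a template; annotators select a spatial relation from a fixed set to fill the template slot. Every annotated data point is reviewed by at least two additional annotators (validators), and only data points on which at least 2/3 of the validators agree with the original label are kept.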
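
The 2/3 agreement rule can be sketched as a filter over records. This is an illustration, not the authors' code: it assumes that the `label` field holds the original label and that the two vote lists together cover all validators for a data point.

```python
from fractions import Fraction


def agreement(record):
    """Share of validators whose vote matches the record's label, or None if no votes."""
    true_votes = len(record["vote_true_validator_id"])
    false_votes = len(record["vote_false_validator_id"])
    total = true_votes + false_votes
    if total == 0:
        return None
    agreeing = true_votes if record["label"] == 1 else false_votes
    return Fraction(agreeing, total)  # exact arithmetic avoids float edge cases at 2/3


def passes_validation(record, threshold=Fraction(2, 3)):
    """Keep a data point only if at least `threshold` of validators agree with its label."""
    share = agreement(record)
    return share is not None and share >= threshold


# Hypothetical vote patterns (the IDs are made up for illustration).
unanimous = {"label": 1, "vote_true_validator_id": [2, 6], "vote_false_validator_id": []}
borderline = {"label": 0, "vote_true_validator_id": [4], "vote_false_validator_id": [5, 7]}
rejected = {"label": 1, "vote_true_validator_id": [1], "vote_false_validator_id": [3, 8]}

for name, rec in [("unanimous", unanimous), ("borderline", borderline), ("rejected", rejected)]:
    print(name, agreement(rec), passes_validation(rec))
```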

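Annotators: Annotators were recruited from prolific.co. They hold at least a bachelor's degree, are fluent in English, and have a historical approval rate above 99%. They were paid 12 GBP per hour.

Usage notes

License: The dataset is released under the Apache-2.0 license.

Citation: @article{Liu2022VisualSR, title={Visual Spatial Reasoning}, author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier}, journal={ArXiv}, year={2022}, volume={abs/2205.00363} }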

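Collected summary

Dataset introduction

Construction

In vision-language reasoning, understanding spatial relations is a cornerstone of model intelligence. The Visual Spatial Reasoning (VSR) dataset was built with a two-phase human annotation process. First, starting from MS COCO 2017, images containing a given pair of concepts are retrieved, and images with multiple instances of a concept or with overly small instances are excluded so that object references stay unambiguous. Next, annotators write a caption for each image pair using a template and a predefined set of 71 spatial relations; the caption must correctly describe one image and incorrectly describe the other, which focuses attention on the difference in spatial relations. Finally, every data point is independently reviewed by at least two validators, and only samples on which at least two thirds of the validators agree with the original label are kept, yielding 10,119 high-quality image-caption pairs.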
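
The image filtering step can be sketched as follows. The input layout (a list of `(category, area)` tuples per image) and the name `is_usable` are assumptions made for illustration. The 30,000 threshold comes from the dataset card, which does not state its unit; it is presumably pixels, as in COCO's `area` annotation.

```python
from collections import Counter

MIN_AREA = 30_000  # threshold from the dataset card; unit assumed to match COCO's `area`


def is_usable(instances, concept_a, concept_b, min_area=MIN_AREA):
    """Check one image against the two filters described above.

    `instances` is a hypothetical list of (category, area) tuples for the image.
    The image is kept only if each concept appears exactly once and both of
    those instances are at least `min_area` in size.
    """
    counts = Counter(category for category, _ in instances)
    if counts[concept_a] != 1 or counts[concept_b] != 1:
        return False  # missing concept, or ambiguous reference
    return all(
        area >= min_area
        for category, area in instances
        if category in (concept_a, concept_b)
    )


# Made-up images for illustration.
images = {
    "kept": [("dog", 45_000), ("bench", 60_000), ("person", 500)],
    "two_dogs": [("dog", 45_000), ("dog", 50_000), ("bench", 60_000)],
    "tiny_dog": [("dog", 12_000), ("bench", 60_000)],
}
usable = {name: is_usable(inst, "dog", "bench") for name, inst in images.items()}
print(usable)
```

In this sketch, instances of other categories (the small `person` above) do not affect the decision, since the card describes the filters in terms of the two selected concepts.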

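Features

The dataset's core feature is its focus on fine-grained evaluation of spatial relations, which gives a highly interpretable benchmark for diagnosing vision-language models. Each sample contains an image, a caption describing the spatial relation between two objects, and a binary label indicating whether the caption is correct. It is distinctive in offering two splits, random and zero-shot; the latter ensures that the training and test sets share no concepts, so it measures generalization to unseen concepts. In addition, validation tolerated different reference frames (for example, the object's own versus the viewer's perspective), which makes the task harder and prevents models from relying only on the relative positions of objects in the image.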
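
The zero-shot property can be checked by recovering the two concepts from each caption and testing that the splits share none. The sketch below assumes every caption follows the card's template, "The OBJ1 (is) REL the OBJ2.", and uses the `relation` field to locate the slot; the function names and the two made-up captions are illustrative, and real captions may need more careful parsing.

```python
import re


def extract_concepts(caption, relation):
    """Recover (OBJ1, OBJ2) from a templated caption, given its relation."""
    pattern = rf"^The (?P<obj1>.+?) (?:is )?{re.escape(relation)} the (?P<obj2>.+)\.$"
    match = re.match(pattern, caption)
    if match is None:
        raise ValueError(f"caption does not fit the template: {caption!r}")
    return match["obj1"], match["obj2"]


def concepts_of(split):
    """Collect every concept mentioned in a list of (caption, relation) pairs."""
    found = set()
    for caption, relation in split:
        found.update(extract_concepts(caption, relation))
    return found


def is_zero_shot(train, dev, test):
    """True if no concept appears in more than one of the three sets."""
    a, b, c = concepts_of(train), concepts_of(dev), concepts_of(test)
    return not (a & b or a & c or b & c)


# The first two captions come from the dataset card; the others are made up.
train = [("The teddy bear is in front of the person.", "in front of")]
dev = [("The umbrella is far away from the motorcycle.", "far away from")]
test = [("The bowl contains the apple.", "contains")]
leaky_test = [("The dog is behind the person.", "behind")]

print(is_zero_shot(train, dev, test))        # disjoint concepts
print(is_zero_shot(train, dev, leaky_test))  # "person" also appears in train
```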

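Usage

The dataset is mainly used to evaluate vision-language models on spatial relation reasoning. It can be loaded through the Hugging Face platform; the data is organized as JSON Lines and includes image links, captions, labels, and metadata. A typical workflow is to load visual features from the provided image links or from local COCO images, combine them with the caption text to build multimodal inputs, and then train and evaluate baselines such as VisualBERT, LXMERT, or ViLT. The two splits support different experimental settings: the random split for standard evaluation and the zero-shot split for testing concept generalization. Results can be tracked and compared on the Papers with Code leaderboard.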
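
The evaluation end of that workflow can be sketched without any model. The code below turns records into `(image_link, caption, label)` triples and scores a predictor by accuracy. The `always_true` predictor is a hypothetical stand-in for a real vision-language model, and the function names are illustrative rather than part of any library.

```python
# Records trimmed to the three fields this sketch needs (values from the dataset card).
EXAMPLES = [
    {
        "image_link": "http://images.cocodataset.org/train2017/000000050403.jpg",
        "caption": "The teddy bear is in front of the person.",
        "label": 1,
    },
    {
        "image_link": "http://images.cocodataset.org/train2017/000000401552.jpg",
        "caption": "The umbrella is far away from the motorcycle.",
        "label": 0,
    },
]


def build_inputs(records):
    """Turn records into (image_link, caption, label) triples."""
    return [(r["image_link"], r["caption"], r["label"]) for r in records]


def accuracy(examples, predict):
    """Fraction of examples for which predict(image_link, caption) equals the label."""
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(predict(link, caption) == label for link, caption, label in examples)
    return correct / len(examples)


def always_true(image_link, caption):
    """Trivial stand-in predictor: ignores its inputs and answers True (1)."""
    return 1


score = accuracy(build_inputs(EXAMPLES), always_true)
print(f"accuracy: {score:.2f}")
```

A real experiment would replace `always_true` with a function that fetches the image, runs the model on the image and caption, and returns 0 or 1.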

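Background and challenges

Background overview

The Visual Spatial Reasoning (VSR) dataset was created in 2022 by the Language Technology Lab at the University of Cambridge to study how well vision-language models understand spatial relations. It focuses on descriptions of the spatial relation between two objects in an image and, through caption-image pairs labeled true or false, serves as a precise diagnostic tool. Its central research question is how well models parse spatial semantics such as directional, topological, and projective relations. The dataset is built on MS COCO images and uses crowdsourced annotation followed by a second validation round for quality, making it a useful benchmark for vision-language understanding.

Current challenges

VSR targets the core challenge of judging spatial relations in vision-language understanding: models must go beyond object recognition and parse orientation, distance, and relative position. The main construction challenge lay in annotation design. To keep the data consistent and interpretable, the authors used template-based annotation, which restricts annotators' freedom but requires balancing template flexibility against semantic coverage. The second validation round improved annotation quality but raised the issue of multiple reference frames: relations such as "left"/"right" must be considered from both the object's and the viewer's perspective, which makes the model's judgment harder. Finally, the zero-shot split requires the training and test sets to share no concepts, which places higher demands on generalization.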
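
The reference-frame rule given to validators amounts to a logical OR over frames: a caption is True if it holds under any frame and False only if it fails under all of them. A minimal sketch, with hypothetical frame names:

```python
def validator_label(holds_in_frame):
    """Return 1 (True) if the caption holds in at least one reference frame, else 0.

    `holds_in_frame` maps a frame name to whether the caption is correct
    when read from that frame. The frame names used below are illustrative.
    """
    return int(any(holds_in_frame.values()))


# A caption that is correct only from the object's own orientation still counts as True.
object_only = validator_label({"object": True, "viewer": False})
# A caption that fails under every frame is False.
neither = validator_label({"object": False, "viewer": False})
print(object_only, neither)
```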

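Common use cases

Typical use case

In vision-language understanding, spatial relation reasoning is a key step toward higher-level cognitive ability. Through a carefully designed annotation process, the Visual Spatial Reasoning (VSR) dataset provides a corpus of caption-image pairs with true/false labels, built specifically to evaluate how well models understand spatial relations between objects. Its main use is as a standardized benchmark for vision-language models: under the random and zero-shot splits, researchers can systematically test models such as VisualBERT, LXMERT, and ViLT on spatial relation judgment and expose their strengths and limitations in fine-grained visual reasoning.

Derived and related work

Since its release, VSR has become an important benchmark for evaluating and advancing vision-language models. Related work goes beyond the baseline comparison of VisualBERT, LXMERT, and ViLT reported with the dataset and has prompted further study of how models reason about space. For example, some studies use VSR to analyze how spatial knowledge is represented in multimodal Transformer models, or design new architectural modules aimed at directional and topological relations. Together, these works advance cross-modal alignment between vision and language and provide a data foundation for AI systems with stronger commonsense reasoning.

Recent research on the dataset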