VISION_LANGUAGE
Source: ModelScope community. Updated 2025-10-09; indexed 2025-07-26.
Download link:
https://modelscope.cn/datasets/microsoft/VISION_LANGUAGE
Description:
A key question for understanding the multimodal versus language capabilities of models is the
relative strength of spatial reasoning and understanding in each modality, since spatial understanding is
expected to be a strength of multimodal models. To test this, we created a procedurally generated synthetic dataset
for testing spatial reasoning, navigation, and counting. The tasks are challenging, and because the data is
procedurally generated, new versions can easily be created to counter the risk that models were trained
on this data and that results reflect memorization. For each task, every question comes with both an image and a text
representation, each of which is sufficient on its own to answer the question.
This dataset has three tasks that test Spatial Understanding (Spatial-Map), Navigation (Maze),
and Counting (Spatial-Grid). Each task has three conditions with respect to the input
modality: 1) text-only, which consists of a text-only input and a question; 2) vision-only, the standard
visual question answering task, which consists of a vision-only input and a question; and 3) vision-text,
which includes both text and image representations with the question. Each condition includes 1500
image and text pairs, for a total of 4500.
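
The three conditions differ only in which representation of the same scene is shown to the model. The following is a minimal sketch of that idea, not the dataset's actual schema: the `Sample` class, its field names, and `build_input` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# The three input-modality conditions named in the description above.
CONDITIONS = ("text-only", "vision-only", "vision-text")


@dataclass
class Sample:
    """Hypothetical container; the real dataset's field names may differ."""
    text: str        # textual representation of the scene
    image_path: str  # rendered image of the same scene
    question: str


def build_input(sample: Sample, condition: str) -> dict:
    """Keep only the modalities that the given condition exposes."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    return {
        "text": sample.text if condition != "vision-only" else None,
        "image": sample.image_path if condition != "text-only" else None,
        "question": sample.question,
    }
```

With 1500 pairs in each of the three conditions, a task contains 3 × 1500 = 4500 pairs.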
__Spatial Map__
The dataset consists of spatial relationships for random layouts of symbolic objects with text names on a white background.
Each object is associated with a unique location name, such as "Unicorn Umbrellas" and "Gale Gifts". To study the impact of modality,
the textual representation of each input consists of pairwise relations such as "Brews Brothers Pub
is to the Southeast of Whale’s Watches". The questions ask about the spatial
relationship between two locations and about the number of objects that meet specific spatial criteria.
The dataset includes 3 conditions: text only, image only, and text+image. Each condition includes 1500 image and text pairs for a total of 4500.
There are 3 question types:
1) In which direction one object lies relative to another (answer is a direction)
2) Which object lies in a given direction from another (answer is an object name)
3) How many objects lie in a given direction from another (answer is a number)
Each question is multiple choice.
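
As an illustration of how pairwise relations and the counting question relate to a layout, here is a minimal sketch. It is not the authors' generator: the coordinates are made up, the convention that x grows east and y grows north is an assumption, and `direction`, `pairwise_relations`, and `count_in_direction` are hypothetical helpers.

```python
from itertools import permutations


def direction(a, b):
    """Direction of point a relative to point b (assumes x grows east, y grows north)."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    ns = "North" if dy > 0 else "South" if dy < 0 else ""
    ew = "east" if dx > 0 else "west" if dx < 0 else ""
    if ns and ew:
        return ns + ew              # e.g. "Southeast"
    return ns or ew.capitalize()    # purely cardinal, or "" for identical points


def pairwise_relations(layout):
    """Textual representation: one sentence per ordered pair of locations."""
    return [
        f"{a} is to the {direction(layout[a], layout[b])} of {b}"
        for a, b in permutations(layout, 2)
    ]


def count_in_direction(layout, anchor, wanted):
    """Question type 3: how many objects lie in direction `wanted` from `anchor`."""
    return sum(
        1
        for name, pos in layout.items()
        if name != anchor and direction(pos, layout[anchor]) == wanted
    )


# Hypothetical coordinates for location names mentioned in the description.
layout = {
    "Brews Brothers Pub": (5, 1),
    "Whale's Watches": (2, 4),
    "Gale Gifts": (6, 2),
}
print(pairwise_relations(layout)[0])
# Brews Brothers Pub is to the Southeast of Whale's Watches
print(count_in_direction(layout, "Whale's Watches", "Southeast"))  # 2
```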
__Maze__
The dataset consists of small mazes with questions asked about the maze. Each sample can be
represented as colored blocks where different colors signify distinct elements: a green block marks
the starting point (S), a red block indicates the exit (E), black blocks represent impassable walls,
white blocks denote navigable paths, and blue blocks trace the path from S to E. The objective is to
navigate from S to E following the blue path, with movement permitted in the four cardinal directions
(up, down, left, right). Alternatively, each input can be depicted in textual format using ASCII code.
The questions asked include counting the number of turns from S to E and determining the spatial relationship
between S and E.
The dataset includes 3 conditions: text only, image only, and text+image. Each condition includes 1500 image and text pairs for a total of 4500.
There are 3 question types:
1) How many right turns on the path from start to end (answer is a number)
2) How many total turns on the path from start to end (answer is a number)
3) Where is the exit relative to the start (answer is a direction or yes/no)
Each question is multiple choice.
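
The two turn-counting questions can be illustrated with a short sketch. It assumes the blue path is already available as an ordered list of `(row, column)` cells with rows growing downward, as on screen; that representation and the `count_turns` helper are assumptions for illustration, not part of the dataset.

```python
# Headings as (d_row, d_col) in clockwise order: up, right, down, left.
CLOCKWISE = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def count_turns(path):
    """Return (right_turns, total_turns) for a path of adjacent (row, col) cells."""
    headings = [
        (r2 - r1, c2 - c1) for (r1, c1), (r2, c2) in zip(path, path[1:])
    ]
    right = total = 0
    for prev, cur in zip(headings, headings[1:]):
        if prev == cur:
            continue  # still moving straight
        total += 1
        # A right turn moves one step clockwise through the heading list.
        if CLOCKWISE.index(cur) == (CLOCKWISE.index(prev) + 1) % 4:
            right += 1
    return right, total


# Hypothetical path: up, then right, then up again.
path = [(2, 0), (1, 0), (1, 1), (0, 1)]
print(count_turns(path))  # (1, 2)
```

Here the first turn (up to right) is a right turn and the second (right to up) is a left turn, so the path has one right turn and two turns in total.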
__Spatial Grid__
Each input consists of a grid of cells, each containing an image (e.g., a rabbit). Alternatively, this grid
can also be represented in a purely textual format; for instance, the first row might be described as:
elephant | cat | giraffe | elephant | cat. The evaluations focus on tasks such as counting specific objects (e.g., rabbits) and
identifying the object located at a specific coordinate in the grid (e.g., first row, second column).
The dataset includes 3 conditions: text only, image only, and text+image. Each condition includes 1500 image and text pairs for a total of 4500 questions.
There are 3 question types:
1) How many blocks contain a specific animal (answer is a number)
2) What animal is in one specific block, addressed by top-left, top, right, etc. (answer is an animal name)
3) What animal is in one specific block, addressed by row, column (answer is an animal name)
Each question is multiple choice.
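
The counting and coordinate-lookup questions can be sketched over the textual grid format as follows. The first row is the example from the description; the second row is made up, and `parse_grid`, `count_animal`, and `animal_at` are hypothetical helpers that assume one row per line with cells separated by `|`.

```python
def parse_grid(text):
    """Split a textual grid into a list of rows, each a list of cell labels."""
    return [
        [cell.strip() for cell in line.split("|")]
        for line in text.strip().splitlines()
    ]


def count_animal(grid, animal):
    """Question type 1: number of cells that contain the given animal."""
    return sum(row.count(animal) for row in grid)


def animal_at(grid, row, col):
    """Question type 3: 1-based lookup, as in 'first row, second column'."""
    return grid[row - 1][col - 1]


grid_text = """
elephant | cat | giraffe | elephant | cat
rabbit | cat | rabbit | giraffe | elephant
"""
grid = parse_grid(grid_text)
print(count_animal(grid, "rabbit"))  # 2
print(animal_at(grid, 1, 2))         # cat
```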
---
More details here: https://arxiv.org/pdf/2406.14852
Provider:
maas
Created:
2025-07-22



