VSI-Bench
Source: ModelScope community (updated 2026-05-14, indexed 2024-12-28)
Download link: https://modelscope.cn/datasets/AI-ModelScope/VSI-Bench
Resource description:
<!-- <div align="center"> -->
| Dataset | arXiv | Website | Code |
| :------ | :---- | :------ | :--- |
| **VSI-Bench** | <a href="https://arxiv.org/abs/2412.14171" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-thinking--in--space-red?logo=arxiv" height="20" /></a> | <a href="https://vision-x-nyu.github.io/thinking-in-space.github.io/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Website-thinking--in--space-blue.svg" height="20" /></a> | <a href="https://github.com/vision-x-nyu/thinking-in-space" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-thinking--in--space-white?&logo=github&logoColor=white" /></a> |
| **VSI-Bench-Debiased** | <a href="https://arxiv.org/abs/2511.04655" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-test--set--stress--test-red?logo=arxiv" height="20" /></a> | <a href="https://vision-x-nyu.github.io/test-set-training/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Website-test--set--stress--test-blue.svg" height="20" /></a> | <a href="https://github.com/vision-x-nyu/test-set-training" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-test--set--stress--test-white?&logo=github&logoColor=white" /></a> |
<!-- </div> -->
<br>
> [!IMPORTANT]
> ***[Nov. 7, 2025] UPDATE:** This Dataset has been updated to include a "Debiased" subset following the [TsT Pruning Methodology](https://vision-x-nyu.github.io/test-set-training/)*
<br>
# Visual-Spatial Intelligence Benchmark (VSI-Bench & VSI-Bench-Debiased)
This repository contains the visual spatial intelligence benchmark (VSI-Bench), introduced in [Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces](https://arxiv.org/abs/2412.14171), and its debiased counterpart **VSI-Bench-Debiased**, introduced in our follow-up work on systematic benchmark robustification [Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts](https://arxiv.org/abs/2511.04655).
## Overview
**VSI-Bench** evaluates visual-spatial intelligence of multimodal models through egocentric video understanding, comprising over 5,000 question-answer pairs from real-world indoor scenes.
**VSI-Bench-Debiased** is a robustified version that reduces non-visual shortcuts using our Test-set Stress-Test (TsT) and Iterative Bias Pruning (IBP) methodology. This version better isolates visual reasoning capabilities by systematically removing samples that can be solved without visual input.
### Description
VSI-Bench quantitatively evaluates the visual-spatial intelligence of MLLMs from egocentric video. It comprises over 5,000 question-answer pairs derived from 288 real videos. These videos are sourced from the validation sets of the public indoor 3D scene reconstruction datasets `ScanNet`, `ScanNet++`, and `ARKitScenes`. They represent diverse environments, including residential spaces, professional settings (e.g., offices, labs), and industrial spaces (e.g., factories), and span multiple geographic regions. By repurposing these existing 3D reconstruction and understanding datasets, VSI-Bench benefits from accurate object-level annotations, which are used in question generation and could support future studies exploring the connection between MLLMs and 3D reconstruction.
#### Fields
The dataset contains the following fields:
| Field Name | Description |
| :--------- | :---------- |
| `id` | Global index of the entry in the dataset |
| `dataset` | Video source: `scannet`, `arkitscenes` or `scannetpp` |
| `scene_name` | Scene (video) name for each question-answer pair |
| `question_type` | The type of task for question |
| `question` | Question asked about the video |
| `options` | Choices for the question (only for multiple choice questions) |
| `ground_truth` | Ground truth answer for the question |
| `pruned` | Boolean indicating if example was removed by Iterative Bias Pruning (IBP) |
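
As a rough illustration of how these fields might be consumed, the sketch below builds a text prompt from one record. The helper name `build_prompt`, the instruction wording, and the assumption that `options` is a list of strings for multiple-choice questions (and empty or missing otherwise) are ours for illustration, not part of the dataset specification.

```python
def build_prompt(example):
    """Build a text prompt from one VSI-Bench-style record (hypothetical helper)."""
    question = example["question"]
    # Assumption: `options` is a list of strings for multiple-choice questions
    # and is empty or missing for numerical-answer questions.
    options = example.get("options")
    if options:
        lines = [question, "Options:"] + list(options)
        lines.append("Answer with the letter of the correct option.")
        return "\n".join(lines)
    return question + "\nAnswer with a single number."


# Hypothetical records that only mimic the schema in the table above.
mc_example = {"question": "Which object is closer to the door?",
              "options": ["A. chair", "B. table"], "ground_truth": "A"}
num_example = {"question": "How many chairs are in the room?",
               "options": None, "ground_truth": "4"}
print(build_prompt(mc_example))
print(build_prompt(num_example))
```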
### Why VSI-Bench-Debiased?
While the original VSI-Bench was designed to require visual understanding, our follow-up analysis revealed that a portion of questions could be answered using non-visual shortcuts, such as statistical biases in answer distributions or world-knowledge priors, without actually processing the visual input.
**VSI-Bench-Debiased** addresses this through systematic robustification:
1. **Test-set Stress-Test (TsT)**: We applied k-fold cross-validation directly on the test set to identify samples with high non-visual solvability, assigning each sample a bias score.
2. **Iterative Bias Pruning (IBP)**: We iteratively removed samples with the highest bias scores, creating a subset that better compels genuine visual reasoning.
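
The two steps above can be sketched in miniature as follows. This is a toy illustration only: a majority-answer-per-question-type predictor stands in for the blind model, bias scores are simply 0 or 1, and the function names, `batch`, and `rounds` parameters are assumptions of this sketch rather than the procedure or settings used to build VSI-Bench-Debiased.

```python
from collections import Counter, defaultdict


def tst_bias_scores(samples, k=5):
    """k-fold cross-validation on the test set with a toy 'blind' predictor.

    Returns {id: score}, where score is 1.0 if the held-out sample was answered
    correctly without any visual input, else 0.0.
    """
    scores = {}
    for fold in range(k):
        held_out = [s for i, s in enumerate(samples) if i % k == fold]
        train = [s for i, s in enumerate(samples) if i % k != fold]
        # Toy blind model: the most frequent answer per question type.
        counts = defaultdict(Counter)
        for s in train:
            counts[s["question_type"]][s["ground_truth"]] += 1
        for s in held_out:
            seen = counts.get(s["question_type"])
            guess = seen.most_common(1)[0][0] if seen else None
            scores[s["id"]] = 1.0 if guess == s["ground_truth"] else 0.0
    return scores


def iterative_bias_pruning(samples, k=5, batch=2, rounds=3):
    """Repeatedly re-score the remaining samples and drop the most biased ones."""
    kept, pruned_ids = list(samples), []
    for _ in range(rounds):
        scores = tst_bias_scores(kept, k)
        ranked = sorted(kept, key=lambda s: scores[s["id"]], reverse=True)
        drop = {s["id"] for s in ranked[:batch] if scores[s["id"]] > 0}
        if not drop:
            break  # nothing left that the blind predictor can solve
        pruned_ids.extend(sorted(drop))
        kept = [s for s in kept if s["id"] not in drop]
    return kept, pruned_ids


# Hypothetical toy data: answer "A" is heavily over-represented.
toy = [{"id": i, "question_type": "count", "ground_truth": "A" if i < 8 else "B"}
       for i in range(10)]
kept, pruned_ids = iterative_bias_pruning(toy)
```

Re-scoring after every round matters because removing biased samples changes the answer distribution that the blind predictor learns from.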
**Key improvements in VSI-Bench-Debiased:**
- **Reduced non-visual solvability**: Blind models (text-only, no vision) perform closer to chance
- **Wider vision-blind gap**: Greater performance difference between vision-enabled and vision-disabled models
- **Better isolation of visual reasoning**: Fine-tuning on in-distribution data improves vision-enabled performance much more than blind performance, confirming reduced shortcut reliance
For researchers interested in robust evaluation of visual-spatial intelligence, **we recommend reporting results on both the full and debiased subsets** to provide a comprehensive assessment.
## Usage
### Dataset Configurations
This dataset provides three configurations for flexible evaluation:
| Config | Description | Usage |
|--------|-------------|-------|
| `full` (default) | All 5,131 examples with `pruned` column | Load all data, filter as needed |
| `debiased` | 2,363 examples (non-pruned subset) | Evaluate on robustified benchmark |
| `pruned` | 2,768 examples (pruned by IBP) | Analyze removed samples |
#### Loading the Dataset Annotations
##### Load specific configuration
If you want to load just a specific subset, you can use the config name with the `load_dataset` function as follows:
```python
from datasets import load_dataset
# Load full dataset (default)
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench")
# or use the config name "full"
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench", "full")
# Load debiased version only
vsi_bench_debiased = load_dataset("nyu-visionx/VSI-Bench", "debiased")
# Load pruned examples only
vsi_bench_pruned = load_dataset("nyu-visionx/VSI-Bench", "pruned")
```
##### Load full dataset and filter using `pruned` column (recommended)
> [!TIP]
> **For LMMS-Eval users:** We have updated the `vsi-bench` task to automatically report scores on both full and debiased subsets. (TODO: LINK).
We recommend loading the "full" set, evaluating on all samples, and then using the `pruned` column to compute scores on both the full and debiased subsets.
```python
from datasets import load_dataset
# Load full dataset with pruned annotations
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench")
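# Evaluate on the full set. `evaluate_model` and `compute_accuracy` are
# placeholders for your own inference and scoring code; `model_predictions`
# is assumed to keep the `pruned` column and to support `.filter`.
model_predictions = evaluate_model(vsi_bench_full)
# Score on both the full and debiased subsets
full_acc = compute_accuracy(model_predictions)
debiased_acc = compute_accuracy(model_predictions.filter(lambda x: not x["pruned"]))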
```
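
For readers who want something concrete in place of the placeholders, here is a minimal self-contained sketch of the scoring step on plain Python records. The `prediction` key and both function names are assumptions for illustration, and exact-match accuracy applies only to multiple-choice questions.

```python
def accuracy(records):
    """Exact-match accuracy over records with `prediction` and `ground_truth`."""
    if not records:
        return 0.0
    correct = sum(r["prediction"] == r["ground_truth"] for r in records)
    return correct / len(records)


def full_and_debiased_accuracy(records):
    """Score once on all records and once on the non-pruned (debiased) subset."""
    debiased = [r for r in records if not r["pruned"]]
    return accuracy(records), accuracy(debiased)


# Hypothetical predictions; `pruned` mirrors the dataset column of the same name.
records = [
    {"prediction": "A", "ground_truth": "A", "pruned": True},
    {"prediction": "B", "ground_truth": "B", "pruned": True},
    {"prediction": "C", "ground_truth": "A", "pruned": False},
    {"prediction": "D", "ground_truth": "D", "pruned": False},
]
full_acc, debiased_acc = full_and_debiased_accuracy(records)
print(full_acc, debiased_acc)
```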
### Evaluation
> [!TIP]
> ***TODO: link to the LMMS Eval Code***
VSI-Bench evaluates performance using two metrics: for multiple-choice questions, we use `Accuracy`, calculated based on exact matches. For numerical-answer questions, we introduce a new metric, `MRA (Mean Relative Accuracy)`, to assess how closely model predictions align with ground truth values.
We provide an out-of-the-box evaluation of VSI-Bench in our [GitHub repository](https://github.com/vision-x-nyu/thinking-in-space), including the [metrics](https://github.com/vision-x-nyu/thinking-in-space/blob/main/lmms_eval/tasks/vsibench/utils.py#L109C1-L155C36) implementation used in our framework. For further details, users can refer to our paper and GitHub repository.
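
To give a feel for how a metric like MRA behaves, the sketch below averages a pass/fail check on the relative error over a range of confidence thresholds. The threshold set (0.50 to 0.95 in steps of 0.05) and the strict inequality reflect our reading of the paper's definition and are assumptions here; the linked implementation is authoritative and may differ in detail.

```python
def mean_relative_accuracy(pred, target):
    """Fraction of thresholds at which the relative error is small enough.

    Assumes `target` is non-zero.
    """
    # Assumed thresholds: theta = 0.50, 0.55, ..., 0.95 (ten values).
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    relative_error = abs(pred - target) / abs(target)
    hits = [relative_error < 1 - theta for theta in thresholds]
    return sum(hits) / len(hits)


print(mean_relative_accuracy(3.0, 3.0))   # exact prediction
print(mean_relative_accuracy(0.88, 1.0))  # 12% relative error
print(mean_relative_accuracy(6.0, 3.0))   # 100% relative error
```

Unlike exact match, this gives partial credit: a prediction with 12% relative error passes the eight loosest thresholds and fails the two strictest.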
## Files
- `test-*.parquet`: Parquet files containing dataset annotations (questions, answers, metadata).
    * `test_debiased.parquet`: Annotations for the debiased subset (2,363 examples)
    * `test_pruned.parquet`: Annotations for the pruned subset (2,768 examples)
- `*.zip`: Compressed video files for the dataset
    * `arkitscenes.zip`: Videos for the ARKitScenes dataset
    * `scannet.zip`: Videos for the ScanNet dataset
    * `scannetpp.zip`: Videos for the ScanNet++ dataset
- `pruned_ids.txt`: List of example IDs removed by Iterative Bias Pruning
- `create_pq.py`: Convenience script to regenerate parquet files from `test.jsonl` and `pruned_ids.txt`. Can be run with `uv run create_pq.py`.
## Citation
If you use these datasets in your research, please cite the original VSI-Bench paper and our debiasing paper that produced VSI-Bench-Debiased:
```bibtex
@inproceedings{yang2025thinking,
title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
booktitle={CVPR},
year={2025},
}
@article{brown2025benchmark,
title={{Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts}},
author={Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
year={2025},
journal={arXiv preprint arXiv:2511.04655},
}
```
Provider: maas
Created: 2024-12-26



