thomas-yanxin/vsi-bench-mirror
Source: Hugging Face. Updated: 2026-03-30. Indexed: 2026-04-12.
Download link: https://hf-mirror.com/datasets/thomas-yanxin/vsi-bench-mirror
Resource description:
---
license: apache-2.0
task_categories:
- visual-question-answering
language:
- en
tags:
- Video
- Text
size_categories:
- 1K<n<10K
configs:
- config_name: full
  data_files:
  - split: test
    path: "test*.parquet"
  default: true
- config_name: debiased
  data_files:
  - split: test
    path: "test_debiased.parquet"
- config_name: pruned
  data_files:
  - split: test
    path: "test_pruned.parquet"
---
<!-- <div align="center"> -->
| Dataset | arXiv | Website | Code |
| :------ | :---- | :------ | :--- |
| **VSI-Bench** | <a href="https://arxiv.org/abs/2412.14171" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-thinking--in--space-red?logo=arxiv" height="20" /></a> | <a href="https://vision-x-nyu.github.io/thinking-in-space.github.io/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Website-thinking--in--space-blue.svg" height="20" /></a> | <a href="https://github.com/vision-x-nyu/thinking-in-space" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-thinking--in--space-white?&logo=github&logoColor=white" /></a> |
| **VSI-Bench-Debiased** | <a href="https://arxiv.org/abs/2511.04655" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-test--set--stress--test-red?logo=arxiv" height="20" /></a> | <a href="https://vision-x-nyu.github.io/test-set-training/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Website-test--set--stress--test-blue.svg" height="20" /></a> | <a href="https://github.com/vision-x-nyu/test-set-training" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-test--set--stress--test-white?&logo=github&logoColor=white" /></a> |
<!-- </div> -->
<br>
> [!IMPORTANT]
> ***[Nov. 7, 2025] UPDATE:** This dataset has been updated to include a "Debiased" subset following the [TsT Pruning Methodology](https://vision-x-nyu.github.io/test-set-training/).*
<br>
# Visual-Spatial Intelligence Benchmark (VSI-Bench & VSI-Bench-Debiased)
This repository contains the visual spatial intelligence benchmark (VSI-Bench), introduced in [Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces](https://arxiv.org/abs/2412.14171), and its debiased counterpart **VSI-Bench-Debiased**, introduced in our follow-up work on systematic benchmark robustification [Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts](https://arxiv.org/abs/2511.04655).
## Overview
**VSI-Bench** evaluates visual-spatial intelligence of multimodal models through egocentric video understanding, comprising over 5,000 question-answer pairs from real-world indoor scenes.
**VSI-Bench-Debiased** is a robustified version that reduces non-visual shortcuts using our Test-set Stress-Test (TsT) and Iterative Bias Pruning (IBP) methodology. This version better isolates visual reasoning capabilities by systematically removing samples that can be solved without visual input.
### Description
VSI-Bench quantitatively evaluates the visual-spatial intelligence of MLLMs from egocentric video. It comprises over 5,000 question-answer pairs derived from 288 real videos. These videos are sourced from the validation sets of the public indoor 3D scene reconstruction datasets `ScanNet`, `ScanNet++`, and `ARKitScenes`. They cover diverse environments, including residential spaces, professional settings (e.g., offices, labs), and industrial spaces (e.g., factories), and span multiple geographic regions. By repurposing these existing 3D reconstruction and understanding datasets, VSI-Bench benefits from accurate object-level annotations, which are used in question generation and could support future studies exploring the connection between MLLMs and 3D reconstruction.
#### Fields
The dataset contains the following fields:
| Field Name | Description |
| :--------- | :---------- |
| `id` | Global index of the entry in the dataset |
| `dataset` | Video source: `scannet`, `arkitscenes` or `scannetpp` |
| `scene_name` | Scene (video) name for each question-answer pair |
| `question_type` | Task type of the question |
| `question` | Question asked about the video |
| `options` | Choices for the question (only for multiple choice questions) |
| `ground_truth` | Ground truth answer for the question |
| `pruned` | Boolean indicating if example was removed by Iterative Bias Pruning (IBP) |
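As a rough illustration of how these fields fit together, the sketch below builds two made-up records and separates multiple-choice from numerical questions. The record values and the helper `is_multiple_choice` are hypothetical. It also assumes that `options` is empty or `None` for numerical questions, which the table implies but does not state.

```python
from typing import Any

# Hypothetical records shaped like the fields table; all values are invented.
records: list[dict[str, Any]] = [
    {
        "id": 0, "dataset": "scannet", "scene_name": "scene_a",
        "question_type": "relative_direction",
        "question": "Is the chair to the left or right of the table?",
        "options": ["A. left", "B. right"], "ground_truth": "A", "pruned": False,
    },
    {
        "id": 1, "dataset": "arkitscenes", "scene_name": "scene_b",
        "question_type": "object_counting",
        "question": "How many chairs are in the room?",
        "options": None, "ground_truth": "4", "pruned": True,
    },
]


def is_multiple_choice(record: dict[str, Any]) -> bool:
    """Assumption: numerical questions carry no options (None or empty)."""
    return bool(record.get("options"))


mc_ids = [r["id"] for r in records if is_multiple_choice(r)]
numeric_ids = [r["id"] for r in records if not is_multiple_choice(r)]
print(mc_ids, numeric_ids)
```

Splitting on `options` rather than on `question_type` avoids hard-coding task names, at the cost of relying on the assumption above.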
### Why VSI-Bench-Debiased?
While the original VSI-Bench was designed to require visual understanding, our follow-up analysis revealed that a portion of questions could be answered using non-visual shortcuts, such as statistical biases in answer distributions or world-knowledge priors, without actually processing the visual input.
**VSI-Bench-Debiased** addresses this through systematic robustification:
1. **Test-set Stress-Test (TsT)**: We applied k-fold cross-validation directly on the test set to identify samples with high non-visual solvability, assigning each sample a bias score.
2. **Iterative Bias Pruning (IBP)**: We iteratively removed samples with the highest bias scores, creating a subset that better compels genuine visual reasoning.
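The two steps above can be sketched in a few lines of Python. This is not the authors' implementation: this card does not specify the diagnostic model or the exact bias-score definition, so a majority-answer-per-question-type predictor stands in for the blind model, and the score is simply whether that predictor gets a held-out sample right. All function names and parameters (`tst_bias_scores`, `iterative_bias_pruning`, `k`, `rounds`, `per_round`) are hypothetical.

```python
import random
from collections import Counter, defaultdict


def tst_bias_scores(samples, k=3, seed=0):
    """k-fold CV on the test set: score each sample with a blind predictor
    fit on the other folds (1.0 if answered correctly without vision)."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    scores = {}
    for held_out in folds:
        held = set(held_out)
        # "Train" the stand-in blind model: most frequent answer per question type.
        counts = defaultdict(Counter)
        for i in order:
            if i not in held:
                counts[samples[i]["question_type"]][samples[i]["ground_truth"]] += 1
        for i in held_out:
            seen = counts.get(samples[i]["question_type"])
            guess = seen.most_common(1)[0][0] if seen else None
            scores[samples[i]["id"]] = 1.0 if guess == samples[i]["ground_truth"] else 0.0
    return scores


def iterative_bias_pruning(samples, rounds=2, per_round=1, k=3):
    """Repeatedly re-score the remaining samples and drop the most biased ones."""
    kept, pruned_ids = list(samples), []
    for _ in range(rounds):
        scores = tst_bias_scores(kept, k=k)
        ranked = sorted(kept, key=lambda s: scores[s["id"]], reverse=True)
        to_drop = {s["id"] for s in ranked[:per_round] if scores[s["id"]] > 0}
        if not to_drop:
            break  # nothing left that the blind predictor can solve
        pruned_ids.extend(sorted(to_drop))
        kept = [s for s in kept if s["id"] not in to_drop]
    return kept, pruned_ids


# Toy data: answer "A" dominates, so "A" samples are solvable without vision.
toy = [{"id": i, "question_type": "t", "ground_truth": "A"} for i in range(5)]
toy.append({"id": 5, "question_type": "t", "ground_truth": "B"})
kept, pruned_ids = iterative_bias_pruning(toy)
print(len(kept), pruned_ids)
```

Re-scoring after each round matters because removing samples changes the answer distribution that the blind predictor exploits, so scores from the first pass go stale.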
**Key improvements in VSI-Bench-Debiased:**
- **Reduced non-visual solvability**: Blind models (text-only, no vision) perform closer to chance
- **Wider vision-blind gap**: Greater performance difference between vision-enabled and vision-disabled models
- **Better isolation of visual reasoning**: Fine-tuning on in-distribution data improves vision-enabled performance much more than blind performance, confirming reduced shortcut reliance
For researchers interested in robust evaluation of visual-spatial intelligence, **we recommend reporting results on both the full and debiased subsets** to provide comprehensive assessment.
## Usage
### Dataset Configurations
This dataset provides three configurations for flexible evaluation:
| Config | Description | Usage |
|--------|-------------|-------|
| `full` (default) | All 5,131 examples with `pruned` column | Load all data, filter as needed |
| `debiased` | 2,363 examples (non-pruned subset) | Evaluate on robustified benchmark |
| `pruned` | 2,768 examples (pruned by IBP) | Analyze removed samples |
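The counts in the table are consistent with `debiased` and `pruned` partitioning `full` (2,363 + 2,768 = 5,131, so roughly 46% of examples are retained). The sketch below uses made-up rows rather than the real data to show how the two smaller configs can be thought of as filters of `full` on the `pruned` column. That the published parquet files were produced exactly this way is an assumption.

```python
# Counts stated in the configuration table.
FULL, DEBIASED, PRUNED = 5131, 2363, 2768
assert DEBIASED + PRUNED == FULL

# Toy stand-in for the `full` config: only `id` and `pruned` matter here.
full_rows = [{"id": i, "pruned": i % 2 == 1} for i in range(6)]

debiased_rows = [r for r in full_rows if not r["pruned"]]
pruned_rows = [r for r in full_rows if r["pruned"]]

# The two subsets are disjoint and together cover the full set.
debiased_ids = {r["id"] for r in debiased_rows}
pruned_ids = {r["id"] for r in pruned_rows}
assert debiased_ids.isdisjoint(pruned_ids)
assert debiased_ids | pruned_ids == {r["id"] for r in full_rows}
print(len(debiased_rows), len(pruned_rows))
```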
#### Loading the Dataset Annotations
##### Load specific configuration
If you want to load just a specific subset, you can use the config name with the `load_dataset` function as follows:
```python
from datasets import load_dataset
# Load full dataset (default)
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench")
# or use the config name "full"
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench", "full")
# Load debiased version only
vsi_bench_debiased = load_dataset("nyu-visionx/VSI-Bench", "debiased")
# Load pruned examples only
vsi_bench_pruned = load_dataset("nyu-visionx/VSI-Bench", "pruned")
```
##### Load full dataset and filter using `pruned` column (recommended)
> [!TIP]
> **For LMMS-Eval users:** We have updated the `vsi-bench` task to automatically report scores on both full and debiased subsets. (TODO: LINK).
We recommend loading the "full" set, evaluating on all samples, and then using the `pruned` column to compute scores on both the full and debiased subsets.
```python
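from datasets import load_dataset

# Load the full test split, including the `pruned` column
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench", split="test")

# Evaluate on the full set.
# `evaluate_model` is a placeholder for your own inference code; it is assumed
# to return a `Dataset` with the original columns plus your predictions.
model_predictions = evaluate_model(vsi_bench_full)

# Score on both the full and debiased subsets.
# `compute_accuracy` is a placeholder for your own scoring function.
full_acc = compute_accuracy(model_predictions)
debiased_acc = compute_accuracy(model_predictions.filter(lambda x: not x["pruned"]))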
```
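For a concrete, self-contained version of the scoring step, the sketch below works on plain Python dictionaries instead of a `datasets.Dataset`. The `correct` field and the `accuracy` helper are hypothetical. In practice the per-sample scores would come from your own evaluation loop.

```python
def accuracy(rows):
    """Mean of the per-sample `correct` flags; returns 0.0 for an empty list."""
    return sum(r["correct"] for r in rows) / len(rows) if rows else 0.0


# Made-up per-sample results: `pruned` comes from the dataset,
# `correct` from your own evaluation.
results = [
    {"id": 0, "pruned": False, "correct": True},
    {"id": 1, "pruned": True, "correct": True},
    {"id": 2, "pruned": True, "correct": True},
    {"id": 3, "pruned": False, "correct": False},
]

full_acc = accuracy(results)
debiased_acc = accuracy([r for r in results if not r["pruned"]])
print(f"full={full_acc:.2f} debiased={debiased_acc:.2f}")  # full=0.75 debiased=0.50
```

Running inference once on the full set and filtering afterwards avoids evaluating the debiased examples twice. The same pattern works if `correct` is replaced by a fractional per-sample score.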
### Evaluation
> [!TIP]
> ***TODO: link to the LMMS Eval Code***
VSI-Bench evaluates performance using two metrics. For multiple-choice questions, we use `Accuracy`, calculated by exact match. For numerical-answer questions, we introduce a new metric, `MRA (Mean Relative Accuracy)`, to assess how closely model predictions align with ground truth values.
We provide an out-of-the-box evaluation of VSI-Bench in our [GitHub repository](https://github.com/vision-x-nyu/thinking-in-space), including the [metrics](https://github.com/vision-x-nyu/thinking-in-space/blob/main/lmms_eval/tasks/vsibench/utils.py#L109C1-L155C36) implementation used in our framework. For further details, users can refer to our paper and GitHub repository.
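To make the two metrics concrete, here is a minimal sketch. It assumes that MRA averages, over confidence thresholds θ from 0.50 to 0.95 in steps of 0.05, whether the relative error |pred - target| / |target| is below 1 - θ. That is one reading of the paper's definition, and the linked `utils.py` remains the authoritative implementation and may differ in details such as threshold spacing or strictness of the inequality. The function names and the case-insensitive normalization in `exact_match` are assumptions.

```python
def mean_relative_accuracy(pred: float, target: float) -> float:
    """MRA for one numerical answer, under the assumed thresholds 0.50..0.95."""
    if target == 0:
        raise ValueError("relative error is undefined for a zero target")
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    rel_err = abs(pred - target) / abs(target)
    # Count the thresholds at which the prediction is "close enough".
    hits = sum(rel_err < 1.0 - t for t in thresholds)
    return hits / len(thresholds)


def exact_match(pred: str, target: str) -> float:
    """Accuracy for one multiple-choice answer (normalization is an assumption)."""
    return 1.0 if pred.strip().lower() == target.strip().lower() else 0.0


print(mean_relative_accuracy(3.0, 3.0))  # perfect prediction
print(mean_relative_accuracy(6.0, 3.0))  # 100% relative error
print(exact_match("A", "a "))
```

Unlike a single tolerance cutoff, averaging over several thresholds gives partial credit that shrinks gradually as the prediction drifts from the ground truth.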
## Files
- `test-*.parquet`: Parquet files containing dataset annotations (questions, answers, metadata).
  * `test_debiased.parquet`: Annotations for the debiased subset (2,363 examples)
  * `test_pruned.parquet`: Annotations for the pruned subset (2,768 examples)
- `*.zip`: Compressed video files for the dataset
  * `arkitscenes.zip`: Videos for the ARKitScenes dataset
  * `scannet.zip`: Videos for the ScanNet dataset
  * `scannetpp.zip`: Videos for the ScanNet++ dataset
- `pruned_ids.txt`: List of example IDs removed by Iterative Bias Pruning
- `create_pq.py`: Convenience script to regenerate parquet files from `test.jsonl` and `pruned_ids.txt`. Can be run with `uv run create_pq.py`.
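The relationship between `test.jsonl`, `pruned_ids.txt`, and the `pruned` column can be illustrated as follows. This is not the repository's `create_pq.py`. It assumes one JSON object per line in `test.jsonl` and one example ID per line in `pruned_ids.txt`, uses in-memory stand-ins for both files, and stops before writing parquet. The helper `attach_pruned_flag` is hypothetical.

```python
import io
import json


def attach_pruned_flag(jsonl_file, ids_file):
    """Return the JSONL rows with a boolean `pruned` field added."""
    # IDs are compared as strings because the text file holds them as text.
    pruned_ids = {line.strip() for line in ids_file if line.strip()}
    rows = []
    for line in jsonl_file:
        if line.strip():
            row = json.loads(line)
            row["pruned"] = str(row["id"]) in pruned_ids
            rows.append(row)
    return rows


# In-memory stand-ins for `test.jsonl` and `pruned_ids.txt` (contents invented).
test_jsonl = io.StringIO('{"id": 0, "question": "q0"}\n{"id": 1, "question": "q1"}\n')
pruned_txt = io.StringIO("1\n")

rows = attach_pruned_flag(test_jsonl, pruned_txt)
print([(r["id"], r["pruned"]) for r in rows])
```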
## Citation
If you use these datasets in your research, please cite the original VSI-Bench paper and our debiasing paper that produced VSI-Bench-Debiased:
```bibtex
@inproceedings{yang2025thinking,
title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
booktitle={CVPR},
year={2025},
}
@article{brown2025benchmark,
title={{Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts}},
author={Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
year={2025},
journal={arXiv preprint arXiv:2511.04655},
}
```
Provider: thomas-yanxin



