VSI-Bench

Name: VSI-Bench
Creator: maas
Published: 2026-05-14 21:35:58
License: No description available

ModelScope community. Updated: 2026-05-14. Indexed: 2024-12-28.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/VSI-Bench

Description:

| Dataset | arXiv | Website | Code |
| :------ | :---- | :------ | :--- |
| **VSI-Bench** | <a href="https://arxiv.org/abs/2412.14171" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-thinking--in--space-red?logo=arxiv" height="20" /></a> | <a href="https://vision-x-nyu.github.io/thinking-in-space.github.io/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Website-thinking--in--space-blue.svg" height="20" /></a> | <a href="https://github.com/vision-x-nyu/thinking-in-space" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-thinking--in--space-white?&logo=github&logoColor=white" /></a> |
| **VSI-Bench-Debiased** | <a href="https://arxiv.org/abs/2511.04655" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-test--set--stress--test-red?logo=arxiv" height="20" /></a> | <a href="https://vision-x-nyu.github.io/test-set-training/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Website-test--set--stress--test-blue.svg" height="20" /></a> | <a href="https://github.com/vision-x-nyu/test-set-training" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-test--set--stress--test-white?&logo=github&logoColor=white" /></a> |

<br>

> [!IMPORTANT]
> ***[Nov. 7, 2025] UPDATE:** This Dataset has been updated to include a "Debiased" subset following the [TsT Pruning Methodology](https://vision-x-nyu.github.io/test-set-training/)*

<br>

# Visual-Spatial Intelligence Benchmark (VSI-Bench & VSI-Bench-Debiased)

This repository contains the visual spatial intelligence benchmark (VSI-Bench), introduced in [Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces](https://arxiv.org/abs/2412.14171), and its debiased counterpart **VSI-Bench-Debiased**, introduced in our follow-up work on systematic benchmark robustification [Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts](https://arxiv.org/abs/2511.04655).

## Overview

**VSI-Bench** evaluates visual-spatial intelligence of multimodal models through egocentric video understanding, comprising over 5,000 question-answer pairs from real-world indoor scenes.

**VSI-Bench-Debiased** is a robustified version that reduces non-visual shortcuts using our Test-set Stress-Test (TsT) and Iterative Bias Pruning (IBP) methodology. This version better isolates visual reasoning capabilities by systematically removing samples that can be solved without visual input.

### Description

VSI-Bench quantitatively evaluates the visual-spatial intelligence of MLLMs from egocentric video. VSI-Bench comprises over 5,000 question-answer pairs derived from 288 real videos. These videos are sourced from the validation sets of the public indoor 3D scene reconstruction datasets `ScanNet`, `ScanNet++`, and `ARKitScenes`, and represent diverse environments -- including residential spaces, professional settings (e.g., offices, labs), and industrial spaces (e.g., factories) -- and multiple geographic regions.
By repurposing these existing 3D reconstruction and understanding datasets, VSI-Bench benefits from accurate object-level annotations, which are used in question generation and could support future studies exploring the connection between MLLMs and 3D reconstruction.

#### Fields

The dataset contains the following fields:

| Field Name | Description |
| :--------- | :---------- |
| `id` | Global index of the entry in the dataset |
| `dataset` | Video source: `scannet`, `arkitscenes` or `scannetpp` |
| `scene_name` | Scene (video) name for each question-answer pair |
| `question_type` | Task type of the question |
| `question` | Question asked about the video |
| `options` | Choices for the question (only for multiple-choice questions) |
| `ground_truth` | Ground truth answer for the question |
| `pruned` | Boolean indicating whether the example was removed by Iterative Bias Pruning (IBP) |

### Why VSI-Bench-Debiased?

While the original VSI-Bench was designed to require visual understanding, our follow-up analysis revealed that a portion of questions could be answered using non-visual shortcuts (such as statistical biases in answer distributions or world knowledge priors) without actually processing the visual input.

**VSI-Bench-Debiased** addresses this through systematic robustification:

1. **Test-set Stress-Test (TsT)**: We applied k-fold cross-validation directly on the test set to identify samples with high non-visual solvability, assigning each sample a bias score.
2. **Iterative Bias Pruning (IBP)**: We iteratively removed samples with the highest bias scores, creating a subset that better compels genuine visual reasoning.
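
The two steps listed above can be illustrated with a toy sketch. This is not the authors' implementation: the blind predictor here is a hypothetical majority-answer-per-`question_type` baseline swapped in for a real text-only model, the bias score is simply whether that baseline answers a held-out sample correctly, and the function names `tst_bias_scores` and `iterative_bias_pruning`, the fold count, and the pruning budget are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tst_bias_scores(samples, k=3):
    """Score each sample by k-fold cross-validation on the test set itself.

    Toy stand-in for a blind (text-only) model: predict the most common
    answer seen for the same question_type in the other k-1 folds.
    A score of 1.0 means the held-out sample was answered without video.
    """
    scores = [0.0] * len(samples)
    for fold in range(k):
        # Fit the toy blind model on every sample outside the current fold.
        answers_by_type = defaultdict(Counter)
        for i, s in enumerate(samples):
            if i % k != fold:
                answers_by_type[s["question_type"]][s["ground_truth"]] += 1
        # Score the held-out fold.
        for i, s in enumerate(samples):
            if i % k == fold:
                seen = answers_by_type[s["question_type"]]
                if seen:
                    guess = seen.most_common(1)[0][0]
                    scores[i] = 1.0 if guess == s["ground_truth"] else 0.0
    return scores


def iterative_bias_pruning(samples, rounds=2, per_round=1, k=3):
    """Repeatedly re-score the remaining samples and drop the most biased."""
    kept = list(samples)
    pruned_ids = []
    for _ in range(rounds):
        scores = tst_bias_scores(kept, k=k)
        # Highest bias score first; ties keep their original order.
        order = sorted(range(len(kept)), key=lambda i: -scores[i])
        drop = {i for i in order[:per_round] if scores[i] > 0}
        if not drop:
            break
        pruned_ids.extend(kept[i]["id"] for i in sorted(drop))
        kept = [s for i, s in enumerate(kept) if i not in drop]
    return kept, pruned_ids


# Hypothetical annotations: answer "A" dominates, so it is guessable blind.
samples = [
    {"id": i, "question_type": "rel_dir", "ground_truth": gt}
    for i, gt in enumerate(["A", "A", "A", "B", "A", "C"])
]
kept, pruned_ids = iterative_bias_pruning(samples)
print("kept:", [s["id"] for s in kept], "pruned:", pruned_ids)
```

Re-scoring after every round matters in this sketch: once some over-represented answers are removed, the remaining answer distribution shifts, so the samples that look most guessable can change from one round to the next.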
**Key improvements in VSI-Bench-Debiased:**

- **Reduced non-visual solvability**: Blind models (text-only, no vision) perform closer to chance
- **Wider vision-blind gap**: Greater performance difference between vision-enabled and vision-disabled models
- **Better isolation of visual reasoning**: Fine-tuning on in-distribution data improves vision-enabled performance much more than blind performance, confirming reduced shortcut reliance

For researchers interested in robust evaluation of visual-spatial intelligence, **we recommend reporting results on both the full and debiased subsets** to provide a comprehensive assessment.

## Usage

### Dataset Configurations

This dataset provides three configurations for flexible evaluation:

| Config | Description | Usage |
|--------|-------------|-------|
| `full` (default) | All 5,131 examples with `pruned` column | Load all data, filter as needed |
| `debiased` | 2,363 examples (non-pruned subset) | Evaluate on robustified benchmark |
| `pruned` | 2,768 examples (pruned by IBP) | Analyze removed samples |

#### Loading the Dataset Annotations

##### Load specific configuration

To load only a specific subset, pass the config name to the `load_dataset` function:

```python
from datasets import load_dataset

# Load full dataset (default)
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench")
# or use the config name "full"
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench", "full")

# Load debiased version only
vsi_bench_debiased = load_dataset("nyu-visionx/VSI-Bench", "debiased")

# Load pruned examples only
vsi_bench_pruned = load_dataset("nyu-visionx/VSI-Bench", "pruned")
```

##### Load full dataset and filter using `pruned` column (recommended)

> [!TIP]
> **For LMMS-Eval users:** We have updated the `vsi-bench` task to automatically report scores on both full and debiased subsets. (TODO: LINK).

We recommend loading the "full" set, evaluating on all samples, and then using the `pruned` column to compute scores on both the full and debiased subsets.

```python
from datasets import load_dataset

# Load full dataset with pruned annotations
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench")

# Evaluate on full set.
# `evaluate_model` is a placeholder for your own inference code; it is assumed
# to return a `datasets.Dataset` of predictions that keeps the `pruned` column.
model_predictions = evaluate_model(vsi_bench_full)

# Score on both the full and debiased subsets.
# `compute_accuracy` is likewise a placeholder for your own scoring function.
full_acc = compute_accuracy(model_predictions)
debiased_acc = compute_accuracy(model_predictions.filter(lambda x: not x["pruned"]))
```

### Evaluation

> [!TIP]
> ***TODO: link to the LMMS Eval Code***

VSI-Bench evaluates performance using two metrics. For multiple-choice questions, we use `Accuracy`, calculated based on exact matches. For numerical-answer questions, we introduce a new metric, `MRA (Mean Relative Accuracy)`, to assess how closely model predictions align with ground truth values.

We provide an out-of-the-box evaluation of VSI-Bench in our [GitHub repository](https://github.com/vision-x-nyu/thinking-in-space), including the [metrics](https://github.com/vision-x-nyu/thinking-in-space/blob/main/lmms_eval/tasks/vsibench/utils.py#L109C1-L155C36) implementation used in our framework. For further details, users can refer to our paper and GitHub repository.

## Files

- `test-*.parquet`: Parquet files containing dataset annotations (questions, answers, metadata).
    * `test_debiased.parquet`: Annotations for the debiased subset (2,363 examples)
    * `test_pruned.parquet`: Annotations for the pruned subset (2,768 examples)
- `*.zip`: Compressed video files for the dataset
    * `arkitscenes.zip`: Videos for the ARKitScenes dataset
    * `scannet.zip`: Videos for the ScanNet dataset
    * `scannetpp.zip`: Videos for the ScanNet++ dataset
- `pruned_ids.txt`: List of example IDs removed by Iterative Bias Pruning
- `create_pq.py`: Convenience script to regenerate parquet files from `test.jsonl` and `pruned_ids.txt`. Can be run with `uv run create_pq.py`.
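
As a rough illustration of the two metrics from the Evaluation section and of the `pruned`-based split recommended under Usage, the following plain-Python sketch may help. It is not the repository's implementation: the MRA thresholds (0.50 to 0.95 in steps of 0.05) reflect our reading of the paper's definition and should be checked against the linked `utils.py`, the string normalisation in `exact_match` is an assumption, and the helper names and the `score` key are hypothetical. The official task may also aggregate per question type, whereas this sketch averages over samples.

```python
def exact_match(prediction, ground_truth):
    """Accuracy for multiple-choice questions: 1.0 on an exact match.

    Stripping whitespace and lower-casing is an assumed normalisation.
    """
    same = str(prediction).strip().lower() == str(ground_truth).strip().lower()
    return 1.0 if same else 0.0


def mean_relative_accuracy(prediction, ground_truth, thresholds=None):
    """MRA for numerical answers (assumes a non-zero ground truth).

    Assumed definition: the fraction of confidence thresholds theta for
    which the relative error |pred - gt| / |gt| is at most 1 - theta.
    """
    if thresholds is None:
        # Assumed thresholds: 0.50, 0.55, ..., 0.95
        thresholds = [0.50 + 0.05 * i for i in range(10)]
    gt = float(ground_truth)
    rel_err = abs(float(prediction) - gt) / abs(gt)
    hits = [rel_err <= 1.0 - t for t in thresholds]
    return sum(hits) / len(thresholds)


def score_full_and_debiased(records):
    """Average per-sample scores on the full set and the non-pruned subset."""
    def mean(values):
        values = list(values)
        return sum(values) / len(values) if values else float("nan")

    full = mean(r["score"] for r in records)
    debiased = mean(r["score"] for r in records if not r["pruned"])
    return full, debiased


# Hypothetical per-sample results; `pruned` comes from the dataset column.
records = [
    {"score": exact_match("A", "A"), "pruned": True},
    {"score": exact_match("B", "A"), "pruned": False},
    {"score": mean_relative_accuracy(10, 10), "pruned": False},
]
full_score, debiased_score = score_full_and_debiased(records)
print(f"full={full_score:.3f} debiased={debiased_score:.3f}")
```

Scoring every sample once and splitting afterwards on `pruned` avoids running inference twice, which is the motivation for loading the `full` config.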
## Citation

If you use these datasets in your research, please cite the original VSI-Bench paper and our debiasing paper that produced VSI-Bench-Debiased:

```bibtex
@inproceedings{yang2025thinking,
    title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
    author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
    booktitle={CVPR},
    year={2025},
}

@article{brown2025benchmark,
    title={{Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts}},
    author={Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
    year={2025},
    journal={arXiv preprint arXiv:2511.04655},
}
```


Provider: maas

Created: 2024-12-26
