VSI-Bench

ModelScope community · updated 2026-05-14 · indexed 2024-12-28
Download link:
https://modelscope.cn/datasets/AI-ModelScope/VSI-Bench
<!-- <div align="center"> -->

| Dataset | arXiv | Website | Code |
| :------ | :---- | :------ | :--- |
| **VSI-Bench** | <a href="https://arxiv.org/abs/2412.14171" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-thinking--in--space-red?logo=arxiv" height="20" /></a> | <a href="https://vision-x-nyu.github.io/thinking-in-space.github.io/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Website-thinking--in--space-blue.svg" height="20" /></a> | <a href="https://github.com/vision-x-nyu/thinking-in-space" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-thinking--in--space-white?&logo=github&logoColor=white" /></a> |
| **VSI-Bench-Debiased** | <a href="https://arxiv.org/abs/2511.04655" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-test--set--stress--test-red?logo=arxiv" height="20" /></a> | <a href="https://vision-x-nyu.github.io/test-set-training/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Website-test--set--stress--test-blue.svg" height="20" /></a> | <a href="https://github.com/vision-x-nyu/test-set-training" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-test--set--stress--test-white?&logo=github&logoColor=white" /></a> |

<!-- </div> -->

<br>

> [!IMPORTANT]
> ***[Nov. 7, 2025] UPDATE:** This dataset has been updated to include a "Debiased" subset following the [TsT Pruning Methodology](https://vision-x-nyu.github.io/test-set-training/)*

<br>

# Visual-Spatial Intelligence Benchmark (VSI-Bench & VSI-Bench-Debiased)

This repository contains the visual-spatial intelligence benchmark (VSI-Bench), introduced in [Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces](https://arxiv.org/abs/2412.14171), and its debiased counterpart **VSI-Bench-Debiased**, introduced in our follow-up work on systematic benchmark robustification, [Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts](https://arxiv.org/abs/2511.04655).

## Overview

**VSI-Bench** evaluates the visual-spatial intelligence of multimodal models through egocentric video understanding, comprising over 5,000 question-answer pairs from real-world indoor scenes.

**VSI-Bench-Debiased** is a robustified version that reduces non-visual shortcuts using our Test-set Stress-Test (TsT) and Iterative Bias Pruning (IBP) methodology. This version better isolates visual reasoning capabilities by systematically removing samples that can be solved without visual input.

### Description

VSI-Bench quantitatively evaluates the visual-spatial intelligence of MLLMs from egocentric video. VSI-Bench comprises over 5,000 question-answer pairs derived from 288 real videos. These videos are sourced from the validation sets of the public indoor 3D scene reconstruction datasets `ScanNet`, `ScanNet++`, and `ARKitScenes`. They represent diverse environments -- including residential spaces, professional settings (e.g., offices, labs), and industrial spaces (e.g., factories) -- and multiple geographic regions.
By repurposing these existing 3D reconstruction and understanding datasets, VSI-Bench benefits from accurate object-level annotations, which are used in question generation and could support future studies exploring the connection between MLLMs and 3D reconstruction.

#### Fields

The dataset contains the following fields:

| Field Name | Description |
| :--------- | :---------- |
| `id` | Global index of the entry in the dataset |
| `dataset` | Video source: `scannet`, `arkitscenes` or `scannetpp` |
| `scene_name` | Scene (video) name for each question-answer pair |
| `question_type` | The type of task for the question |
| `question` | Question asked about the video |
| `options` | Choices for the question (only for multiple-choice questions) |
| `ground_truth` | Ground truth answer for the question |
| `pruned` | Boolean indicating whether the example was removed by Iterative Bias Pruning (IBP) |

### Why VSI-Bench-Debiased?

While the original VSI-Bench was designed to require visual understanding, our follow-up analysis revealed that a portion of questions could be answered using non-visual shortcuts (such as statistical biases in answer distributions or world-knowledge priors) without actually processing the visual input.

**VSI-Bench-Debiased** addresses this through systematic robustification:

1. **Test-set Stress-Test (TsT)**: We applied k-fold cross-validation directly on the test set to identify samples with high non-visual solvability, assigning each sample a bias score.
2. **Iterative Bias Pruning (IBP)**: We iteratively removed samples with the highest bias scores, creating a subset that better compels genuine visual reasoning.
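The two steps above can be sketched in a few lines of standard-library Python. This is a toy illustration only, not the procedure from the paper: the "blind" predictor here is an assumed stand-in that guesses the most common answer per `question_type` seen in the training folds, the bias score is simply 1.0 when that held-out guess is correct, and the function names, `k`, `rounds`, and `n_remove` are all hypothetical. The diagnostic models, scoring, and stopping criteria actually used for VSI-Bench-Debiased are described in the linked paper.

```python
import random
from collections import Counter, defaultdict


def blind_bias_scores(samples, k=5, seed=0):
    """k-fold CV on the test set with a text-only majority-answer baseline.

    Returns {id: score}, where score is 1.0 if the held-out sample was
    answered correctly without any visual input, else 0.0.
    (Assumed stand-in scoring, not the paper's.)
    """
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]

    scores = {}
    for held_out in folds:
        held = set(held_out)
        # "Train": count answers per question type on the other folds.
        counts = defaultdict(Counter)
        for i in order:
            if i not in held:
                s = samples[i]
                counts[s["question_type"]][s["ground_truth"]] += 1
        # "Test": guess the majority answer for each held-out sample.
        for i in held_out:
            s = samples[i]
            seen = counts.get(s["question_type"])
            guess = seen.most_common(1)[0][0] if seen else None
            scores[s["id"]] = 1.0 if guess == s["ground_truth"] else 0.0
    return scores


def prune_highest_bias(samples, scores, n_remove):
    """Drop the n_remove samples with the highest bias scores."""
    ranked = sorted(samples, key=lambda s: scores[s["id"]], reverse=True)
    removed = {s["id"] for s in ranked[:n_remove]}
    return [s for s in samples if s["id"] not in removed]


def iterative_bias_pruning(samples, rounds=3, n_remove=2, k=5):
    """Alternate re-scoring and pruning for a fixed number of rounds."""
    kept = list(samples)
    for _ in range(rounds):
        if len(kept) < k:
            break  # too few samples left to form k folds
        scores = blind_bias_scores(kept, k=k)
        kept = prune_highest_bias(kept, scores, n_remove)
    return kept
```

Re-scoring inside the loop matters in this sketch: once the most predictable samples are gone, the answer distribution shifts, so scores from an earlier round may no longer reflect what a blind model could exploit.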
**Key improvements in VSI-Bench-Debiased:**

- **Reduced non-visual solvability**: Blind models (text-only, no vision) perform closer to chance
- **Wider vision-blind gap**: Greater performance difference between vision-enabled and vision-disabled models
- **Better isolation of visual reasoning**: Fine-tuning on in-distribution data improves vision-enabled performance much more than blind performance, confirming reduced shortcut reliance

For researchers interested in robust evaluation of visual-spatial intelligence, **we recommend reporting results on both the full and debiased subsets** to provide a comprehensive assessment.

## Usage

### Dataset Configurations

This dataset provides three configurations for flexible evaluation:

| Config | Description | Usage |
|--------|-------------|-------|
| `full` (default) | All 5,131 examples with `pruned` column | Load all data, filter as needed |
| `debiased` | 2,363 examples (non-pruned subset) | Evaluate on robustified benchmark |
| `pruned` | 2,768 examples (pruned by IBP) | Analyze removed samples |

#### Loading the Dataset Annotations

##### Load specific configuration

To load a specific subset, pass the config name to the `load_dataset` function:

```python
from datasets import load_dataset

# Load full dataset (default)
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench")
# or use the config name "full"
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench", "full")

# Load debiased version only
vsi_bench_debiased = load_dataset("nyu-visionx/VSI-Bench", "debiased")

# Load pruned examples only
vsi_bench_pruned = load_dataset("nyu-visionx/VSI-Bench", "pruned")
```

##### Load full dataset and filter using `pruned` column (recommended)

> [!TIP]
> **For LMMS-Eval users:** We have updated the `vsi-bench` task to automatically report scores on both full and debiased subsets. (TODO: LINK).
We recommend loading the "full" set, evaluating on all samples, and then using the `pruned` column to compute scores on both the full and debiased subsets.

```python
from datasets import load_dataset

# Load full dataset with pruned annotations
vsi_bench_full = load_dataset("nyu-visionx/VSI-Bench")

# Evaluate on full set
# (`evaluate_model` is a placeholder for your own inference code)
model_predictions = evaluate_model(vsi_bench_full)

# Score on both the full and debiased subsets
# (`compute_accuracy` is a placeholder for your own scoring code)
full_acc = compute_accuracy(model_predictions)
debiased_acc = compute_accuracy(model_predictions.filter(lambda x: not x["pruned"]))
```

### Evaluation

> [!TIP]
> ***TODO: link to the LMMS Eval Code***

VSI-Bench evaluates performance using two metrics. For multiple-choice questions, we use `Accuracy`, computed by exact match. For numerical-answer questions, we introduce a new metric, `MRA (Mean Relative Accuracy)`, to assess how closely model predictions align with ground truth values.

We provide an out-of-the-box evaluation of VSI-Bench in our [GitHub repository](https://github.com/vision-x-nyu/thinking-in-space), including the [metrics](https://github.com/vision-x-nyu/thinking-in-space/blob/main/lmms_eval/tasks/vsibench/utils.py#L109C1-L155C36) implementation used in our framework. For further details, users can refer to our paper and GitHub repository.

## Files

- `test-*.parquet`: Parquet files containing dataset annotations (questions, answers, metadata).
  * `test_debiased.parquet`: Annotations for the debiased subset (2,363 examples)
  * `test_pruned.parquet`: Annotations for the pruned subset (2,768 examples)
- `*.zip`: Compressed video files for the dataset
  * `arkitscenes.zip`: Videos for the ARKitScenes dataset
  * `scannet.zip`: Videos for the ScanNet dataset
  * `scannetpp.zip`: Videos for the ScanNet++ dataset
- `pruned_ids.txt`: List of example IDs removed by Iterative Bias Pruning
- `create_pq.py`: Convenience script to regenerate parquet files from `test.jsonl` and `pruned_ids.txt`. Can be run with `uv run create_pq.py`.
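To make the `MRA` metric named in the Evaluation section more concrete, here is a minimal sketch. This card does not state the formula, so everything below is an assumption: it follows one reading of the definition in the Thinking in Space paper, where a prediction counts as correct at confidence threshold θ if its relative error is below 1 − θ, and the result is averaged over θ = 0.50, 0.55, ..., 0.95. The function names are hypothetical, and the linked `utils.py` remains the authoritative implementation.

```python
def mean_relative_accuracy(pred, target):
    """MRA for one numerical answer (assumed definition, see lead-in).

    Assumes `target` is non-zero, since the relative error divides by it.
    """
    # Thresholds 0.50, 0.55, ..., 0.95, built from integers to limit float drift.
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    rel_err = abs(pred - target) / abs(target)
    hits = sum(1 for theta in thresholds if rel_err < 1 - theta)
    return hits / len(thresholds)


def average_mra(preds, targets):
    """Average the per-question MRA over a set of numerical questions."""
    values = [mean_relative_accuracy(p, t) for p, t in zip(preds, targets)]
    return sum(values) / len(values)
```

Under this reading, an exact prediction scores 1.0, a prediction off by 100% or more scores 0.0, and predictions in between earn partial credit according to how many thresholds they clear.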
## Citation

If you use these datasets in your research, please cite the original VSI-Bench paper and our debiasing paper that produced VSI-Bench-Debiased:

```bibtex
@inproceedings{yang2025thinking,
    title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
    author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
    booktitle={CVPR},
    year={2025},
}

@article{brown2025benchmark,
    title={{Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts}},
    author={Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
    year={2025},
    journal={arXiv preprint arXiv:2511.04655},
}
```

Provider: maas
Created: 2024-12-26