etri-vilab/MultihopSpatial

Name: etri-vilab/MultihopSpatial
Creator: etri-vilab
Published: 2026-03-20 12:05:56
License: apache-2.0 (annotations; images retain their original licenses)

Source: Hugging Face | Updated: 2026-03-20 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/etri-vilab/MultihopSpatial

Description:

---
license: apache-2.0
task_categories:
- visual-question-answering
- image-text-to-text
language:
- en
tags:
- spatial-reasoning
- multi-hop
- grounding
- vision-language
- benchmark
- VQA
- bounding-box
pretty_name: MultihopSpatial
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/multihop_train_6791.json
  - split: test
    path: data/multihop_test_4500.json
---

# MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Models

<img src="teaser_2.png" width="100%" alt="MultihopSpatial Benchmark Overview">

<a href="https://youngwanlee.github.io/multihopspatial">Project Page</a> | <a href="https://arxiv.org/abs/2603.18892">Paper</a> | <a href="https://huggingface.co/etri-vilab/MultiHopSpatial-Qwen3-VL-4B-Instruct">Model</a>

## Overview

**MultihopSpatial** is a benchmark designed to evaluate whether vision-language models (VLMs) demonstrate robustness in **multi-hop compositional spatial reasoning**. Unlike existing benchmarks that only assess single-step spatial relations, MultihopSpatial features queries with **1 to 3 reasoning hops** paired with **visual grounding evaluation**, exposing a critical blind spot: models achieving high multiple-choice accuracy often lack proper spatial localization.

All 4,500 benchmark QA pairs and bounding boxes are **strictly annotated by ten trained human experts** with an inter-rater agreement of 90% (Krippendorff's α = 0.90).

## Key Features

- **Multi-hop Composition**: Tests 1-hop, 2-hop, and 3-hop sequential spatial reasoning, mirroring real-world embodied AI needs.
- **Grounded Evaluation**: Addresses the "lucky guess" problem: models must both select the correct answer AND localize it via bounding box (Acc@50IoU).
- **Perspective-taking**: Includes both ego-centric and exo-centric viewpoints.
- **Three Spatial Categories**: Attribute (ATT), Position (POS), and Relation (REL), composable into multi-hop questions.
- **Training Data**: MultihopSpatial-Train (6,791 samples) supports post-training via reinforcement learning (e.g., GRPO).

## Dataset Statistics

### MultihopSpatial

| | **Ego-centric** | **Exo-centric** | **Total** |
|---|:---:|:---:|:---:|
| **1-hop** | 750 | 750 | 1,500 |
| **2-hop** | 750 | 750 | 1,500 |
| **3-hop** | 750 | 750 | 1,500 |
| **Total** | 2,250 | 2,250 | **4,500** |

### Spatial Reasoning Compositions

| **Hop** | **Categories** |
|---|---|
| 1-hop | ATT, POS, REL |
| 2-hop | ATT+POS, ATT+REL, POS+REL |
| 3-hop | ATT+POS+REL |

## Data Fields

| Field | Type | Description |
|---|---|---|
| `id` | `int` | Unique sample identifier |
| `image_path` | `string` | Image filename (e.g., `000000303219.jpg` or `01ce4fd6-..._002114.jpeg`) |
| `image_resolution` | `string` | Image resolution in `WxH` format |
| `view` | `string` | Viewpoint type: `"ego"` (ego-centric) or `"exo"` (exo-centric) |
| `hop` | `string` | Reasoning complexity: `"1hop"`, `"2hop"`, or `"3hop"` |
| `question` | `string` | The spatial reasoning question in plain text with multiple-choice options |
| `question_tag` | `string` | Same question with spatial reasoning type tags (`<ATT>`, `<POS>`, `<REL>`) annotated inline |
| `answer` | `string` | The correct answer choice (e.g., `"(c) frame of the reed picture"`) |
| `bbox` | `list[float]` | Bounding box `[x, y, width, height]` of the answer object in pixel coordinates |

### `question` vs `question_tag`

- **`question`**: Clean natural language question, e.g.,
  > *"From the perspective of the woman holding the remote control, which object is on her right?"*
- **`question_tag`**: Same question with spatial reasoning tags marking which type of reasoning each part requires, e.g.,
  > *"From the perspective of the woman holding the remote control, which object is **\<POS\>on her right\</POS\>**?"*

Tags: `<ATT>...</ATT>` (Attribute), `<POS>...</POS>` (Position), `<REL>...</REL>` (Relation)

## Data Structure

```
MultihopSpatial/
├── README.md
├── teaser_2.png
├── data/
│   ├── multihop_test_4500.json
│   ├── multihop_train_6791.json
│   └── images/
│       ├── 000000303219.jpg
│       ├── 000000022612.jpg
│       ├── 01ce4fd6-197a-4792-8778-775b03780369_002114.jpeg
│       └── ...
```

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("etri-vilab/MultihopSpatial")

# Access splits
test_data = dataset["test"]
train_data = dataset["train"]

# Example
sample = test_data[0]
print(sample["question"])
# "From the perspective of the woman holding the remote control, which object is on her right? ..."
print(sample["answer"])
# "(c) frame of the reed picture"
print(sample["bbox"])
# [52.86, 38.7, 70.95, 97.83]
print(sample["hop"])
# "1hop"
```

## Image Sources & License

| Component | License | Source |
|---|---|---|
| **VQA Annotations** (questions, answers, bounding boxes) | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | MultihopSpatial (this work) |
| **COCO Images** | [COCO Terms of Use](https://cocodataset.org/#termsofuse) | [MS-COCO](https://cocodataset.org/) |
| **PACO-Ego4D Images** | [Ego4D License](https://ego4ddataset.com/ego4d-data/license/) | [PACO](https://github.com/facebookresearch/paco) / [Ego4D](https://ego4ddataset.com/) |

> The images retain their original licenses. Our VQA annotations (questions, answers, bounding boxes, and metadata) are released under the Apache 2.0 License.

## Citation

```bibtex
@article{lee2025multihopspatial,
  title={MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Models},
  author={Lee, Youngwan and Jang, Soojin and Cho, Yoorhim and Lee, Seunghwan and Lee, Yong-Ju and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2603.18892},
  year={2025}
}
```

## Contact

For questions or issues, please visit the [Project Page](https://youngwanlee.github.io/multihopspatial_private) or open an issue in this repository.
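
## Example: Computing Grounded Accuracy (Sketch)

The card describes grounded evaluation (Acc@50IoU) as requiring both the correct answer choice and a correct localization. The following is a minimal sketch of how such a score could be computed, not the authors' official evaluation script. It assumes ground-truth samples carry the `answer` and `bbox` fields described under Data Fields, that a prediction is a dict with the same two keys (a hypothetical format), that answers are compared by their option letter, and that a box counts as correct when IoU is at least 0.5. The helper names `iou_xywh`, `choice_letter`, and `grounded_accuracy` are illustrative only.

```python
import re
from typing import Mapping, Optional, Sequence


def iou_xywh(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection over union of two [x, y, width, height] pixel boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Size of the overlap rectangle; a non-positive side means no overlap.
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def choice_letter(answer: str) -> Optional[str]:
    """Extract the option letter from a string such as '(c) frame of ...'."""
    match = re.match(r"\s*\(([A-Za-z])\)", answer)
    return match.group(1).lower() if match else None


def grounded_accuracy(
    samples: Sequence[Mapping],
    predictions: Sequence[Mapping],
    iou_threshold: float = 0.5,  # assumption: ">= 0.5" counts as localized
) -> float:
    """Fraction of samples with the right choice AND a sufficiently close box."""
    if not samples:
        return 0.0
    hits = 0
    for sample, pred in zip(samples, predictions):
        gold = choice_letter(sample["answer"])
        right_choice = gold is not None and choice_letter(pred["answer"]) == gold
        right_box = iou_xywh(sample["bbox"], pred["bbox"]) >= iou_threshold
        hits += int(right_choice and right_box)
    return hits / len(samples)
```

With this scoring, a prediction that picks the right option but places its box elsewhere contributes nothing, which is the "lucky guess" case the benchmark is designed to expose.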
