LLDDSS/Spatial_Understanding

Name: LLDDSS/Spatial_Understanding
Creator: LLDDSS
Published: 2025-12-07 22:37:07
License: No description available

Hugging Face | Updated 2025-12-07 | Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/LLDDSS/Spatial_Understanding

Description:

---
dataset_info:
  features:
  - name: idx
    dtype: int32
  - name: type
    dtype: string
  - name: task
    dtype: string
  - name: image
    dtype: image
  - name: question
    dtype: string
  - name: choices
    list: string
  - name: answer
    dtype: string
  - name: prompt
    dtype: string
  - name: filename
    dtype: string
  - name: source
    dtype: string
  - name: source_dataset
    dtype: string
  - name: source_filename
    dtype: string
  - name: target_class
    dtype: string
  - name: target_size
    dtype: int32
  - name: bbox
    list:
      list: float32
  splits:
  - name: Whats_Up
    num_bytes: 802282940
    num_examples: 820
  - name: CV_Bench_Spatial
    num_bytes: 284815781
    num_examples: 1850
  - name: SEED_Bench_Spatial
    num_bytes: 740566967
    num_examples: 1635
  download_size: 1807258902
  dataset_size: 1827665688
configs:
- config_name: default
  data_files:
  - split: Whats_Up
    path: data/Whats_Up-*
  - split: CV_Bench_Spatial
    path: data/CV_Bench_Spatial-*
  - split: SEED_Bench_Spatial
    path: data/SEED_Bench_Spatial-*
---

# Purpose

**Spatial intelligence** is a fundamental component of both **Artificial General Intelligence (AGI)** and **Embodied AI**. It spans multiple cognitive levels: **Perception**, **Understanding**, and **Extrapolation** (see [this survey](https://www.techrxiv.org/users/992599/articles/1354538/master/file/data/Spatial_VLM_Survey_Techrxiv/Spatial_VLM_Survey_Techrxiv.pdf?inline=true#scrollbar=1&toolbar=1&statusbar=1&navpanes=1#)).

We construct a **composite benchmark** derived from several prior works. It is designed to measure the **Understanding** level of spatial intelligence in AI models, given visual cues.

## Overview

The benchmark integrates three sub-datasets: **What's Up**, **CV-Bench**, and **SEED-Bench**.

- **What's Up**: Derived from [this work](https://arxiv.org/pdf/2310.19785), **What's Up** emphasizes **relative spatial positions** between two objects within a scene. It evaluates how accurately a VLM can reason about orientations and spatial relationships.
- **SEED-Bench (Spatial Subset)**: Adapted from [this work](https://arxiv.org/pdf/2307.16125), which proposes a comprehensive benchmark for general VLM evaluation. In this repository, we select only the **Spatial Relation** and **Instance Localization** subsets to specifically measure spatial reasoning performance under grounded visual cues.
- **CV-Bench (Spatial Subset)**: Based on [this work](https://arxiv.org/pdf/2406.16860). The original **CV-Bench** includes four tasks: *Counting*, *Relation*, *Depth*, and *Distance*. To focus exclusively on **spatial understanding**, this version retains only the *Relation*, *Depth*, and *Distance* tasks.

## Citation

If you use this dataset in your research, please cite the original works linked above and acknowledge this composite benchmark.

```
@article{Liu_2025,
  title={Spatial Intelligence in Vision-Language Models: A Comprehensive Survey},
  url={http://dx.doi.org/10.36227/techrxiv.176231405.57942913/v2},
  DOI={10.36227/techrxiv.176231405.57942913/v2},
  publisher={Institute of Electrical and Electronics Engineers (IEEE)},
  author={Liu, Disheng and Liang, Tuo and Hu, Zhe and Peng, Jierui and Lu, Yiren and Xu, Yi and Fu, Yun and Yin, Yu},
  year={2025},
  month=nov
}

@article{kamath2023s,
  title={What's" up" with vision-language models? investigating their struggle with spatial reasoning},
  author={Kamath, Amita and Hessel, Jack and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2310.19785},
  year={2023}
}

@article{li2023seed,
  title={Seed-bench: Benchmarking multimodal llms with generative comprehension},
  author={Li, Bohao and Wang, Rui and Wang, Guangzhi and Ge, Yuying and Ge, Yixiao and Shan, Ying},
  journal={arXiv preprint arXiv:2307.16125},
  year={2023}
}

@article{tong2024cambrian,
  title={{Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}},
  author={Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and Wang, Austin and Fergus, Rob and LeCun, Yann and Xie, Saining},
  journal={arXiv preprint arXiv:2406.16860},
  year={2024}
}
```
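As a rough illustration of how the schema above might be used, the sketch below scores multiple-choice predictions per `task`. The repository ID and split names come from the card's metadata; everything else is an assumption made for illustration. In particular, the card does not say how `answer` is encoded (option text or option letter), so the case-insensitive string comparison in `normalize`, the helper names `load_split` and `accuracy_by_task`, and the toy records are all hypothetical.

```python
from collections import defaultdict

# Split names are taken from the dataset card's `configs` section.
SPLITS = ("Whats_Up", "CV_Bench_Spatial", "SEED_Bench_Spatial")


def load_split(split):
    """Load one split from the Hub (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so the rest runs without it
    return load_dataset("LLDDSS/Spatial_Understanding", split=split)


def normalize(text):
    """Assumption: answers match after trimming whitespace and lowercasing."""
    return text.strip().lower()


def accuracy_by_task(records, predictions):
    """Return {task: accuracy} for parallel sequences of records and predicted strings."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record, predicted in zip(records, predictions):
        task = record["task"]
        total[task] += 1
        if normalize(predicted) == normalize(record["answer"]):
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Hypothetical records that only mimic the card's field names; not real dataset rows.
toy_records = [
    {"task": "Relation", "choices": ["left", "right"], "answer": "left"},
    {"task": "Relation", "choices": ["above", "below"], "answer": "below"},
    {"task": "Depth", "choices": ["chair", "table"], "answer": "chair"},
]
toy_predictions = ["Left", "above", " chair "]

print(accuracy_by_task(toy_records, toy_predictions))
```

With real data, the same function could in principle be applied to `load_split("Whats_Up")` and a list of model outputs, once the actual format of the `answer` field has been checked against a few rows.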


Provider:

LLDDSS
