ViCA-322K

Name: ViCA-322K
Creator: maas
Published: 2026-01-09 15:22:46
License: No description available

ModelScope community; updated 2026-01-09; indexed 2025-08-23

Download link:

https://modelscope.cn/datasets/nkkbr/ViCA-322K

Resource description:

# ViCA-322K: A Dataset for Visuospatial Cognition in Real-World Indoor Videos

[![GitHub](https://img.shields.io/badge/GitHub-ViCA2-181717?logo=github&logoColor=white)](https://github.com/nkkbr/ViCA)
[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-ViCA2-blue)](https://huggingface.co/nkkbr/ViCA2)
[![arXiv](https://img.shields.io/badge/arXiv-2505.12363-B31B1B?logo=arxiv&link=https://arxiv.org/abs/2505.12363)](https://arxiv.org/abs/2505.12363)
[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-ViCA-blue)](https://huggingface.co/nkkbr/ViCA)
[![arXiv](https://img.shields.io/badge/arXiv-2505.12312-B31B1B?logo=arxiv&link=https://arxiv.org/abs/2505.12312)](https://arxiv.org/abs/2505.12312)

## Quickstart

You can load our dataset using the following code:

```python
from datasets import load_dataset
vica_322k_arkit_base = load_dataset("nkkbr/ViCA-322K", "arkitscenes_base")
```

Replace `"arkitscenes_base"` with any of the following configuration names depending on your need:

```python
["arkitscenes_base", "arkitscenes_complex", "scannet_base", "scannet_complex", "scannetpp_base", "scannetpp_complex"]
```

## Overview

**ViCA-322K** is a large-scale **video-question-answering dataset** tailored for training multimodal models with **visuospatial cognitive capabilities**. It contains **322,003 high-quality QA pairs** spanning 12 spatial cognition tasks, drawn from three RGB-D video datasets: **ARKitScenes**, **ScanNet**, and **ScanNet++**.

ViCA-322K includes both:

* **Structured metadata-grounded questions** for precise supervision
* **Observation-based questions** for complex, language-grounded visuospatial cognition

All videos depict diverse real-world indoor environments, and we ensure **no overlap** with commonly used evaluation splits (e.g., VSI-Bench).
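The six configuration names listed in the Quickstart follow a `<source>_<subset>` pattern, so they can be generated rather than typed out. The sketch below is a minimal illustration under that assumption; `SOURCES`, `SUBSETS`, `config_names`, and `load_all` are hypothetical helper names introduced here, not part of the dataset or of the `datasets` library. Only `load_dataset(path, name)` is taken from the Quickstart.

```python
from itertools import product

# Assumption: every configuration name is "<source>_<subset>",
# matching the six names listed in the Quickstart.
SOURCES = ["arkitscenes", "scannet", "scannetpp"]
SUBSETS = ["base", "complex"]


def config_names():
    """Return the configuration names in source-major order."""
    return [f"{source}_{subset}" for source, subset in product(SOURCES, SUBSETS)]


def load_all(repo_id="nkkbr/ViCA-322K"):
    """Load every configuration into a dict keyed by configuration name.

    The import is done lazily so that config_names() can be used
    without the `datasets` package installed. Calling this function
    downloads data and therefore needs network access.
    """
    from datasets import load_dataset

    return {name: load_dataset(repo_id, name) for name in config_names()}


print(config_names())
```

Calling `load_all()` would fetch all six configurations in one pass; loading a single configuration as in the Quickstart is usually enough when only one source dataset is needed.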
## Video Sources

The dataset is constructed from the **training splits** of:

* [ARKitScenes](https://github.com/apple/ARKitScenes)
* [ScanNet](http://www.scan-net.org/)
* [ScanNet++](https://kaldir.vc.in.tum.de/scannetpp/)

Each video in ViCA-322K is paired with either structured 3D metadata (used for generating base spatial cognition tasks), or natural-language question-answer annotations (for language-grounded reasoning tasks), enabling training across multiple levels of visuospatial cognition.

## Dataset Composition

![ViCA-322K Structure](assets/composition.png)

ViCA-322K is divided into:

### 1. **Base Data**

Automatically derived from structured **3D bounding box metadata**, this subset covers six core spatial cognition tasks:

* `object_count.json`
* `object_relative_distance.json`
* `object_size_estimation.json`
* `object_abs_distance.json`
* `obj_appearance_order.json`
* `room_size.json`

In **ARKitScenes**, we include an additional **experimental split**:

#### Triangular Positional Relationship (ARKitScenes only)

* **`triangular_positional_relationship.json` (21,707 entries)**
* Each question asks the model to describe the **side lengths and angles** of a triangle formed by three specified objects.

<details>
<summary>Example</summary>

```json
{
  "video": "arkitscenes/41125972.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nProvide the structural information of the triangle formed by stool, table, and tv_monitor, specifying both side lengths and angles."
    },
    {
      "from": "gpt",
      "value": "The triangle has sides of 3.7, 3.0, and 2.0 meters, with angles measuring 55, 32, and 94 degrees at their respective points."
    }
  ]
}
```

</details>

#### Answer Format Diversity in Base Data

While most base data questions are grounded in **numerical metadata**, the **answer formats vary**:

* **Numerical values** (e.g., room size):

```json
{
  "video": "arkitscenes/42897967.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWhat is the approximate floor area of this room in square meters? If multiple rooms are depicted, estimate their total combined size."
    },
    {
      "from": "gpt",
      "value": "21.05"
    }
  ]
}
```

* **Multiple-choice selection** (e.g., appearance order):

```json
{
  "video": "arkitscenes/42445088.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nIn what order do the categories make their first appearance in the video: refrigerator, oven, chair?\nA. refrigerator, oven, chair\nB. chair, oven, refrigerator\nC. chair, refrigerator, oven\nD. refrigerator, chair, oven\nPlease provide your answer by stating the letter followed by the full option."
    },
    {
      "from": "gpt",
      "value": "C. chair, refrigerator, oven"
    }
  ]
}
```

### 2. **Complex Spatial Cognition**

This subset contains open-ended, language-grounded tasks that require interpreting spatial layouts and relationships based solely on **video content** (no metadata at inference time).

#### Six task types:

* `conversation.json` – multi-turn spatial dialogues
* `important_daily_necessities.json` – reasoning about key objects for daily use
* `furniture.json` – furniture addition reasoning
* `usage.json` – scenario-based spatial planning
* `wheelchair_user.json` – accessibility reasoning
* `spatial_description.json` – freeform situated spatial descriptions

#### Data Format Examples

* **Multi-turn dialogue** with loss computed **only on GPT’s responses**:

```json
{
  "video": "arkitscenes/47430132.mp4",
  "conversations": [
    {"from": "human","value": "<image>\nIt looks like we're in a kitchen area. Can you tell me more about the major appliances we can see?"},
    {"from": "gpt","value": "Certainly! The refrigerator is located towards the left side of the kitchen space. It’s quite tall, making it a prominent feature. Next to it, there’s the stove, positioned slightly above it and closer in line, allowing for easy access when cooking. The oven is also in that area, nearby the stove and in front of another cabinet, which indicates a compact but efficient cooking setup."},
    {"from": "human","value": "That’s a good point! How is the sink situated in relation to the other appliances?"},
    {"from": "gpt","value": "The sink is situated towards the middle-right of the kitchen, above the oven and stove area. This arrangement allows for convenient flow when moving between the sink and cooking appliances, making it easier to handle tasks like washing vegetables or dishes while preparing meals."},
    {"from": "human","value": "Speaking of cabinets, there seem to be quite a few in the room. Where are they mainly located?"},
    {"from": "gpt","value": "Yes, there are several cabinets throughout the kitchen. There’s a tall cabinet towards the rear left and another cabinet positioned right above the stove and oven. This suggests good utilization of vertical space for storage. Additionally, there are more cabinets along the wall above the sink and refrigerator areas, likely providing ample storage for kitchen essentials."},
    {"from": "human","value": "What about the table? How does it fit into the kitchen’s layout?"},
    {"from": "gpt","value": "The table is located centrally in the kitchen, which is a strategic position for meal prep or dining. It's rectangular and slightly elevated, allowing for comfortable seating and movement around the area. Given its placement, it seems to serve as a multipurpose space, functional for both food preparation and casual dining."},
    {"from": "human","value": "I noticed stools in the area too. How are they arranged around the table?"},
    {"from": "gpt","value": "There are two stools positioned next to the table. Their close proximity suggests they are intended for use alongside the table, making it easy to sit and enjoy a meal or drink. Their placement implies they enhance the functionality of the table, providing additional seating without obstructing the flow of movement in the kitchen."},
    {"from": "human","value": "Overall, do the objects in the kitchen seem to be practically arranged?"},
    {"from": "gpt","value": "Yes, the layout appears highly functional. The placement of the refrigerator, stove, oven, and sink creates a logical workflow for cooking and cleaning. The table and stools are well-placed for dining, giving the kitchen an inviting atmosphere. The arrangement facilitates accessibility while allowing enough space for movement, which contributes to an overall efficient and user-friendly design."}
  ]
}
```

* **Single-turn detailed spatial description**:

```json
{
  "video": "arkitscenes/42898786.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nImagine that you are inside the space shown in this video. Describe your surroundings in detail, ensuring you account for the objects’ positions, sizes, and their spatial relationships."
    },
    {
      "from": "gpt",
      "value": "I find myself in a cozy living room that feels welcoming and warm. To my left, I can see a large, plush brown sofa adorned with several cushions, including some in shades of gold and a lively green. Directly in front of me, a bright yellow tufted ottoman breaks the linearity of the space, sitting atop a colorful striped rug that adds vibrancy to the wooden floor. To my right, an elegant armchair with a rich leather finish complements the aesthetic, positioned near a fireplace where a few decorative items are nestled. The room is well-lit by a standing lamp in the corner that casts a soft glow, while a television mounted on the wall reflects a modern touch juxtaposed with various framed photographs and trophies arranged neatly on a wooden surface beneath it. Near the window, a beautifully decorated Christmas tree adds a festive spirit, its ornaments twinkling gently, while gifts are carefully placed under its boughs. The overall ambiance is both relaxing and inviting, making it the perfect setting for gatherings or quiet reflection."
    }
  ]
}
```

All QA pairs are generated using **GPT-4o-mini**, with 10 paraphrased versions per prompt type to encourage linguistic diversity.

## Data Statistics

| Subset                           | QA Pairs    | Description                                              |
| -------------------------------- | ----------- | -------------------------------------------------------- |
| Base Data                        | 281,359     | From structured 3D metadata (6 spatial cognition tasks)  |
| ├─ Triangular Positional (ARKit) | 21,707      | Experimental geometry-based cognition                    |
| Complex Spatial Cognition        | 40,644      | Language-grounded cognition (multi-turn and descriptive) |
| **Total**                        | **322,003** |                                                          |

## File Structure

```
ViCA-322K/
├── arkitscenes/
│   ├── base/
│   │   ├── object_count.json
│   │   ├── object_relative_distance.json
│   │   ├── object_size_estimation.json
│   │   ├── object_abs_distance.json
│   │   ├── obj_appearance_order.json
│   │   ├── room_size.json
│   │   └── triangular_positional_relationship.json  ← ARKit exclusive
│   └── complex/
│       ├── conversation.json
│       ├── furniture.json
│       ├── important_daily_necessities.json
│       ├── spatial_description.json
│       ├── usage.json
│       └── wheelchair_user.json
├── scannet/
│   ├── base/
│   │   ├── object_count.json
│   │   ├── object_relative_distance.json
│   │   ├── object_size_estimation.json
│   │   ├── object_abs_distance.json
│   │   ├── obj_appearance_order.json
│   │   └── room_size.json
│   └── complex/
│       ├── conversation.json
│       ├── furniture.json
│       ├── important_daily_necessities.json
│       ├── spatial_description.json
│       ├── usage.json
│       └── wheelchair_user.json
├── scannetpp/
│   ├── base/
│   │   ├── object_count.json
│   │   ├── object_relative_distance.json
│   │   ├── object_size_estimation.json
│   │   ├── object_abs_distance.json
│   │   ├── obj_appearance_order.json
│   │   └── room_size.json
│   └── complex/
│       ├── conversation.json
│       ├── furniture.json
│       ├── important_daily_necessities.json
│       ├── spatial_description.json
│       ├── usage.json
│       └── wheelchair_user.json
```

## Usage

ViCA-322K is suitable for:

* Pretraining/fine-tuning **vision-language models**

This dataset was used to train our [ViCA-7B](https://huggingface.co/nkkbr/ViCA)/[ViCA2-7B](https://huggingface.co/nkkbr/ViCA2), a multimodal model for **fine-grained visuospatial cognition**.

## License

Released under the [Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-nc/4.0/).

## Related Resources

* [ViCA-7B Model](https://huggingface.co/nkkbr/ViCA)
* [ViCA2-7B Model](https://huggingface.co/nkkbr/ViCA2)
* [ViCA-thinking-2.68k Dataset](https://huggingface.co/datasets/nkkbr/ViCA-thinking-2.68k)

## Citation

If you find our work helpful, we would appreciate it if you cite the following papers.

```bibtex
@misc{feng2025vica2,
  title={Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts},
  author={Feng, Qi},
  publisher={arXiv:2505.12363},
  year={2025},
}
```

```bibtex
@misc{feng2025vica,
  title={Visuospatial Cognitive Assistant},
  author={Feng, Qi},
  publisher={arXiv:2505.12312},
  year={2025},
}
```
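Returning to the record format used in the data examples (a `video` path plus a `conversations` list of alternating `human` and `gpt` turns), the following minimal sketch shows one way to unpack a record into question-answer pairs and to mark which turns would receive loss when supervising only the `gpt` responses. The helper names `to_qa_pairs` and `supervised_mask` are hypothetical, the strict human/gpt alternation is an assumption based on the examples, and the mask is a turn-level illustration rather than the authors' training code.

```python
IMAGE_TOKEN = "<image>"  # placeholder that opens the first human turn in the examples


def to_qa_pairs(record):
    """Pair each human turn with the gpt turn that follows it.

    Assumes turns strictly alternate human, gpt, human, gpt, ...
    as in the examples shown in this README.
    """
    turns = record["conversations"]
    pairs = []
    for human, gpt in zip(turns[0::2], turns[1::2]):
        if human["from"] != "human" or gpt["from"] != "gpt":
            raise ValueError("expected alternating human/gpt turns")
        # Drop the video placeholder and surrounding whitespace from the question.
        question = human["value"].replace(IMAGE_TOKEN, "").strip()
        pairs.append((question, gpt["value"]))
    return pairs


def supervised_mask(record):
    """Return one boolean per turn: True where the turn is a gpt response."""
    return [turn["from"] == "gpt" for turn in record["conversations"]]


# The room-size record from the Base Data examples.
sample = {
    "video": "arkitscenes/42897967.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhat is the approximate floor area of this room in "
                     "square meters? If multiple rooms are depicted, estimate their "
                     "total combined size.",
        },
        {"from": "gpt", "value": "21.05"},
    ],
}

for question, answer in to_qa_pairs(sample):
    print(question, "->", answer)
print(supervised_mask(sample))
```

A real training pipeline would typically apply the mask at the token level after tokenization; the turn-level version here only illustrates which parts of a record are treated as targets.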


Provider:

maas

Created:

2025-08-19
