VSI-100k
Source: ModelScope community. Updated 2025-12-05, indexed 2025-11-15.
Download link:
https://modelscope.cn/datasets/Oppo/VSI-100k
Description:
<p align="center">
<!-- <h1 align="center"><img src="assets/logo.png" width="256"></h1> -->
<h1 align="center">Improved Visual-Spatial Reasoning via R1-Zero-Like Training</h1>
<p align="center">
<strong>Zhenyi Liao</strong>,
<strong>Qingsong Xie</strong>,
<strong>Yanhao Zhang</strong>,
<strong>Zijian Kong</strong>,
<strong>Haonan Lu</strong>,
<strong>Zhenyu Yang</strong>,
<strong>Zhijie Deng</strong>
</p>
<!-- 📖<a href="https://arxiv.org/abs/2504.00883">Paper</a> -->
<!-- 🤗<a href="https://huggingface.co/collections/laolao77/virft-datasets-67bc271b6f2833eccc0651df">
Datasets</a> | 🤗<a href="https://huggingface.co/papers/2503.01785">Daily Paper</a></h3> -->
<div align="center"></div>
<p align="center">
## 📅 News
- 🚀 [06/04/2025] We release VSI-100k.
- 🚀 [04/02/2025] We release our paper on <a href="https://arxiv.org/abs/2504.00883">arxiv</a>.
## 🌞 Highlights
<p>
🔔 We identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts.
🔔 We incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated **VSI-100k** dataset.
🔔 With GRPO training, our vsGRPO-2B outperforms GPT-4o, and vsGRPO-7B performs comparably to the best open-source model, LLaVA-NeXT-Video-72B.
</p>
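
For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses drawn for the same prompt: its reward is normalized by the group's mean and standard deviation. The snippet below is a minimal sketch of that normalization only. The function name, the `eps` constant, and the 0/1 reward values are illustrative assumptions, not the reward design or training code used for vsGRPO.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group.

    `eps` is a hypothetical stabilizer that avoids division by zero when
    every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one question
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to form the baseline.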
## 🤗 VSI-100k
To combat data scarcity, we build **VSI-100k**. Using the 3D annotations from ScanNet, we construct approximately 100k question-answer pairs for training.
We are releasing the raw data for the community. The question types fall into six categories:
- **Absolute Distance:** Given two unique objects in the scene, we provide the distance in meters between them.
- **Object Counting:** The total number of objects present in the entire scene.
- **Object Size:** The three dimensions of a unique object within the scene.
- **Relative Direction:** Given the observer's location and viewpoint, we provide the direction of the target relative to the observer. Note that there are three types of answers, distinguished following the VSI-Bench convention.
- **Relative Distance:** For a given object, we list other objects in the scene from closest to farthest.
- **Room Size:** The area of the room in the scene is provided in square meters.
<!-- - **Appearance Order:** We sort the objects within the scene by their order of appearance. (To do) -->
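
To make the categories above concrete, here is a minimal sketch of how such answers could be derived from 3D box annotations. Everything in it is an assumption for illustration: the `Box3D` record, the toy scene, the use of center-to-center distance, and the two-way left/right rule are hypothetical and are not the pipeline used to build VSI-100k (in particular, the three relative-direction answer types are not reproduced here).

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical annotation record: label, box center and extents in meters."""
    label: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]


def absolute_distance(a: Box3D, b: Box3D) -> float:
    # Assumption: distance is measured between box centers.
    return math.dist(a.center, b.center)


def object_counts(scene: list[Box3D]) -> Counter:
    # Number of instances per object label in the whole scene.
    return Counter(box.label for box in scene)


def relative_distance_order(anchor: Box3D, scene: list[Box3D]) -> list[str]:
    # Labels of all other objects, sorted from closest to farthest.
    others = [box for box in scene if box is not anchor]
    others.sort(key=lambda box: absolute_distance(anchor, box))
    return [box.label for box in others]


def relative_direction(observer: Box3D, facing: Box3D, target: Box3D) -> str:
    # Simplified left/right rule on the floor plane (z is up): the sign of
    # the 2D cross product between the viewing vector and the target vector.
    fx = facing.center[0] - observer.center[0]
    fy = facing.center[1] - observer.center[1]
    tx = target.center[0] - observer.center[0]
    ty = target.center[1] - observer.center[1]
    return "left" if fx * ty - fy * tx > 0 else "right"


# Toy scene with made-up coordinates.
scene = [
    Box3D("sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8)),
    Box3D("table", (3.0, 4.0, 0.4), (1.2, 0.8, 0.75)),
    Box3D("chair", (1.0, 0.0, 0.4), (0.5, 0.5, 0.9)),
    Box3D("chair", (-1.0, 2.0, 0.4), (0.5, 0.5, 0.9)),
]
sofa, table, chair_a, chair_b = scene

print(absolute_distance(sofa, table))              # Absolute Distance
print(object_counts(scene)["chair"])               # Object Counting
print(sofa.size)                                   # Object Size
print(relative_distance_order(sofa, scene))        # Relative Distance
print(relative_direction(sofa, table, chair_b))    # Relative Direction
```

Restricting distance and size questions to labels that occur once in the scene, as the "unique object" wording above suggests, keeps each question unambiguous. With `object_counts` that filter is a one-line check.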
Provider: maas
Created: 2025-08-19



