SpaceThinker
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/remyxai/SpaceThinker
下载链接
链接失效反馈官方服务:
资源简介:
# SpaceThinker Dataset

## Usage
[](https://colab.research.google.com/drive/1R64daHgR50GnxH3yn7mcs8rnldWL1ZxF?usp=sharing)
Try training a LLaVA-style VLM using the **SpaceThinker Dataset**
### Enhanced Quantitative Spatial Reasoning with Test-Time Compute
The **SpaceThinker** dataset is created using [VQASynth](https://github.com/remyxai/VQASynth) to synthesize spatial reasoning traces from a subset of images
in the localized narratives split of [the cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron).
## Data Samples
| <img src="https://github.com/remyxai/VQASynth/blob/main/assets/warehouse_sample_1.jpeg?raw=true" alt="sample_1" style="max-height: 350px;" > | <img src="https://github.com/remyxai/VQASynth/blob/main/assets/warehouse_sample_2.jpeg?raw=true" alt="sample_2" style="max-height: 350px;"> | <img src="https://github.com/remyxai/VQASynth/blob/main/assets/warehouse_sample_3.jpeg?raw=true" alt="sample_3" style="max-height: 350px;"> |
|----------|-------------|--------|
| **Prompt:** How far is the man in the red hat from the pallet of boxes in feet? | **Prompt:** How far is the Goal from the penalty taker in this image? | **Prompt:** Approximately how many meters apart are the chair and bookshelf? |
| **Reasoning:** To determine how far the man in the red hat is from the pallet of boxes, I need to consider the spatial arrangement in the warehouse. The man is walking on the floor, and there's a visible gap between him and the pallet. First, I'll estimate the size of the pallet. A standard pallet is typically around 4 feet wide and 8 feet long. Given that the man is standing near the edge of the pallet, he might be about 2–3 feet away from the front edge of the pallet. Next, I'll consider the distance from the pallet to the man. The man appears to be walking along the aisle, which seems to be about 6–7 feet wide. If he is standing at the edge of the pallet, he would likely be about 2–3 feet from the front edge of the pallet. Combining these observations, the man in the red hat is approximately 2–3 feet from the pallet of boxes. | **Reasoning:** First, I observe the scene where a player in a red uniform stands near the goal. The goal itself is a standard soccer goalpost, which typically extends 2.5 meters beyond the line of the goalposts. The player appears to be standing close to the goal, but not directly at the goal line. To estimate the distance, I consider that the player's position relative to the goal suggests he is likely standing closer than 2.5 meters away. Given the typical setup of a penalty kick scenario, the player would be positioned further back from the goal than the goal itself. By visual estimation, the player seems to be standing approximately 10–15 meters from the goal. This range aligns with the common distance for a penalty kick, considering the player's stance and the goal's structure. Thus, the distance from the goal to the penalty taker is likely around 10–15 meters. | **Reasoning:** To determine the distance between the chair and the bookshelf, I'll consider the typical dimensions of these objects. A standard office chair is around 60–70 cm tall, and a bookshelf can be anywhere from 1.2 to 1.8 meters tall. Given that the chair is likely placed on a desk or floor level, the height difference between the chair and the top of the bookshelf would be approximately 1 meter. Assuming the bookshelf is at least 1 meter high, and the chair is at ground level, the vertical distance between them is indeed 1 meter. The horizontal distance might vary depending on the layout, but given the room's dimensions and the placement of the chair relative to the bookshelf, the total distance between the chair and the top of the bookshelf is likely around 1 meter. Therefore, the chair and bookshelf are approximately 1 meter apart. |
| **Answer:** 2–3 feet | **Answer:** 10–15 meters | **Answer:** 1 meter |
## Data Creation
The VQASynth reasoning trace synthesis pipeline relies on a 3D scene graph extracted for each image.
- Describe and localize objects with Molmo point prompting SAM2.
- VGGT is applied to the image for 3D scene reconstruction.
- Synthesize consistent spatial reasoning traces with templated QA pairs as context for LLMs.

The **SpaceThinker** dataset includes over 12K samples synthesized using VQASynth on a subset of images in the localized narratives split of [the cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron).
**SpaceThinker** is formatted similar to the [Llama-Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) to
toggle reasoning for compatibility with the [Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1) base LLM.
The model builds upon the ideas from [SpatialVLM (Chen et al., 2024)](https://spatial-vlm.github.io/), introducing synthetic reasoning traces grounded on a 3D scene reconstruction pipeline using Molmo, VGGT, SAM2.
**Dataset Summary**
- ~12K synthetic spatial reasoning traces
- Question types: spatial relations (distances (units), above, left-of, contains, closest to)
- Format: image (RGB) + question + answer with reasoning traces
- Dataset: [remyxai/SpaceThinker](https://huggingface.co/datasets/remyxai/SpaceThinker)
- Code: [Synthetize Spatial Reasoning Traces with VQASynth](https://github.com/remyxai/VQASynth)
🧠 Check out the latest VLM trained using the **SpaceThinker** dataset: [SpaceThinker-Qwen2.5-VL-3B](https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B)
📖 Read more about [thinking VLM](https://huggingface.co/spaces/open-r1/README/discussions/10)
## Limitations
- Performance may degrade in cluttered environments or camera perspective.
- This dataset was generated using synthetic reasoning over an internet image dataset.
- Multimodal biases inherent to the underlying dataset (local narratives split of the Cauldron) may persist.
> Users are encouraged to evaluate outputs critically and consider adapting the synthesis pipeline for domain-specific safety and performance.
## Citation
```
@misc{VQASynth,
author = {remyxai},
title = {VQASynth},
year = {2024},
note = {GitHub repository},
url = {https://github.com/remyxai/VQASynth/tree/main}
}
```
# SpaceThinker 数据集

## Usage
[](https://colab.research.google.com/drive/1R64daHgR50GnxH3yn7mcs8rnldWL1ZxF?usp=sharing)
尝试使用**SpaceThinker 数据集**训练LLaVA风格的视觉语言模型(VLM)
### 基于测试时计算的增强型量化空间推理
**SpaceThinker 数据集**依托[VQASynth](https://github.com/remyxai/VQASynth)构建,从[the cauldron数据集](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)的本地化叙事拆分子集的图像中合成空间推理轨迹。
## 数据样本
| <img src="https://github.com/remyxai/VQASynth/blob/main/assets/warehouse_sample_1.jpeg?raw=true" alt="sample_1" style="max-height: 350px;" > | <img src="https://github.com/remyxai/VQASynth/blob/main/assets/warehouse_sample_2.jpeg?raw=true" alt="sample_2" style="max-height: 350px;"> | <img src="https://github.com/remyxai/VQASynth/blob/main/assets/warehouse_sample_3.jpeg?raw=true" alt="sample_3" style="max-height: 350px;"> |
|----------|-------------|--------|
| **提示:** 戴红帽子的男子与纸箱托盘之间的距离为多少英尺? | **提示:** 本图中球门与点球手之间的距离为多少? | **提示:** 椅子与书架之间的间距约为多少米? |
| **推理过程:** 为计算戴红帽男子与纸箱托盘的距离,需结合仓库内的空间布局进行分析。男子正行走在地面上,其与托盘间存在明显间隙。首先估算托盘尺寸:标准托盘的宽度约为4英尺,长度约为8英尺。鉴于男子站在托盘边缘附近,其与托盘前沿的距离约为2~3英尺。其次,观察男子行走的通道宽度约为6~7英尺,结合其站位可进一步确认,该男子与纸箱托盘的距离约为2~3英尺。 | **推理过程:** 首先观察场景:一名身着红色球衣的球员站在球门附近。标准足球球门的门柱线外通常延伸2.5米。该球员站位靠近球门但未处于球门线上。结合点球场景的典型布局,球员站位应位于球门后方一定距离。通过视觉估算,该球员与球门的距离约为10~15米,这一范围符合点球距离的常规设置,与球员姿态及球门结构相符。因此,球门与点球手的距离约为10~15米。 | **推理过程:** 为计算椅子与书架的间距,需先明确两类物体的典型尺寸。标准办公椅高度约为60~70厘米,书架高度通常为1.2~1.8米。假设椅子放置在桌面或地面上,其与书架顶部的垂直高度差约为1米。考虑书架高度至少为1米,椅子位于地面,二者垂直间距约为1米。结合房间布局与物体相对位置,椅子与书架的总间距约为1米。因此,二者间距约为1米。 |
| **答案:** 2–3英尺 | **答案:** 10–15米 | **答案:** 1米 |
## 数据构建
VQASynth推理轨迹合成流水线依托针对每张图像提取的三维场景图实现:
- 借助Molmo点提示SAM2完成物体描述与定位
- 对图像应用VGGT以实现三维场景重建
- 以模板化问答对作为大语言模型(LLMs)的上下文,生成一致的空间推理轨迹

**SpaceThinker 数据集**包含超过12000条合成样本,这些样本通过VQASynth在[the cauldron数据集](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)的本地化叙事拆分子集的图像上生成。**SpaceThinker** 的格式与[Llama-Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset)相似,支持开启/关闭推理功能,以兼容基础大语言模型[Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1)。
该模型借鉴了[SpatialVLM(Chen等,2024)](https://spatial-vlm.github.io/)的研究思路,引入了基于Molmo、VGGT、SAM2的三维场景重建流水线的合成推理轨迹。
**数据集摘要**
- 约12000条合成空间推理轨迹
- 问题类型:空间关系类(含距离(带单位)、上方、左侧、包含、最接近等)
- 数据格式:RGB图像+问题+带推理过程的答案
- 数据集地址:[remyxai/SpaceThinker](https://huggingface.co/datasets/remyxai/SpaceThinker)
- 代码地址:[基于VQASynth合成空间推理轨迹](https://github.com/remyxai/VQASynth)
🧠 体验使用**SpaceThinker 数据集**训练的最新视觉语言模型:[SpaceThinker-Qwen2.5-VL-3B](https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B)
📖 了解更多关于[推理型视觉语言模型(thinking VLM)](https://huggingface.co/spaces/open-r1/README/discussions/10)的内容
## 局限性
- 在杂乱场景或相机视角变化下,模型性能可能下降。
- 本数据集通过对互联网图像数据集进行合成推理生成。
- 底层数据集(the cauldron的本地化叙事拆分)固有的多模态偏差可能仍然存在。
> 建议用户对模型输出进行批判性评估,并考虑针对特定领域调整合成流水线以优化安全性与性能。
## 引用
@misc{VQASynth,
author = {remyxai},
title = {VQASynth},
year = {2024},
note = {GitHub repository},
url = {https://github.com/remyxai/VQASynth/tree/main}
}
提供机构:
maas
创建时间:
2025-10-09



