ViCA-thinking-2.68k
Source: ModelScope community. Updated 2025-12-26; indexed 2025-09-13.
Download link:
https://modelscope.cn/datasets/nkkbr/ViCA-thinking-2.68k
Description:
# ViCA-Thinking-2.68K
[GitHub: nkkbr/ViCA](https://github.com/nkkbr/ViCA)
[Hugging Face: nkkbr/ViCA2](https://huggingface.co/nkkbr/ViCA2)
[Hugging Face: nkkbr/ViCA](https://huggingface.co/nkkbr/ViCA)
## Quickstart
You can load our dataset using the following code:
```python
from datasets import load_dataset
vica_thinking = load_dataset("nkkbr/ViCA-thinking-2.68k")
```
## Overview
We created this dataset to further fine-tune the [ViCA](https://huggingface.co/nkkbr/ViCA) model. Our motivation is the observation that, after training on large-scale visuospatial instruction data (e.g., [ViCA-322K](https://huggingface.co/datasets/nkkbr/ViCA-322K)), ViCA tends to output final answers directly, without any intermediate reasoning. We hypothesize that **encouraging the model to generate explicit reasoning steps ("Thoughts") prior to the final answer** can improve its performance on benchmarks such as [VSI-Bench](https://github.com/ViCA-Lab/VSI-Bench).
## Data Generation Pipeline
We first trained [lmms-lab/LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) on the ViCA-322K dataset to obtain [ViCA](https://huggingface.co/nkkbr/ViCA). Then, we used the present dataset, **ViCA-Thinking-2.68K**, to further fine-tune ViCA and produce **ViCA-Thinking** — a version of the model that explicitly reasons before answering.
Each data point in this dataset contains:
* **Thoughts**: Intermediate reasoning
* **Response**: Explanation or answer from the model
* **Final Answer**: The boxed final result
Inspired by the formatting approach proposed in *[s1: Simple test-time scaling](https://github.com/simplescaling/s1)*, we structure each model response in the following three-part format:
```
<|im_start|>think
Thoughts
<|im_start|>response
Response
<|im_start|>final
Final Answer
```
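As a minimal sketch of how this format could be consumed, the snippet below splits one model turn into its three segments by looking for the `<|im_start|>think`, `<|im_start|>response`, and `<|im_start|>final` markers. The function name `split_sections` is hypothetical, and the parsing rule (each segment runs until the next marker or the end of the string) is an assumption based on the format shown above, not code from the ViCA repository.

```python
import re

# Section markers as they appear in the format above.
_SECTION = re.compile(r"<\|im_start\|>(think|response|final)\s*")


def split_sections(value: str) -> dict:
    """Split a model turn into its think / response / final segments.

    Assumption: each segment runs from its marker to the next marker
    (or to the end of the string).
    """
    sections = {}
    matches = list(_SECTION.finditer(value))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(value)
        sections[m.group(1)] = value[m.end():end].strip()
    return sections


sample = (
    "<|im_start|>think\nThoughts\n"
    "<|im_start|>response\nResponse\n"
    "<|im_start|>final\nFinal Answer\n"
)
parts = split_sections(sample)
print(parts["think"], "|", parts["response"], "|", parts["final"])
```

A segment that is missing from the input is simply absent from the returned dictionary, so callers can check for malformed entries with `"final" in parts`.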
### Example Entry
```json
[
  {
    "from": "human",
    "value": "<image>\nIf measured from the nearest point, what is the distance between stool and table in meters?"
  },
  {
    "from": "gpt",
    "value": "<|im_start|>think\n\n1. **Understand the Goal:** The request asks for the distance in meters between the *nearest points* of the stool and the table visible in the video.\n2. **Identify the Objects:**\n * **Stool:** A small, round, light wooden stool (similar to an Alvar Aalto stool) is visible. It appears first near a chest of drawers (around 0:07) and later positioned between two beds (clearly from 0:49 onwards).\n * **Table:** There isn't a conventional large table like a dining table or desk. The most prominent piece of furniture that functions as a surface or table is the large wooden *chest of drawers* (seen clearly around 0:07, and again from 0:37-0:49, sometimes with a changing mat on top). Let's assume this chest of drawers is the \"table\" referred to in the question.\n3. **Observe Positions:**\n * Around 0:07, the stool is visible positioned right next to the wooden chest of drawers.\n * From 0:49 onwards, the camera pans down, revealing two single beds. The stool has been moved and is now placed *between* the headboards of the two beds, acting as a shared bedside table (visible clearly at 0:51-0:57 and 1:59-2:03).\n4. **Determine \"Nearest Points\" and Context:** The question asks for the distance between the *nearest points*. This distance changes depending on where the stool is placed.\n * **Scenario 1 (Stool next to Chest):** When the stool is placed directly beside the chest of drawers (as seen around 0:07 and implied position just before 0:49), their nearest points (the side of the stool seat and the side of the chest) are virtually touching or only a few centimeters apart.\n * **Scenario 2 (Stool between Beds):** When the stool is between the beds, the distance to the chest of drawers (which is against a different wall) is much larger. Estimating this distance: The chest of drawers seems to be about 1 meter wide. The single beds are likely around 90cm wide. The stool sits between them. The distance across the floor from the stool (between the beds) to the chest of drawers looks to be roughly 1.5 to 2 meters.\n5. **Interpret the Question's Intent:** The question is ambiguous about *which* position of the stool to consider. However, asking for the distance between two objects often implies either their typical arrangement or the closest arrangement shown. The closest arrangement explicitly shown (or strongly implied) is when the stool is right next to the chest of drawers. In this configuration, the distance between their nearest points is minimal.\n6. **Estimate the Distance (Closest Scenario):** In the scenario where the stool is immediately adjacent to the chest of drawers (\"table\"), the distance between their nearest points is essentially zero or extremely close to zero (e.g., 0.01-0.05 meters if there's a tiny gap). For practical estimation purposes based on the visual, 0 meters is the most appropriate answer for this configuration.\n7. **Final Decision:** Based on the closest observed proximity, the distance is effectively 0 meters.\n\n<|im_start|>response\n\n1. The video shows a room with furniture including a wooden chest of drawers (functioning as a \"table\"/surface) and a small wooden stool.\n2. The stool is shown in two primary locations during the video:\n * Initially (around 0:07 and implied before 0:49), the stool is placed directly beside the wooden chest of drawers.\n * Later (from 0:49 onwards), the stool is moved and placed between two single beds, acting as a bedside table.\n3. The question asks for the distance between the *nearest points* of the stool and the \"table\" (chest of drawers).\n4. In the first observed position, where the stool is right next to the chest of drawers, their nearest surfaces are almost touching or have a negligible gap.\n5. Therefore, the distance between their nearest points in this configuration is approximately 0 meters.\n6. If considering the second position (stool between beds), the distance to the chest of drawers would be significantly larger (estimated 1.5-2 meters), but this represents the distance *after* the stool has been moved away from the chest. The closest distance depicted is near zero.\n7. Assuming the question refers to the closest proximity shown or implied, the distance is approximately 0 meters.\n\n<|im_start|>final\n\n\\[\n\\boxed{0}\n\\]"
  }
]
```
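The final segment of the example wraps its answer in `\boxed{...}`, so one plausible way to recover the answer is a small regular-expression helper. This is a sketch only: `extract_boxed` is a hypothetical name, and it assumes the boxed content contains no nested braces, which holds for the example above but is not verified for the whole dataset.

```python
import re
from typing import Optional

# Matches \boxed{...} where the content has no nested braces (assumption).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last boxed expression in text, or None."""
    found = _BOXED.findall(text)
    return found[-1].strip() if found else None


# The text that follows the final marker in the example entry, after JSON decoding.
final_segment = "\n\n\\[\n\\boxed{0}\n\\]"
answer = extract_boxed(final_segment)
print(answer)
```

Taking the last match rather than the first is a design choice: if the reasoning text happens to mention an intermediate boxed value, the closing one is more likely to be the intended final answer.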
## Dataset Composition
Although relatively small in size (only 2,680 examples), this dataset is **diverse and lightweight**, making it ideal for fine-tuning with minimal compute. Each entry corresponds to a **unique video**, and the data is sourced from three well-established datasets:
| Source Dataset | # of Examples |
| ---------------------------------------------------------- | ------------- |
| ARKitScenes ([link](https://github.com/apple/ARKitScenes)) | 935 |
| ScanNet ([link](http://www.scan-net.org)) | 898 |
| ScanNet++ ([link](https://kaldir.vc.in.tum.de/scannetpp/)) | 847 |
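As a quick arithmetic check, the three per-source counts in the table add up to the 2,680 examples stated above. The short script below recomputes the total and each source's share; the counts are taken directly from the table, and nothing else is assumed.

```python
# Per-source example counts, copied from the table above.
counts = {"ARKitScenes": 935, "ScanNet": 898, "ScanNet++": 847}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name}: {n} examples ({shares[name]:.1%})")
print(f"total: {total}")
```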
## Data Generation Process
The dataset was generated in **late April 2025** using **Gemini 2.5 Pro Preview 03-25**, one of the most powerful proprietary language models available at the time. This model is capable of generating **reasoning steps ("Thoughts")** followed by a final **Response**, and the overall quality of its reasoning is notably strong.
However, a major limitation of the Gemini API is that it does **not return the "Thoughts" segment separately**: only the final response is accessible programmatically. To work around this, we designed a custom prompting strategy that encourages the model to **embed its reasoning within the response**.
Our pipeline works as follows:
1. For a given video-question pair (e.g., from VSI-Bench), we generated **10 candidate responses** using Gemini.
2. From these, we kept only the ones that yielded the **correct final answer**.
3. We then extracted the corresponding **Thoughts** (i.e., the reasoning portions) from these correct completions.
4. To assess quality, we used **GPT-4o** to evaluate and **score each Thought**, selecting the **highest-rated one**.
5. We repeated this process for **five different video samples**, collecting a set of **five high-quality Thoughts**.
6. Based on these, we constructed a **guideline document** that instructs the model on how to write high-quality Thoughts.
7. Finally, for each target video-question pair, we prompted Gemini to:
* Follow the guideline;
* Refer to the five high-quality example Thoughts;
* Produce a new answer that includes a well-structured `Thought`, followed by the `Response`, and finally the `Final Answer`.
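Steps 2 to 4 of the pipeline (keep the correct candidates, score their Thoughts, take the best one) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `Candidate`, `select_best_thought`, and the `score` callback are hypothetical names, correctness is approximated by exact string match, and the judge model is replaced by a plain Python function so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    thoughts: str      # reasoning text extracted from one completion
    final_answer: str  # final answer parsed from the same completion


def select_best_thought(
    candidates: List[Candidate],
    ground_truth: str,
    score: Callable[[str], float],
) -> Optional[str]:
    """Keep candidates whose answer matches, then return the top-scored Thoughts.

    Assumption: exact string match stands in for whatever correctness
    check was actually used.
    """
    correct = [c for c in candidates if c.final_answer.strip() == ground_truth.strip()]
    if not correct:
        return None
    return max(correct, key=lambda c: score(c.thoughts)).thoughts


# Toy usage: text length stands in for the judge model's score (hypothetical).
cands = [
    Candidate("short", "2"),
    Candidate("a longer chain of reasoning", "2"),
    Candidate("the longest chain of reasoning, but wrong", "3"),
]
best = select_best_thought(cands, "2", score=len)
print(best)
```

In a real run, `score` would wrap a call to the judge model, and the function would be applied once per video-question pair to collect the example Thoughts.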
You can find the **full prompt template used during data generation** [here](https://huggingface.co/datasets/nkkbr/ViCA-thinking-2.68k/blob/main/prompt.md).
**Note:** Within that prompt template, you will see the following line:
```text
**What is the total number of tables shown in this scene?**
```
This placeholder question should be **replaced with the actual question** corresponding to each video in the dataset.
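The substitution can be sketched in a few lines of Python. The placeholder string is copied from the line quoted above; the function name `fill_prompt` and the stand-in template text are hypothetical, since the real template lives in `prompt.md` and is not reproduced here.

```python
# Placeholder line exactly as quoted from the prompt template.
PLACEHOLDER = "**What is the total number of tables shown in this scene?**"


def fill_prompt(template: str, question: str) -> str:
    """Swap the placeholder question for the actual one, keeping the bold markup."""
    if PLACEHOLDER not in template:
        raise ValueError("placeholder question not found in template")
    return template.replace(PLACEHOLDER, f"**{question}**")


# Stand-in template (hypothetical); load the real text from prompt.md instead.
template = "Guideline text.\n" + PLACEHOLDER + "\nMore guideline text."
prompt = fill_prompt(
    template,
    "If measured from the nearest point, what is the distance between stool and table in meters?",
)
print(prompt)
```

Raising an error when the placeholder is absent guards against silently sending the example question for every video.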
Provider: maas
Created: 2025-08-19



