SenseNova-SI-800K
Source: ModelScope community · Updated 2026-05-16 · Indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/SenseNova/SenseNova-SI-800K
**EN** | [中文](README_CN.md)
# SenseNova-SI-800K
<a href="https://github.com/OpenSenseNova/SenseNova-SI" target="_blank">
<img alt="Code" src="https://img.shields.io/badge/SenseNova_SI-Code-100000?style=flat-square&logo=github&logoColor=white" height="20" />
</a>
<a href="https://arxiv.org/abs/2511.13719" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-SenseNova_SI-red?logo=arxiv" height="20" />
</a>
<a href="https://github.com/EvolvingLMMs-Lab/EASI" target="_blank">
<img alt="Code" src="https://img.shields.io/badge/EASI-Code-100000?style=flat-square&logo=github&logoColor=white" height="20" />
</a>
<a href="https://easi.lmms-lab.com/leaderboard" target="_blank">
<img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_EASI-Leaderboard-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
🔥 Please check out our newly released [**SenseNova-SI-8M**](https://huggingface.co/sensenova/SenseNova-SI-8M), the official full-scale training dataset of the SenseNova-SI series. SenseNova-SI-8M contains ~8.16 million carefully curated training samples spanning ~2.72 million unique images, organized under a rigorous taxonomy of spatial capabilities.
The SenseNova-SI-800K dataset provided here is a downsampled subset of SenseNova-SI-8M, specifically designed for studying scaling laws.
## Overview
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence.
In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the **SenseNova-SI family**,
built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel).
We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M:
eight million diverse data samples under a rigorous taxonomy of spatial capabilities.
SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks, while maintaining strong general multimodal understanding.
More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training,
examine the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate a potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously.
All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
*In the future, SenseNova-SI will be integrated with larger-scale in-house models.*
## Release Information
To facilitate research in this area, as a first step, we have released a highly effective subset, [**SenseNova-SI-800K**](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K).
SenseNova-SI is designed for studying scaling laws, and we observe that this initial release already captures a substantial portion of the gains.
With **SenseNova-SI-800K**, the trained model [**SenseNova-SI-1.1-InternVL3-8B-800K**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B-800K) demonstrates significant improvements over the base model, and achieves competitive performance against strong spatial intelligence baselines.
<table>
<thead>
<tr>
<th>Model</th>
<th>SI Dataset</th>
<th>VSI</th>
<th>MMSI</th>
<th>MindCube-Tiny</th>
<th>ViewSpatial</th>
<th>SITE</th>
</tr>
</thead>
<tbody>
<tr>
<td>InternVL3-8B</td><td>-</td><td>42.1</td><td>28.0</td><td>41.5</td><td>38.6</td><td>41.1</td>
</tr>
<tr>
<td>VST-7B-SFT</td><td>VST-P-4.1M</td><td>60.6</td><td>32.0</td><td>39.7</td><td>50.5</td><td>39.6</td>
</tr>
<tr>
<td>Cambrian-S-7B</td><td>VSI-590K</td><td>67.5</td><td>25.8</td><td>39.6</td><td>40.9</td><td>33.0</td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B-800K/">*SenseNova-SI-1.1-InternVL3-8B-800K</a></strong></td>
<td><strong><a href="https://huggingface.co/datasets/sensenova/SenseNova-SI-800K/">SenseNova-SI-800K</a></strong></td>
<td><strong>60.9</strong></td>
<td><strong>36.4</strong></td>
<td><strong>56.9</strong></td>
<td><strong>52.5</strong></td>
<td><strong>47.7</strong></td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B/">SenseNova-SI-1.1-InternVL3-8B</a></strong></td>
<td><strong><a href="https://huggingface.co/sensenova/SenseNova-SI-8M">SenseNova-SI-8M</a></strong></td>
<td><strong>68.7</strong></td>
<td><strong>43.3</strong></td>
<td><strong>85.6</strong></td>
<td><strong>54.6</strong></td>
<td><strong>47.7</strong></td>
</tr>
</tbody>
</table>
Note that \***SenseNova-SI-1.1-InternVL3-8B-800K** (marked with an asterisk in the table) is trained on the **SenseNova-SI-800K** subset to provide a reference for researchers working with the 800K-scale dataset. It is released exclusively for scaling-law analysis and research validation, and is **not intended to serve as a primary recommended model** of the SenseNova-SI series.
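To put "a substantial portion of the gains" into numbers, the short sketch below computes, for each benchmark, what fraction of the 8M model's improvement over InternVL3-8B is already reached by the 800K model. The scores are copied from the table above; the "gain fraction" metric and the names `gain_fraction` and `fractions` are illustrative choices of ours, not something defined by the SenseNova-SI authors.

```python
# Scores copied from the table above (same column order as the table).
BENCHMARKS = ["VSI", "MMSI", "MindCube-Tiny", "ViewSpatial", "SITE"]
BASE = [42.1, 28.0, 41.5, 38.6, 41.1]          # InternVL3-8B
SUBSET_800K = [60.9, 36.4, 56.9, 52.5, 47.7]   # trained on SenseNova-SI-800K
FULL_8M = [68.7, 43.3, 85.6, 54.6, 47.7]       # trained on SenseNova-SI-8M


def gain_fraction(base_score, subset_score, full_score):
    """Share of the full-data gain over the base model reached by the subset model."""
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full-data model shows no gain over the base model")
    return (subset_score - base_score) / full_gain


# Hypothetical summary metric: one fraction per benchmark, rounded to 2 decimals.
fractions = {
    name: round(gain_fraction(b, s, f), 2)
    for name, b, s, f in zip(BENCHMARKS, BASE, SUBSET_800K, FULL_8M)
}

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} of the 8M gain")
```

Under this metric the 800K subset recovers roughly 71% of the gain on VSI, 55% on MMSI, 35% on MindCube-Tiny, 87% on ViewSpatial, and all of it on SITE, where both models score 47.7.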
## Data format
Our data is stored in the **SenseNova-SI-800K.jsonl** file using the JSONL (JSON Lines) format, where each line represents an independent data entry. Each entry is a dictionary organized in the following format, containing three main fields: **`id`**, **`conversations`**, and **`image`**.
The `id` serves as a unique identifier for each data sample.
The `image` field is a list of strings specifying image paths, all given as paths relative to the root data directory.
The `conversations` field is a list of dialogue turns, where each turn is a dictionary with two key-value pairs: `from`, indicating the speaker identity (e.g., human or gpt), and `value`, indicating the textual content. Within `value`, the `<image>` placeholder marks where images are inserted, and the total number of `<image>` placeholders across all turns matches the number of images listed in the `image` field.
```json
{
"id": 0,
"conversations": [
{"from": "human", "value": "<image>\nuser input <image>\nuser input"},
{"from": "gpt", "value": "assistant output"},
{"from": "human", "value": "<image>\nuser input"},
{"from": "gpt", "value": "assistant output"}
],
"image": ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"]
}
```
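As a minimal sketch of how this format might be consumed, the snippet below reads a JSONL file line by line and checks the invariant described above: the total number of `<image>` placeholders across all turns equals the length of the `image` list. The function names (`check_entry`, `iter_entries`, `count_mismatches`) are hypothetical helpers, not part of the dataset's tooling, and treating a missing `image` field as an empty list is a defensive assumption on our part.

```python
import json

IMAGE_TOKEN = "<image>"


def check_entry(entry):
    """Return True if the <image> placeholder count equals the number of image paths."""
    placeholders = sum(
        turn["value"].count(IMAGE_TOKEN) for turn in entry["conversations"]
    )
    # Assumption: an entry without an "image" field is treated as having no images.
    return placeholders == len(entry.get("image", []))


def iter_entries(jsonl_path):
    """Yield one parsed entry per non-empty line of a JSONL file."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_mismatches(jsonl_path):
    """Count entries whose placeholder count disagrees with their image list."""
    return sum(1 for entry in iter_entries(jsonl_path) if not check_entry(entry))
```

Calling `count_mismatches("SenseNova-SI-800K.jsonl")` from the data root would then be expected to return 0 if every entry follows the format described here.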
## Download & Extract Images
The image data is packaged into **93 independent ~4 GB zip files** (`images_part_001.zip` through `images_part_093.zip`). Each zip can be extracted on its own; they are **not split volumes**, so you don't need all parts to extract any one of them. Every zip preserves the full `images/` directory structure, and extracting them all to the same destination reconstructs the complete image tree.
Two one-click extraction scripts are included in the repo root:
**Linux / macOS / Git Bash:**
```bash
bash extract_all.sh # extract to the script's parent directory
bash extract_all.sh /path/to/dir # extract to a specified directory
```
**Windows PowerShell:**
```powershell
.\extract_all.ps1 # extract to the script's parent directory
.\extract_all.ps1 -Dest D:\data # extract to a specified directory
```
You can also extract manually with any zip tool (e.g. `unzip`, 7-Zip, WinRAR) — each zip is a standard archive.
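For environments where neither script is convenient, manual extraction can also be done from Python's standard library. The sketch below is our own illustration, not a port of `extract_all.sh` or `extract_all.ps1` (whose internals are not shown here); the function name `extract_all` and its two parameters are hypothetical. It relies only on the facts stated above: the archives are named `images_part_*.zip`, each is a standard standalone zip, and all of them unpack into a shared `images/` tree.

```python
import zipfile
from pathlib import Path


def extract_all(zip_dir, dest):
    """Extract every images_part_*.zip found in zip_dir into dest.

    Returns the archive file names in the order they were processed.
    """
    zip_dir, dest = Path(zip_dir), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    archives = sorted(zip_dir.glob("images_part_*.zip"))
    for archive in archives:
        # Each part is an independent archive, so order does not matter;
        # sorting only makes the progress output predictable.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
        print(f"extracted {archive.name}")
    return [archive.name for archive in archives]
```

For example, `extract_all(".", "/path/to/dir")` run from the directory holding the downloaded zips would unpack whichever parts are present, which is useful when only a few of the 93 archives have been downloaded.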
### Evaluation
After training, you can use [EASI](https://github.com/EvolvingLMMs-Lab/EASI) to evaluate your model on mainstream spatial intelligence benchmarks.
EASI supports over 20 spatial intelligence models and more than 10 spatial benchmarks, offering Docker for one-click spatial intelligence evaluation.
## 🖊️ Citation
```bib
@InProceedings{sensenova-si,
title = {Scaling Spatial Intelligence with Multimodal Foundation Models},
author = {Cai, Zhongang and Wang, Ruisi and Gu, Chenyang and Pu, Fanyi and Xu, Junxiang and Wang, Yubo and Yin, Wanqi and Yang, Zhitao and Wei, Chen and Sun, Qingping and Zhou, Tongxi and Li, Jiaqi and Pang, Hui En and Qian, Oscar and Wei, Yukun and Lin, Zhiqian and Shi, Xuanke and Deng, Kewang and Han, Xiaoyang and Chen, Zukai and Fan, Xiangyu and Deng, Hanming and Lu, Lewei and Pan, Liang and Li, Bo and Liu, Ziwei and Wang, Quan and Lin, Dahua and Yang, Lei},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
```
Provider: maas
Created: 2025-12-22



