PhysicalAI-Spatial-Intelligence-Warehouse
Source: ModelScope community (魔搭社区). Updated 2026-05-09; indexed 2025-05-31.
Download link: https://modelscope.cn/datasets/nv-community/PhysicalAI-Spatial-Intelligence-Warehouse
Resource description:
# Physical AI Spatial Intelligence Warehouse
## Overview
The Physical AI Spatial Intelligence Warehouse is a comprehensive synthetic dataset designed to advance 3D scene understanding in warehouse environments. Generated using NVIDIA's Omniverse, this dataset focuses on spatial reasoning through natural language question-answering pairs that cover four key categories: spatial relationships (left/right), multi-choice questions, distance measurements, and object counting. Each data point includes RGB-D images, object masks, and natural language Q&A pairs with normalized single-word answers. The annotations are automatically generated using rule-based templates and refined using LLMs for more natural language responses. We hope this dataset will inspire new research directions and innovative solutions in warehouse automation, from intelligent inventory management to advanced safety monitoring.
## Dataset Description
### Dataset Owner(s)
NVIDIA
### Dataset Creation Date
Work on this dataset began in January 2025.
### Dataset Characterization
- Data Collection Method:
  - Synthetic: RGB images, depth images
- Labeling Method:
  - Automatic:
    - Object tags: Automatic with IsaacSim / Omniverse
    - Region masks: Automatic with IsaacSim / Omniverse
    - Text annotations, question-answer pairs: Automatic with rule-based templates, optionally refined with Llama-3.1-70B-Instruct (subject to redistribution and use requirements in the Llama 3.1 Community License Agreement at https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct/blob/main/LICENSE).
## Dataset Quantification
The training set contains 499k QA pairs, the test set 19k, and the validation set 1.9k. The dataset also includes about 95k RGB-D image pairs in total.
Questions cover 4 major categories:
- `left_right`: determine the spatial relationship between objects / regions
- `multi_choice_question(mcq)`: identify the index of the target among multiple candidate objects / regions
- `distance`: estimate the distance (in meters) between objects / regions
- `count`: count the objects of a given type that satisfy a condition (e.g., leftmost, specific categories)
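
To see how the four categories are distributed in a split, the records can be tallied by their `category` field. The following is a minimal sketch, assuming the records have already been loaded as a list of dictionaries (for example with `json.load` on `train.json` or `val.json`); the function name `tally_categories` is illustrative and not part of the dataset's utilities.

```python
from collections import Counter

def tally_categories(records):
    """Count QA records per question category.

    Records without a `category` field (such as those in test.json)
    are grouped under "unknown".
    """
    return Counter(r.get("category", "unknown") for r in records)

# Tiny in-memory stand-in for a loaded annotation file (illustrative only).
sample_records = [
    {"id": "a", "category": "distance"},
    {"id": "b", "category": "count"},
    {"id": "c", "category": "distance"},
    {"id": "d"},
]
for name, n in sorted(tally_categories(sample_records).items()):
    print(name, n)
```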

### Directory Structure
```shell
├── train
│   ├── depths
│   │   ├── <frame_id1>_depth.png
│   │   │   ...
│   │   └── <frame_idn>_depth.png
│   └── images
│       ├── <frame_id1>.png
│       │   ...
│       └── <frame_idn>.png
├── val
│   ├── depths
│   └── images
├── test
│   ├── depths
│   └── images
├── train.json
├── val.json
└── test.json
```
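
The layout above implies a simple naming rule: an RGB frame `<frame_id>.png` under `<split>/images/` has its depth map at `<split>/depths/<frame_id>_depth.png`. A minimal sketch of that lookup, assuming the dataset has been cloned to a local directory; the function name `frame_paths` and the `root` argument are illustrative and not part of the dataset's utilities.

```python
from pathlib import Path

def frame_paths(root, split, image_name):
    """Return (rgb_path, depth_path) for one frame.

    `image_name` is a file name such as "000190.png"; the depth map
    is assumed to share the frame ID with a "_depth" suffix, as shown
    in the directory tree.
    """
    root = Path(root)
    frame_id = Path(image_name).stem
    rgb = root / split / "images" / f"{frame_id}.png"
    depth = root / split / "depths" / f"{frame_id}_depth.png"
    return rgb, depth

rgb, depth = frame_paths("PhysicalAI-Spatial-Intelligence-Warehouse", "val", "000190.png")
print(rgb)
print(depth)
```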

### Annotation Format (3D-VLM-Challenge) for `Warehouse Spatial Intelligence`
Annotations are provided in `train.json` and `val.json`. Each file contains single-round QnA pairs with related meta information, following the LLaVA[1] format for VLM training.
In addition:
- the `normalized_answer` field supports quantitative evaluation with accuracy and error metrics between the ground-truth and predicted answers
- the original answer from `gpt` is also stored in the `freeform_answer` field
- `rle` holds the mask for each object, in order, in the pycocotools (COCO) RLE format (sample loading code is provided)
- `category` denotes the question category

Note that `test.json` only contains the `id`, `image`, `conversations`, and `rle` fields.
See below for a detailed example.
```json
{
    "id": "9d17ba0ab1df403db91877fe220e4658",
    "image": "000190.png",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nCould you measure the distance between the pallet <mask> and the pallet <mask>?"
        },
        {
            "from": "gpt",
            "value": "The pallet [Region 0] is 6.36 meters from the pallet [Region 1]."
        }
    ],
    "rle": [
        {
            "size": [
                1080,
                1920
            ],
            "counts": "bngl081MYQ19010ON2jMDmROa0ol01_RO2^m0`0PRODkm0o0bQOUO[n0U2N2M3N2N2N3L3N2N1N1WO_L]SOa3el0_LYSOb3il0]LTSOf3ll0ZLRSOh3nl0XLPSOj3Pm0VLmROn3Rm0SLkROo3Um0QLiROQ4Wm081O00N3L3N2O10000010O0000000001O01O00000001O10O01O003M2N0010O0000000001O01O00000M3N201N1001O00000001O01O000001O0001O000000000010O00000000010O0002N00001O3N1N000000000001O000000000O2M200O1M3N20001CQSOoKol0n3TSORLll0k3WSOVLhl0g3[SOYLel0f3\\SOZLdl0c3_SO]Lal0Z2nROcNe0RO^l0\\1kSO_OJUOmn0g0WQOZOhn0a0]QO_Odn0=_QOCan0:bQOF^n08eQOG[n07gQOIYn04jQOMUn00nQO0d[nm0"
        },
        {
            "size": [
                1080,
                1920
            ],
            "counts": "^PmU1j1no000000000000000000001O0000000000001O0000000000001O0000000000001O0000000000001O00000g1YN000001O01O00gNfQOTOZn0d0fQODZn06eQO_N1\\1Zn0OfQO<[n0^OgQOe0Yn0UOmQOl0Rn0oNnQOV1Rn0dNPRO`1Pn0[NTROf1lm0TNZROl1fm0oM_ROQ2en00O000000000M3K6J5K5K5K5N201O0000000000010O000000000010O0000000001O01O00M3K5K5K6N10000001O01O0000000001O01O00000000010OmLWROW2im0dM\\RO\\2dm0`M`RO`2`m0[MeROe2Xn0O0001O00000001O3NO01ON2L4L4O110O000000000010O0000000010O0000000000000001O0eMaQO]1_n0`NdQO`1\\n0\\NiQO_1[n0]NiQOY1an0aNeQOW1bo0H8G9G[oN_OfP19aoNG^fjc0"
        }
    ],
    "category": "distance",
    "normalized_answer": "6.36",
    "freeform_answer": "The pallet [Region 0] is 6.36 meters from the pallet [Region 1]."
}
```
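
Each `rle` entry in the example above is a COCO-style compressed run-length encoding: `size` is `[height, width]`, and `counts` packs alternating runs of background and foreground pixels in column-major order. The sketch below is a dependency-free Python port of that decoding, written for illustration only; it is not the dataset's own loading code, and the names `rle_counts_from_string` and `decode_rle` are hypothetical. In practice, `pycocotools.mask.decode` performs the same decoding far more efficiently.

```python
def rle_counts_from_string(s):
    """Decode a COCO compressed-RLE string into a list of run lengths."""
    counts = []
    p = 0
    while p < len(s):
        x, k, more = 0, 0, True
        while more:
            c = ord(s[p]) - 48            # characters are offset by 48
            x |= (c & 0x1F) << (5 * k)    # low 5 bits carry the payload
            more = bool(c & 0x20)         # the 0x20 bit flags a continuation
            p += 1
            k += 1
            if not more and (c & 0x10):
                x |= -1 << (5 * k)        # sign-extend negative values
        if len(counts) > 2:
            x += counts[-2]               # later runs are stored as deltas
        counts.append(x)
    return counts

def decode_rle(rle):
    """Return the mask as a list of rows (0 = background, 1 = object)."""
    height, width = rle["size"]
    flat = bytearray(height * width)      # column-major pixel buffer
    pos, value = 0, 0                     # runs start with background
    for run in rle_counts_from_string(rle["counts"]):
        if value:
            flat[pos:pos + run] = b"\x01" * run
        pos += run
        value ^= 1
    # Convert the column-major buffer into row-major rows.
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]

# Tiny hand-made 2x2 mask with runs 1, 2, 1 (illustrative, not dataset data).
for row in decode_rle({"size": [2, 2], "counts": "121"}):
    print(row)
```

For full 1080x1920 masks, a list-of-rows result is slow to post-process; converting the buffer to an array library of your choice, or calling pycocotools directly, is the more practical route.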
## Usage
#### Getting started
First, download the dataset:
```shell
# You can also use `huggingface-cli download`
git clone https://huggingface.co/datasets/nvidia/PhysicalAI-Spatial-Intelligence-Warehouse
cd PhysicalAI-Spatial-Intelligence-Warehouse
# The images and depths of the train/test subsets must be untarred
for dir in train test; do
    for subdir in images depths; do
        if [ -d "$dir/$subdir" ]; then
            echo "Processing $dir/$subdir"
            cd "$dir/$subdir"
            # Extract one archive at a time: when several archives are passed
            # to a single `tar -xzf`, all but the first are read as member names
            for chunk in chunk_*.tar.gz; do
                tar -xzf "$chunk"
            done
            # rm chunk_*.tar.gz
            cd ../..
        fi
    done
done
```
#### Visualization
```shell
python ./utils/visualize.py \
--image_folder ./val/images/ \
--depth_folder ./val/depths/ \
--annotations_file ./val.json \
--num_samples 10
```
#### Evaluation
To sanity-check your model and understand its performance, you can evaluate your results locally on the provided validation set. Test-set submissions must be a JSON file in the format below, in which both `id` and `normalized_answer` are required.
```json
[
    {
        "id": "000123",
        "normalized_answer": "1.22"
    },
    {
        "id": "ab23dm",
        "normalized_answer": "left"
    },
    {
        "id": "ac348d",
        "normalized_answer": "4"
    },
    …
]
```
If your predictions are saved at `utils/assets/perfect_predictions_val.json`, you can score them with:
```
# sanity check with perfect answer
python ./utils/compute_scores.py \
--gt_path ./val.json \
--pred_path ./utils/assets/perfect_predictions_val.json
```
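
A prediction file in the submission format described above can be assembled and checked before scoring. This is a minimal sketch under stated assumptions: answers are written as strings (matching the example entries), and IDs are expected to be unique, which the source does not state explicitly. The helper names `build_submission` and `check_submission` are illustrative and are not part of `utils/`.

```python
import json

REQUIRED_KEYS = {"id", "normalized_answer"}

def build_submission(predictions):
    """Turn a {sample_id: answer} mapping into a list of submission entries."""
    return [{"id": str(sample_id), "normalized_answer": str(answer)}
            for sample_id, answer in predictions.items()]

def check_submission(entries):
    """Raise ValueError on missing fields or duplicate IDs; return the entry count."""
    seen = set()
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {index} is missing {sorted(missing)}")
        if entry["id"] in seen:   # uniqueness is an assumption, not a stated rule
            raise ValueError(f"duplicate id {entry['id']!r}")
        seen.add(entry["id"])
    return len(entries)

entries = build_submission({"000123": 1.22, "ab23dm": "left", "ac348d": 4})
check_submission(entries)
print(json.dumps(entries, indent=4))
```

Writing `entries` to disk with `json.dump` then yields a file that can be passed as `--pred_path`.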
## References
[1] Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. In NeurIPS.
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Changelog
- **2025-05-24**: Initial data drop with train/val/test splits
Provider: maas
Created: 2025-05-25



