guangmulizi/PanoEnv
Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/guangmulizi/PanoEnv
Resource description:
---
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-text-to-text
language:
- en
tags:
- 3d-scene-understanding
- spatial-reasoning
- multi-view
- vqa
- panorama
- 360-degree
- equirectangular
size_categories:
- 10K<n<100K
---
# PanoEnv-QA: A Large-Scale Geometry-Grounded Panoramic VQA Benchmark for 3D Spatial Intelligence
<p align="center">
<img src="https://img.shields.io/badge/Task-Visual%20Question%20Answering-blue" />
<img src="https://img.shields.io/badge/Format-360°%20ERP-green" />
<img src="https://img.shields.io/badge/QA%20Pairs-14.8K-orange" />
<img src="https://img.shields.io/badge/Environments-60-purple" />
</p>
## 📖 Overview
**PanoEnv-QA** is a large-scale Visual Question Answering benchmark designed to probe **3D spatial intelligence** on **Equirectangular Projection (ERP)** panoramas. Built from synthetic but photorealistic 3D environments ([TartanAir](https://theairlab.org/tartanair-dataset/)), PanoEnv-QA offers **14,827 questions** (about 14.8K) spanning five categories that progressively require stronger 3D understanding, all grounded in precise 3D annotations (depth, semantics, and 3D bounding boxes).
### Key Features
- **Geometry-Grounded**: All QA pairs are programmatically derived from physical ground truth (depth maps, semantic segmentation, 3D bounding boxes)
- **360° Panoramic**: Targets the unique challenges of ERP images including geometric distortions and multi-view reasoning
- **RL-Ready**: Designed to serve both as a reliable evaluation benchmark and as a source of verifiable supervision signals for reinforcement learning
- **Diverse & Balanced**: 60 diverse environments with balanced question distribution across 5 major categories
## 📊 Dataset Statistics
| Split | Environments | Images | QA Pairs |
|:-----:|:------------:|:------:|:--------:|
| Train | 60 | 415 | 10,340 |
| Val | 60 | 60 | 1,496 |
| Test | 60 | 120 | 2,991 |
| **Total** | **60** | **595** | **14,827** |
### Question Distribution
| Major Category | # Questions | Percentage |
|:---------------|:-----------:|:----------:|
| Intrinsic Attribute Comparison | 2,975 | 20.1% |
| Object Distance Estimation | 2,975 | 20.1% |
| Relative Spatial Positioning | 2,975 | 20.1% |
| Environment Identification | 2,965 | 20.0% |
| Camera View Source Identification | 2,937 | 19.8% |
### Question Types
| Type | Count | Percentage |
|:-----|:-----:|:----------:|
| Multiple Choice | 7,552 | 50.9% |
| True/False | 4,300 | 29.0% |
| Open-Ended | 2,975 | 20.1% |
### Answer Characteristics
- **1,894** unique answers
- Average answer length: **10.9** characters
- Yes/No ratio: **45.3% / 54.7%** (balanced to prevent shortcuts)
## 🎯 Five Question Categories
### ① Camera View Source Identification
Evaluates whether the model recognizes that an ERP image is a composite panorama stitched from **six perspective views** (front/back/left/right/top/bottom). Understanding this structure is essential for handling artifacts near seam boundaries.
**Sub-categories**: `primary_view`, `multi_view_visibility`, `seam_attribution`, `multi_object_relationship`, `shared_visibility`
### ② Object Distance Estimation
Evaluates quantitative and qualitative **depth reasoning**, moving beyond 2D heuristics (e.g., size as a proxy for distance) toward true 3D understanding.
**Sub-categories**: `depth_similarity`, `depth_binary`, `depth_compare`, `depth_triplet_farthest`, `distance_description`
### ③ Environment Identification
Evaluates high-level **scene understanding and contextual reasoning**, testing whether the model can classify environments based on object composition and architectural style.
**Sub-categories**: `env_binary_judgement`, `env_mcq`, `env_confusable_pair`, `env_scene_judgement`, `env_category_identification`, `env_attribute`
### ④ Relative Spatial Positioning
Assesses the model's ability to reconstruct accurate **3D spatial relationships** between objects, a task made inherently difficult by ERP distortions.
**Sub-categories**: `relpos_cardinal`, `relpos_binary`, `relpos_distance_straightline`, `relpos_distance_components`, `relpos_triplet_extreme`
### ⑤ Intrinsic Attribute Comparison
Probes the model's understanding of **intrinsic, view-independent physical properties** of objects (3D shape and size), requiring inference from 2D projections and depth.
**Sub-categories**: `volume_comparison`, `volume_binary`, `size_triplet_extreme`, `shape_flatness`, `shape_elongation`
## 📁 Data Structure
Each sample (`*_qa.json`) contains:
### `sampled_objects` (20 objects per image)
```json
{
  "label": "building",
  "bbox": [x1, y1, x2, y2],
  "depth": 12.5,
  "area": 15000,
  "primary_camera": "front",
  "visible_cameras": ["front", "left", "top"],
  "depth_stats": {
    "p20": 10.5, "p25": 11.0, "p50": 12.5, "p75": 14.0, "p80": 14.5, "iqr": 3.0
  },
  "bbox_3d": {
    "min_x": -5.0, "max_x": 5.0,
    "min_y": 0.0, "max_y": 10.0,
    "min_z": 8.0, "max_z": 15.0
  },
  "volume": 700.0,
  "centroid_3d": [0.0, 5.0, 11.5],
  "is_seam": true,
  "seam_types": ["crosses_left_back"],
  "is_polar": false
}
```
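The example record above is consistent with simple axis-aligned box geometry: the extents 10 × 10 × 7 give the listed `volume` of 700.0, the midpoints give `centroid_3d`, and `p75 - p25` gives the listed `iqr`. The sketch below reproduces those three values. It assumes, without confirmation from the card, that the dataset derives these fields this way; the function names are illustrative.

```python
def bbox_volume_and_centroid(bbox_3d):
    """Volume and centre of an axis-aligned box given min/max extents per axis."""
    extents = []
    centroid = []
    for axis in ("x", "y", "z"):
        lo = bbox_3d[f"min_{axis}"]
        hi = bbox_3d[f"max_{axis}"]
        extents.append(hi - lo)
        centroid.append((lo + hi) / 2)
    volume = extents[0] * extents[1] * extents[2]
    return volume, centroid


def interquartile_range(depth_stats):
    """Spread of the object's depth values between the 25th and 75th percentile."""
    return depth_stats["p75"] - depth_stats["p25"]


# Values copied from the example record above.
example_bbox = {
    "min_x": -5.0, "max_x": 5.0,
    "min_y": 0.0, "max_y": 10.0,
    "min_z": 8.0, "max_z": 15.0,
}
example_depth_stats = {"p25": 11.0, "p50": 12.5, "p75": 14.0}

volume, centroid = bbox_volume_and_centroid(example_bbox)
iqr = interquartile_range(example_depth_stats)
print(volume, centroid, iqr)
```

A check like this can serve as a quick sanity test when parsing `*_qa.json` files, though real records may differ from the illustrative one.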
### `questions` (about 25 questions per image)
```json
{
  "major_category": "relative_position",
  "sub_category": "relpos_cardinal",
  "question_type": "open_ended",
  "question": "What is the spatial relationship of the building relative to the tree in the 3D world?",
  "answer": "The building is in front of and to the right of and above the tree.",
  "related_object_ids": [1, 5],
  "question_id": 1
}
```
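Because every question carries `sub_category` and `question_type` fields, a per-image breakdown can be computed with a simple tally. The following is a minimal sketch on hand-written toy records, not real dataset rows; `tally_questions` is a hypothetical helper name.

```python
from collections import Counter


def tally_questions(questions, key="question_type"):
    """Count questions by the value stored under `key`."""
    return Counter(q[key] for q in questions)


# Toy records using only field names and values that appear in this card.
toy_questions = [
    {"sub_category": "relpos_cardinal", "question_type": "open_ended"},
    {"sub_category": "relpos_binary", "question_type": "true_false"},
    {"sub_category": "depth_binary", "question_type": "true_false"},
    {"sub_category": "env_mcq", "question_type": "multiple_choice"},
]

by_type = tally_questions(toy_questions)
by_sub_category = tally_questions(toy_questions, key="sub_category")
print(by_type)
print(by_sub_category)
```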
### `visualizations/`
PNG visualizations for each question showing the relevant objects highlighted.
## 🌍 60 Diverse Environments
<details>
<summary>Click to expand full environment list</summary>
**Industrial & Infrastructure**
- AbandonedCable, AbandonedFactory, AbandonedFactory2, CarWelding, CoalMine, ConstructionSite, FactoryWeather, IndustrialHangar, OldIndustrialCity, Sewerage, UrbanConstruction
**Urban & City**
- CyberPunk, CyberPunkDowntown, Downtown, HongKong, JapaneseAlley, JapaneseCity, ModernCityDowntown, ModularNeighborhood, ModularNeighborhoodIntExt, ModUrbanCity, Rome, SoulCity, VictorianStreet
**Historical & Cultural**
- AncientTowns, Antiquity3D, CastleFortress, GothicIsland, HQWesternSaloon, MiddleEast, OldTownFall, OldTownNight, OldTownSummer, OldTownWinter, Ruins, WesternDesertTown
**Residential & Interior**
- AmericanDiner, ArchVizTinyHouseDay, ArchVizTinyHouseNight, CountryHouse, Hospital, House, Office, OldBrickHouseDay, OldBrickHouseNight, Prison, Restaurant, RetroOffice, Supermarket
**Nature & Special**
- AbandonedSchool, AmusementPark, Apocalyptic, DesertGasStation, Fantasy, NordicHarbor, Ocean, PolarSciFi, SeasideTown, WaterMillDay, WaterMillNight
</details>
## 🔬 Benchmark Results
We evaluated 14 state-of-the-art VLMs on our test set. The table below lists a subset of the baselines alongside our model:
| Model | Total Acc. (%) | T/F (%) | MCQ (%) | OE (%) |
|:------|:--------------:|:-------:|:-------:|:------:|
| Qwen2.5-VL-7B | 49.34 | 65.19 | 57.24 | 6.39 |
| Qwen2.5-VL-32B | 42.70 | 62.47 | 44.96 | 8.36 |
| InternVL2.5-26B | 47.07 | 64.51 | 54.33 | 3.44 |
| Qwen3-VL-8B | 47.91 | 62.85 | 55.24 | 7.70 |
| DeepSeek-VL2-Base | 38.86 | 57.30 | 40.36 | 8.36 |
| **GRPO-Balanced (Ours)** | **52.93** | **68.78** | **58.90** | **14.83** |
**Key Findings:**
- Best zero-shot accuracy is only **49.34%**, revealing significant gaps in 3D spatial understanding
- Open-ended accuracy collapses to **< 9%** for all baselines
- Our GRPO-trained 7B model achieves **SoTA** performance, outperforming 32B models
- OE accuracy improves from 6.39% to 14.83% (**+132% relative gain**)
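The relative gain quoted in the last finding follows directly from the two open-ended accuracies in the table. A short worked check, using only those two numbers:

```python
# Open-ended accuracy (%) from the results table.
baseline_oe = 6.39   # Qwen2.5-VL-7B
grpo_oe = 14.83      # GRPO-Balanced (Ours)

absolute_gain = grpo_oe - baseline_oe               # percentage points
relative_gain = absolute_gain / baseline_oe * 100   # percent of the baseline

print(f"+{absolute_gain:.2f} points, +{relative_gain:.0f}% relative")
```

The absolute improvement is 8.44 percentage points; dividing by the 6.39% baseline gives roughly 132%.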
## 🚀 Usage
### Loading the Dataset
```python
from datasets import load_dataset
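# Load the full dataset (repository ID as given in the original card;
# this page lists the dataset under guangmulizi/PanoEnv)
dataset = load_dataset("7zkk/PanoEnv")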
# Access splits
train_data = dataset["train"]
val_data = dataset["val"]
test_data = dataset["test"]
```
### Example: Accessing a Sample
```python
sample = dataset["train"][0]
print(f"Environment: {sample['env']}")
print(f"Image ID: {sample['image_id']}")
print(f"Number of objects: {len(sample['sampled_objects'])}")
print(f"Number of questions: {len(sample['questions'])}")
```
### For RL Training (GRPO)
PanoEnv-QA is designed to support reinforcement learning with ground-truth-guided rewards:
```python
# Example reward routing based on question type.
# mcq_matching_reward and spatial_reward are placeholders not defined in this snippet.
def get_reward(question_type, prediction, ground_truth):
    if question_type == "true_false":
        return 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0
    elif question_type == "multiple_choice":
        return mcq_matching_reward(prediction, ground_truth)
    elif question_type == "open_ended":
        return spatial_reward(prediction, ground_truth)  # axis-wise matching
    raise ValueError(f"Unknown question type: {question_type!r}")
```
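The card does not define the two helpers called in the routing function. The sketch below shows one possible reading of "axis-wise matching", based on the phrasing of the sample answer ("in front of and to the right of and above"): score the fraction of spatial axes on which the prediction agrees with the ground truth. The phrase vocabulary, the normalization in `mcq_matching_reward`, and both function bodies are assumptions for illustration, not the authors' implementation.

```python
import re

# Hypothetical phrase vocabulary, one pair of opposites per spatial axis.
AXIS_PHRASES = {
    "depth": ("in front of", "behind"),
    "horizontal": ("to the left of", "to the right of"),
    "vertical": ("above", "below"),
}


def extract_axes(text):
    """Map each axis to the single direction phrase found in the text, if any."""
    text = text.lower()
    found = {}
    for axis, phrases in AXIS_PHRASES.items():
        hits = [p for p in phrases if p in text]
        if len(hits) == 1:  # skip axes that are absent or contradictory
            found[axis] = hits[0]
    return found


def spatial_reward(prediction, ground_truth):
    """Fraction of ground-truth axes whose direction the prediction matches."""
    truth = extract_axes(ground_truth)
    if not truth:
        return 0.0
    pred = extract_axes(prediction)
    matches = sum(pred.get(axis) == phrase for axis, phrase in truth.items())
    return matches / len(truth)


def mcq_matching_reward(prediction, ground_truth):
    """1.0 if the answers match after lowercasing and stripping punctuation."""
    def normalize(s):
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return 1.0 if normalize(prediction) == normalize(ground_truth) else 0.0


truth = "The building is in front of and to the right of and above the tree."
guess = "The building is in front of and to the left of and above the tree."
print(spatial_reward(guess, truth))  # two of three axes agree
```

Partial credit per axis gives a denser signal than exact string matching, which may matter for open-ended answers where baselines score below 9%.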
## 📐 Technical Details
### ERP to 3D Projection
For any pixel $(p_x, p_y)$ in an ERP image of size $W \times H$:
**Spherical coordinates:**
$$\lambda = \left(\frac{p_x}{W} - 0.5\right) \cdot 2\pi, \quad \phi = -\left(\frac{p_y}{H} - 0.5\right) \cdot \pi$$
**3D Cartesian coordinates:**
$$x = -d \cdot \cos(\phi) \cdot \sin(\lambda)$$
$$y = d \cdot \sin(\phi)$$
$$z = -d \cdot \cos(\phi) \cdot \cos(\lambda)$$
where $d$ is the depth value. The coordinate system is right-handed with $+Y$ upward, $+X$ rightward, and $-Z$ forward.
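The projection can be transcribed directly into code. The sketch below implements the formulas exactly as written above. The function name and the 2048 × 1024 image size are illustrative assumptions, not values taken from the dataset.

```python
import math


def erp_pixel_to_3d(px, py, width, height, depth):
    """Map an ERP pixel and its depth to 3D coordinates using the formulas above."""
    lam = (px / width - 0.5) * 2.0 * math.pi   # longitude, in [-pi, pi]
    phi = -(py / height - 0.5) * math.pi       # latitude, in [-pi/2, pi/2]
    x = -depth * math.cos(phi) * math.sin(lam)
    y = depth * math.sin(phi)
    z = -depth * math.cos(phi) * math.cos(lam)
    return x, y, z


# Assumed image size for illustration only.
W, H = 2048, 1024

center = erp_pixel_to_3d(W / 2, H / 2, W, H, 10.0)  # image centre
top = erp_pixel_to_3d(W / 2, 0, W, H, 10.0)         # top edge of the image
arbitrary = erp_pixel_to_3d(300, 700, W, H, 4.0)

print(center, top, math.hypot(*arbitrary))
```

The image centre maps to a point on the $-Z$ axis and the top edge maps to $+Y$, matching the stated convention. The returned vector always has length $d$, so under these formulas $d$ acts as a radial distance from the camera rather than a planar z-depth.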
## 📚 Citation
If you use PanoEnv-QA in your research, please cite:
```bibtex
@inproceedings{panoenv2026,
title={PanoEnv-QA: A Large-Scale Geometry-Grounded Panoramic VQA Benchmark for 3D Spatial Intelligence},
author={Zekai Lin and Xu Zheng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}
```
## 🙏 Acknowledgments
This dataset is built upon [TartanAir](https://theairlab.org/tartanair-dataset/), a synthetic dataset providing precise 3D ground truth (depth and segmentation).
## 📄 License
This dataset is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
Provided by:
guangmulizi



