guangmulizi/PanoEnv

Name: guangmulizi/PanoEnv
Creator: guangmulizi
Published: 2026-04-06 09:19:02
License: CC BY 4.0

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/guangmulizi/PanoEnv

Resource description:

---
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-text-to-text
language:
- en
tags:
- 3d-scene-understanding
- spatial-reasoning
- multi-view
- vqa
- panorama
- 360-degree
- equirectangular
size_categories:
- 10K<n<100K
---

# PanoEnv-QA: A Large-Scale Geometry-Grounded Panoramic VQA Benchmark for 3D Spatial Intelligence

<p align="center">
  <img src="https://img.shields.io/badge/Task-Visual%20Question%20Answering-blue" />
  <img src="https://img.shields.io/badge/Format-360°%20ERP-green" />
  <img src="https://img.shields.io/badge/QA%20Pairs-14.8K-orange" />
  <img src="https://img.shields.io/badge/Environments-60-purple" />
</p>

## 📖 Overview

**PanoEnv-QA** is a large-scale Visual Question Answering benchmark designed specifically to probe **3D spatial intelligence** on **Equirectangular Projection (ERP)** panoramas. Built from synthetic but photorealistic 3D environments ([TartanAir](https://theairlab.org/tartanair-dataset/)), PanoEnv-QA offers over **14.8K questions** spanning five categories that progressively require stronger 3D understanding, all grounded in precise 3D annotations (depth, semantics, and 3D bounding boxes).

### Key Features

- **Geometry-Grounded**: All QA pairs are programmatically derived from physical ground truth (depth maps, semantic segmentation, 3D bounding boxes)
- **360° Panoramic**: Targets the unique challenges of ERP images including geometric distortions and multi-view reasoning
- **RL-Ready**: Designed to serve both as a reliable evaluation benchmark and as a source of verifiable supervision signals for reinforcement learning
- **Diverse & Balanced**: 60 diverse environments with balanced question distribution across 5 major categories

## 📊 Dataset Statistics

| Split | Environments | Images | QA Pairs |
|:-----:|:------------:|:------:|:--------:|
| Train | 60 | 415 | 10,340 |
| Val | 60 | 60 | 1,496 |
| Test | 60 | 120 | 2,991 |
| **Total** | **60** | **595** | **14,827** |

### Question Distribution

| Major Category | # Questions | Percentage |
|:---------------|:-----------:|:----------:|
| Intrinsic Attribute Comparison | 2,975 | 20.1% |
| Object Distance Estimation | 2,975 | 20.1% |
| Relative Spatial Positioning | 2,975 | 20.1% |
| Environment Identification | 2,965 | 20.0% |
| Camera View Source Identification | 2,937 | 19.8% |

### Question Types

| Type | Count | Percentage |
|:-----|:-----:|:----------:|
| Multiple Choice | 7,552 | 50.9% |
| True/False | 4,300 | 29.0% |
| Open-Ended | 2,975 | 20.1% |

### Answer Characteristics

- **1,894** unique answers
- Average answer length: **10.9** characters
- Yes/No ratio: **45.3% / 54.7%** (balanced to prevent shortcuts)

## 🎯 Five Question Categories

### ① Camera View Source Identification

Evaluates whether the model recognizes that an ERP image is a composite panorama stitched from **six perspective views** (front/back/left/right/top/bottom). Understanding this structure is essential for handling artifacts near seam boundaries.


**Sub-categories**: `primary_view`, `multi_view_visibility`, `seam_attribution`, `multi_object_relationship`, `shared_visibility`

### ② Object Distance Estimation

Evaluates quantitative and qualitative **depth reasoning**, moving beyond 2D heuristics (e.g., size as a proxy for distance) toward true 3D understanding.

**Sub-categories**: `depth_similarity`, `depth_binary`, `depth_compare`, `depth_triplet_farthest`, `distance_description`

### ③ Environment Identification

Evaluates high-level **scene understanding and contextual reasoning**, testing whether the model can classify environments based on object composition and architectural style.

**Sub-categories**: `env_binary_judgement`, `env_mcq`, `env_confusable_pair`, `env_scene_judgement`, `env_category_identification`, `env_attribute`

### ④ Relative Spatial Positioning

Assesses the model's ability to reconstruct accurate **3D spatial relationships** between objects, an inherently difficult task due to ERP distortions.

**Sub-categories**: `relpos_cardinal`, `relpos_binary`, `relpos_distance_straightline`, `relpos_distance_components`, `relpos_triplet_extreme`

### ⑤ Intrinsic Attribute Comparison

Probes the model's understanding of **intrinsic, view-independent physical properties** of objects (3D shape and size), requiring inference from 2D projections and depth.


**Sub-categories**: `volume_comparison`, `volume_binary`, `size_triplet_extreme`, `shape_flatness`, `shape_elongation`

## 📁 Data Structure

Each sample (`*_qa.json`) contains:

### `sampled_objects` (20 objects per image)

```json
{
  "label": "building",
  "bbox": [x1, y1, x2, y2],
  "depth": 12.5,
  "area": 15000,
  "primary_camera": "front",
  "visible_cameras": ["front", "left", "top"],
  "depth_stats": {
    "p20": 10.5,
    "p25": 11.0,
    "p50": 12.5,
    "p75": 14.0,
    "p80": 14.5,
    "iqr": 3.0
  },
  "bbox_3d": {
    "min_x": -5.0, "max_x": 5.0,
    "min_y": 0.0, "max_y": 10.0,
    "min_z": 8.0, "max_z": 15.0
  },
  "volume": 700.0,
  "centroid_3d": [0.0, 5.0, 11.5],
  "is_seam": true,
  "seam_types": ["crosses_left_back"],
  "is_polar": false
}
```

### `questions` (25 questions per image)

```json
{
  "major_category": "relative_position",
  "sub_category": "relpos_cardinal",
  "question_type": "open_ended",
  "question": "What is the spatial relationship of the building relative to the tree in the 3D world?",
  "answer": "The building is in front of and to the right of and above the tree.",
  "related_object_ids": [1, 5],
  "question_id": 1
}
```

### `visualizations/`

PNG visualizations for each question showing the relevant objects highlighted.

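
As a minimal sketch of how one such sample might be inspected, the snippet below parses a `*_qa.json` file and tallies its questions and seam-crossing objects. It assumes each file is a single JSON object with top-level `sampled_objects` and `questions` lists, as the examples above suggest. The helper names `load_sample` and `summarize_sample` are illustrative and not part of any official tooling.

```python
import json
from collections import Counter
from pathlib import Path


def load_sample(path):
    """Parse one `*_qa.json` file into a dict (assumed top-level JSON object)."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize_sample(sample):
    """Tally questions by category and type, and list objects flagged `is_seam`."""
    questions = sample.get("questions", [])
    objects = sample.get("sampled_objects", [])
    return {
        "num_objects": len(objects),
        "num_questions": len(questions),
        "by_category": dict(Counter(q["major_category"] for q in questions)),
        "by_type": dict(Counter(q["question_type"] for q in questions)),
        "seam_labels": [o["label"] for o in objects if o.get("is_seam")],
    }


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real file, using fields from the examples above.
    demo = {
        "sampled_objects": [{"label": "building", "is_seam": True}],
        "questions": [
            {"major_category": "relative_position", "question_type": "open_ended"}
        ],
    }
    print(summarize_sample(demo))
```

With a real file, `summarize_sample(load_sample("path/to/some_qa.json"))` would give the same kind of summary; the path here is a placeholder.
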
## 🌍 60 Diverse Environments

<details>
<summary>Click to expand full environment list</summary>

**Industrial & Infrastructure**
- AbandonedCable, AbandonedFactory, AbandonedFactory2, CarWelding, CoalMine, ConstructionSite, FactoryWeather, IndustrialHangar, OldIndustrialCity, Sewerage, UrbanConstruction

**Urban & City**
- CyberPunk, CyberPunkDowntown, Downtown, HongKong, JapaneseAlley, JapaneseCity, ModernCityDowntown, ModularNeighborhood, ModularNeighborhoodIntExt, ModUrbanCity, Rome, SoulCity, VictorianStreet

**Historical & Cultural**
- AncientTowns, Antiquity3D, CastleFortress, GothicIsland, HQWesternSaloon, MiddleEast, OldTownFall, OldTownNight, OldTownSummer, OldTownWinter, Ruins, WesternDesertTown

**Residential & Interior**
- AmericanDiner, ArchVizTinyHouseDay, ArchVizTinyHouseNight, CountryHouse, Hospital, House, Office, OldBrickHouseDay, OldBrickHouseNight, Prison, Restaurant, RetroOffice, Supermarket

**Nature & Special**
- AbandonedSchool, AmusementPark, Apocalyptic, DesertGasStation, Fantasy, NordicHarbor, Ocean, PolarSciFi, SeasideTown, WaterMillDay, WaterMillNight

</details>

## 🔬 Benchmark Results

We evaluated 14 state-of-the-art VLMs on our test set; the table below lists a subset alongside our model:

| Model | Total Acc. (%) | T/F (%) | MCQ (%) | OE (%) |
|:------|:--------------:|:-------:|:-------:|:------:|
| Qwen2.5-VL-7B | 49.34 | 65.19 | 57.24 | 6.39 |
| Qwen2.5-VL-32B | 42.70 | 62.47 | 44.96 | 8.36 |
| InternVL2.5-26B | 47.07 | 64.51 | 54.33 | 3.44 |
| Qwen3-VL-8B | 47.91 | 62.85 | 55.24 | 7.70 |
| DeepSeek-VL2-Base | 38.86 | 57.30 | 40.36 | 8.36 |
| **GRPO-Balanced (Ours)** | **52.93** | **68.78** | **58.90** | **14.83** |

**Key Findings:**

- Best zero-shot accuracy is only **49.34%**, revealing significant gaps in 3D spatial understanding
- Open-ended accuracy collapses to **< 9%** for all baselines
- Our GRPO-trained 7B model achieves **SoTA** performance, outperforming 32B models
- OE accuracy improves from 6.39% to 14.83% (**+132% relative gain**)

## 🚀 Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset.
# Repository ID as given in the original README; this page lists the
# dataset under guangmulizi/PanoEnv.
dataset = load_dataset("7zkk/PanoEnv")

# Access splits
train_data = dataset["train"]
val_data = dataset["val"]
test_data = dataset["test"]
```

### Example: Accessing a Sample

```python
sample = dataset["train"][0]
print(f"Environment: {sample['env']}")
print(f"Image ID: {sample['image_id']}")
print(f"Number of objects: {len(sample['sampled_objects'])}")
print(f"Number of questions: {len(sample['questions'])}")
```

### For RL Training (GRPO)

PanoEnv-QA is designed to support reinforcement learning with ground-truth-guided rewards:

```python
# Example reward routing based on question type.
# `mcq_matching_reward` and `spatial_reward` are not defined in this
# snippet; supply your own scoring functions with the same signature.
def get_reward(question_type, prediction, ground_truth):
    if question_type == "true_false":
        return 1.0 if prediction.lower() == ground_truth.lower() else 0.0
    elif question_type == "multiple_choice":
        return mcq_matching_reward(prediction, ground_truth)
    elif question_type == "open_ended":
        return spatial_reward(prediction, ground_truth)  # axis-wise matching
    raise ValueError(f"Unknown question type: {question_type}")
```

## 📐 Technical Details

### ERP to 3D Projection

For any pixel $(p_x, p_y)$ in an ERP image of size $W \times H$:

**Spherical coordinates:**

$$\lambda = \left(\frac{p_x}{W} - 0.5\right) \cdot 2\pi, \quad \phi = -\left(\frac{p_y}{H} - 0.5\right) \cdot \pi$$

**3D Cartesian coordinates:**

$$x = -d \cdot \cos(\phi) \cdot \sin(\lambda)$$

$$y = d \cdot \sin(\phi)$$

$$z = -d \cdot \cos(\phi) \cdot \cos(\lambda)$$

where $d$ is the depth value. The coordinate system is right-handed with +Y upward, +X rightward, and -Z forward.

## 📚 Citation

If you use PanoEnv-QA in your research, please cite:

```bibtex
@inproceedings{panoenv2026,
  title={PanoEnv-QA: A Large-Scale Geometry-Grounded Panoramic VQA Benchmark for 3D Spatial Intelligence},
  author={Zekai Lin and Xu Zheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```

## 🙏 Acknowledgments

This dataset is built upon [TartanAir](https://theairlab.org/tartanair-dataset/), a synthetic dataset providing precise 3D ground truth (depth and segmentation).

## 📄 License

This dataset is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
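
As a supplementary note to the ERP to 3D Projection formulas under Technical Details, the following is a minimal Python sketch that transcribes them directly. The function name `erp_pixel_to_3d` and its argument order are assumptions made for illustration, not part of the dataset's tooling, and `depth` is treated as the distance $d$ along the viewing ray, as the formulas imply.

```python
import math


def erp_pixel_to_3d(px, py, depth, width, height):
    """Map an ERP pixel (px, py) with depth d to (x, y, z), per the formulas above."""
    # Spherical coordinates: longitude in [-pi, pi], latitude in [-pi/2, pi/2].
    lam = (px / width - 0.5) * 2.0 * math.pi
    phi = -(py / height - 0.5) * math.pi
    # Cartesian coordinates, using the sign conventions stated in the text.
    x = -depth * math.cos(phi) * math.sin(lam)
    y = depth * math.sin(phi)
    z = -depth * math.cos(phi) * math.cos(lam)
    return x, y, z


if __name__ == "__main__":
    # The image centre maps onto the forward (-Z) axis at distance `depth`.
    print(erp_pixel_to_3d(1024, 512, 2.0, 2048, 1024))
```

Because the three components are a spherical parameterisation, the Euclidean norm of the returned point equals `depth` for every pixel, which gives a quick sanity check when applying the mapping to a full depth map.
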

Provider:

guangmulizi
