UAV-DualCog

Name: UAV-DualCog
Creator: maas
Published: 2026-04-16 17:34:07
License: No description available

ModelScope Community. Updated: 2026-04-16. Indexed: 2026-05-03.

Download link:

https://modelscope.cn/datasets/Lozumi/UAV-DualCog

Resource overview:

# UAV-DualCog Dataset Repository Guide

Last updated: 2026-04-08

This is the official dataset repository guide for **UAV-DualCog**. The corresponding paper is currently under peer review, and this dataset release is made public under a single-blind policy.

## 1. What UAV-DualCog Is

UAV-DualCog is a drone-centric multimodal reasoning benchmark for **dual cognition**: self-aware reasoning and environment-aware reasoning under aerial observation.

The release targets two complementary goals:

- benchmark evaluation for multimodal foundation models,
- reusable structured assets for downstream dataset users.

The benchmark is organized around one primary capability axis and one observation axis:

- dual cognition:
  - self-aware reasoning,
  - environment-aware reasoning;
- media:
  - image tasks,
  - video tasks.

The key point is that **dual cognition** is the capability being evaluated, while **image and video** are the media used to expose that capability.

This design yields a benchmark that tests not only answer selection but also whether a model can align its reasoning with spatial or temporal evidence.

### 1.1 Quick Start

Recommended entry points:

1. Read this dataset card to understand the release scope and file contracts.
2. Use the benchmark website to inspect task definitions, examples, and leaderboard views:
   - https://uav-dualcog.lozumi.com/
3. Use the official code repository for loading, preprocessing, and evaluation:
   - https://github.com/SmartDianLab/UAV-DualCog
4. Use the AerialVLN simulator package when reproducing simulator-backed collection or rendering:
   - https://www.kaggle.com/datasets/shuboliu/aerialvln-simulators

For detailed benchmark definitions, construction details, and usage instructions, the benchmark website should be treated as the primary external reference.

## 2. Benchmark Scope

Current core release:

- 12 released benchmark scenes,
- 512 validated landmarks,
- 4096 image QA samples,
- 2048 video QA samples,
- 4 image task families,
- 2 video task families.

All currently released benchmark task files are test-only. The repository does not currently expose public `train` or `validation` splits for task evaluation.

The released 12-scene benchmark subset is drawn from a larger reviewed scene pool. In the public repository, the benchmark task layer and the scene asset layer do not have identical scope:

- `task_data` currently corresponds to the 12-scene benchmark release;
- `scene_data` covers the full set of 18 reviewed scenes that have public geometry and landmark-review assets.

This means the repository exposes a broader scene asset pool than the current benchmark task split. Scene-level geometric assets and reviewed landmark assets are provided so that users can inspect the benchmark context rather than treating the task files as opaque black boxes.

For clarity:

- `scene_data` is a supporting public asset release rather than a training split;
- `task_data` is a benchmark evaluation release and should be treated as test data.

## 3. Capability Definition

### 3.1 Self-aware reasoning

Self-aware reasoning evaluates whether a UAV agent can reason about itself:

- where it is relative to a landmark,
- what it will observe after a described motion,
- what behavior it is executing,
- when that behavior occurs.

### 3.2 Environment-aware reasoning

Environment-aware reasoning evaluates whether a UAV agent can reason about the external world from its current motion context:

- where the target landmark is relative to the UAV,
- which action is appropriate given the landmark-relative situation,
- how many times a landmark becomes visible in a mission,
- during which time intervals the landmark is visible.

### 3.3 Evidence-aware evaluation

UAV-DualCog explicitly separates:

- semantic correctness,
- evidence grounding.

For image tasks, a model is evaluated on both:

- selecting the correct answer option,
- localizing the landmark with a normalized bounding box.

For video tasks, a model is evaluated on both:

- predicting the correct semantic answer,
- localizing the relevant time interval(s).

This is one of the core benchmark design principles: answer-only success is not sufficient if the supporting spatial or temporal evidence is incorrect.

## 4. Task Families

### 4.1 Image tasks (Stage 4)

The image branch contains four task families. Each released landmark contributes both `4way` and `8way` difficulty variants.

1. `self_where`
   - Canonical display name: `Landmark-Relative Position Reasoning`
   - Cognition: self-aware
   - Input: one landmark-centric reference image plus one egocentric query observation
   - Output: one answer option and one landmark bounding box on the query image
   - Core question: where is the UAV relative to the landmark
2. `self_what`
   - Canonical display name: `Future Observation Prediction`
   - Cognition: self-aware
   - Input: one reference image plus a future-view multiple-choice set
   - Output: one answer option
   - Core question: which future observation matches the described motion outcome
3. `env_where`
   - Canonical display name: `Self-Relative Position Reasoning`
   - Cognition: environment-aware
   - Input: one current egocentric observation
   - Output: one answer option and one landmark bounding box on the query image
   - Core question: where is the landmark relative to the UAV
4. `env_how`
   - Canonical display name: `Landmark-Driven Action Decision`
   - Cognition: environment-aware
   - Input: one current egocentric observation
   - Output: one answer option and one landmark bounding box on the query image
   - Core question: what action decision is appropriate under the current landmark-relative situation

### 4.2 Video tasks (Stage 3)

The video branch contains two task families.

1. `self_instance_recognition_joint`
   - Canonical display name: `Flight Behavior Recognition and Temporal Localization`
   - Cognition: self-aware
   - Input: task video plus mission-conditioned context
   - Output: behavior option(s) and temporal interval(s)
   - Public reporting also derives:
     - composite-level semantic accuracy,
     - atomic-level semantic accuracy,
     - temporal localization quality.
2. `env_visibility_reasoning`
   - Canonical display name: `Landmark Visibility Counting and Interval Reasoning`
   - Cognition: environment-aware
   - Input: task video plus target landmark reference
   - Output: visibility count and visible time interval(s)

### 4.3 Task summary table

| Task ID | Display name | Modality | Cognition | Main input | Main output |
| --- | --- | --- | --- | --- | --- |
| `self_where` | Landmark-Relative Position Reasoning | image | self-aware | reference image + query observation | option + bbox |
| `self_what` | Future Observation Prediction | image | self-aware | reference image + future-view options | option |
| `env_where` | Self-Relative Position Reasoning | image | environment-aware | query observation | option + bbox |
| `env_how` | Landmark-Driven Action Decision | image | environment-aware | query observation | option + bbox |
| `self_instance_recognition_joint` | Flight Behavior Recognition and Temporal Localization | video | self-aware | task video + mission context | option(s) + interval(s) |
| `env_visibility_reasoning` | Landmark Visibility Counting and Interval Reasoning | video | environment-aware | task video + landmark context | count + interval(s) |

## 5. Evaluation Objects and Metrics

### 5.1 Image tasks

Image-task prediction objects contain:

- `answer_option_id`
- optionally `bbox_xyxy_norm`

Main metrics include:

- option accuracy,
- `BBox Acc@50IoU`,
- mean IoU.

### 5.2 Video tasks

Video-task prediction objects contain:

- answer option(s) or behavior label(s),
- interval(s) in seconds,
- for visibility tasks, visible count.

Main metrics include:

- semantic correctness,
- temporal IoU or interval agreement,
- count accuracy for visibility reasoning.

The public leaderboard may present aggregated summary views for readability, but the underlying task manifests and experiment outputs retain the task-level prediction structure.

## 6. Repository Scope and Boundary

The public repository is the release-facing layer of the dataset. It includes:

- scene-level geometry and reviewed landmarks,
- released benchmark task assets,
- released manifests and render requests,
- benchmark-ready media references.

The scope is asymmetric by design:

- `scene_data` contains the complete 18-scene reviewed scene release;
- `task_data` currently contains the 12-scene benchmark task release.

The released task layer is also split-asymmetric in another sense:

- the repository currently provides public benchmark test data only;
- it does not provide public train or validation task splits.

It intentionally excludes many internal generation-time artifacts, including:

- internal logs,
- temporary caches,
- internal experiment workspaces,
- internal review-only intermediate files not needed for public reproduction.

## 7. Top-Level Layout

The public repository is conceptually split into two release layers.

```text
scene_data/
  airsim_env_*/
    pcd_map/
    landmarks_raw/
    landmarks_review/
task_data/
  airsim_env_*/
    image_tasks/
      assets/
      manifests/
      render_requests/
      selections/
    video_tasks/
      missions/
      datasets/
      selections/
```

### 7.1 `scene_data`

This layer stores scene-level assets and landmark review outputs.

Important release note:

- `scene_data` is not restricted to the 12 benchmark test scenes.
- The current public release contains all 18 reviewed scenes with available scene geometry and landmark-review outputs.

Subdirectories:

- `pcd_map/`
  - fused point-cloud assets and geometry support files.
- `landmarks_raw/`
  - pre-review landmark candidate outputs.
- `landmarks_review/`
  - reviewed landmark instances and downstream-consumable landmark metadata.

### 7.2 `task_data`

This layer stores benchmark task artifacts.

- `image_tasks/`
  - Stage 4 image QA assets, manifests, and render requests.
- `video_tasks/`
  - Stage 3 mission-level task videos, final-task metadata, and released manifests.

## 8. Data Contracts

The following files are the main public contracts that downstream users should treat as stable interfaces.

### 8.1 Scene review contract

`scene_data/<scene>/landmarks_review/<scene>.valid_instances.json`

This is the reviewed landmark handoff file used by later stages. It provides:

- stable landmark instance ids,
- reviewed category/subcategory/description fields,
- reference RGB view assets,
- geometry and instance context needed for task generation.

### 8.2 Image-task manifest contract

`task_data/<scene>/image_tasks/manifests/<scene>.latest_manifest.json`

Top-level fields include:

- generation metadata,
- scene id and engine,
- released task types and difficulty sets,
- `samples`.

Each sample contains fields such as:

- `sample_id`
- `landmark_id`
- `task_family`
- `task_group`
- `difficulty`
- `reference_image`
- `reference_image_with_bbox`
- `reference_bbox_xyxy_norm`
- `target_image`
- `answer_bbox_xyxy_norm`
- `task_type`
- `label_options`
- `answer_option_id`
- `prompt_text`
- `user_prompt`
- `system_prompt`

This contract is sufficient for benchmark inference on image tasks.
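
As a rough illustration of how this contract might be consumed, the sketch below walks the `samples` list of a manifest and scores one prediction by option match and bounding-box IoU. It is not the official evaluation code. The helper names `bbox_iou` and `score_image_sample`, the inline toy manifest, and its `sample_id` are hypothetical; the prediction layout follows the `answer_option_id` / `bbox_xyxy_norm` fields named in Section 5.1, and the 0.5 threshold is an assumption read off the `BBox Acc@50IoU` metric name. Treating a missing `answer_bbox_xyxy_norm` as "no bbox scoring" (for example for `self_what`) is also an assumption.

```python
import json


def bbox_iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes in normalized image coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, box_a[2] - box_a[0]) * max(0.0, box_a[3] - box_a[1])
    area_b = max(0.0, box_b[2] - box_b[0]) * max(0.0, box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_image_sample(sample, prediction, iou_threshold=0.5):
    """Score one prediction against one manifest sample (hypothetical helper)."""
    result = {
        "option_correct": prediction.get("answer_option_id") == sample["answer_option_id"]
    }
    gt_box = sample.get("answer_bbox_xyxy_norm")
    if gt_box is not None:  # assumption: samples without a GT box skip bbox scoring
        pred_box = prediction.get("bbox_xyxy_norm")
        iou = bbox_iou(pred_box, gt_box) if pred_box is not None else 0.0
        result["iou"] = iou
        result["bbox_hit"] = iou >= iou_threshold
    return result


# Toy manifest standing in for <scene>.latest_manifest.json (values are illustrative).
manifest = json.loads("""
{
  "samples": [
    {
      "sample_id": "toy_sample_000001",
      "task_type": "self_where",
      "difficulty": "4way",
      "answer_option_id": "D",
      "answer_bbox_xyxy_norm": [0.31, 0.27, 0.58, 0.76]
    }
  ]
}
""")

prediction = {"answer_option_id": "D", "bbox_xyxy_norm": [0.30, 0.25, 0.60, 0.75]}

for sample in manifest["samples"]:
    print(sample["sample_id"], score_image_sample(sample, prediction))
```

With a real release, the toy manifest would be replaced by `json.load` on the manifest file, and per-sample results would be averaged into option accuracy, bbox hit rate, and mean IoU.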

Representative sample shape:

```json
{
  "sample_id": "env_7_20_120_self_shared_4way_000001_where",
  "task_type": "self_where",
  "task_group": "self-aware",
  "difficulty": "4way",
  "landmark_id": "20_120",
  "reference_image_with_bbox": "task_data/airsim_env_7/image_tasks/assets/reference_bbox/20_120/....jpg",
  "target_image": "scene_data/airsim_env_7/landmarks_raw/rgb_views/20_120/....jpg",
  "label_options": [
    {"option_id": "A", "label": "..."},
    {"option_id": "B", "label": "..."}
  ],
  "answer_option_id": "D",
  "answer_bbox_xyxy_norm": [0.31, 0.27, 0.58, 0.76]
}
```

### 8.3 Video-task manifest contract

`task_data/<scene>/video_tasks/datasets/<scene>.latest_manifest.json`

Top-level fields include:

- generation metadata,
- scene id and engine,
- released forms,
- task-group flags,
- `samples`,
- manifest-level `summary`.

Each sample contains fields such as:

- `sample_id`
- `form`
- `task_group`
- `task_name`
- `task_display_name`
- `mission_id`
- `mission_family`
- `landmark_id`
- `reference_image_with_bbox`
- `overview_image`
- `keyframe_board_image`
- `video_path`
- `video_web_path`
- `fps`
- `frame_count`
- `flight_description`
- `visible_count`
- `visible_intervals_sec`
- `difficulty_band`
- `choice_options`
- `answer_option_ids`
- `answer_items`

This contract is the benchmark-facing video task interface.
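
To make the interval fields more concrete, here is a minimal sketch of a set-level temporal IoU between predicted intervals and `visible_intervals_sec`, plus a helper that picks a playable video path. It assumes intervals are objects with `start_sec` and `end_sec` keys, as in this guide's representative video sample. The function names (`merge_intervals`, `temporal_iou`, `playable_video_path`) are hypothetical, and the official evaluator may define temporal agreement differently (for example, per-interval matching), so this should be read as one plausible formulation rather than the benchmark's metric. The path helper follows the guide's rule of falling back to `video_path` when `video_web_path` is empty.

```python
def merge_intervals(intervals):
    """Sort (start, end) pairs and merge overlaps; drops empty intervals."""
    merged = []
    for start, end in sorted(intervals):
        if end <= start:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Total covered duration of a list of disjoint [start, end] intervals."""
    return sum(end - start for start, end in intervals)


def temporal_iou(pred_intervals, gt_intervals):
    """Set-level IoU between two lists of {"start_sec", "end_sec"} objects."""
    pred = merge_intervals((i["start_sec"], i["end_sec"]) for i in pred_intervals)
    gt = merge_intervals((i["start_sec"], i["end_sec"]) for i in gt_intervals)
    # Both lists are disjoint after merging, so pairwise overlaps can be summed.
    inter = sum(
        max(0.0, min(p_end, g_end) - max(p_start, g_start))
        for p_start, p_end in pred
        for g_start, g_end in gt
    )
    union = total_length(pred) + total_length(gt) - inter
    return inter / union if union > 0 else 0.0


def playable_video_path(sample):
    """Prefer the web derivative; fall back to the main task video if it is empty."""
    return sample.get("video_web_path") or sample["video_path"]


# Toy sample and prediction (values are illustrative, not taken from the release).
sample = {
    "video_path": "task_rgb.mp4",
    "video_web_path": "",
    "visible_count": 1,
    "visible_intervals_sec": [{"start_sec": 0.0, "end_sec": 2.7}],
}
predicted = {
    "visible_count": 1,
    "visible_intervals_sec": [{"start_sec": 0.5, "end_sec": 3.0}],
}

print("video:", playable_video_path(sample))
print("count correct:", predicted["visible_count"] == sample["visible_count"])
print("temporal IoU:", temporal_iou(predicted["visible_intervals_sec"],
                                    sample["visible_intervals_sec"]))
```

Merging before comparison keeps the score well defined when a model emits overlapping or unsorted intervals.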

Representative sample shape:

```json
{
  "sample_id": "env_7_batch_env_7_10_55_atomic_0075_self_instance_recognition_joint_000001",
  "form": "self_instance_recognition_joint",
  "task_group": "self-state",
  "mission_id": "batch_env_7_10_55_atomic_0075",
  "landmark_id": "10_55",
  "reference_image_with_bbox": "task_data/airsim_env_7/video_tasks/cache/assets/reference_bbox/10_55/....jpg",
  "video_path": "task_data/airsim_env_7/video_tasks/missions/.../final_task/task_rgb.mp4",
  "video_web_path": "task_data/airsim_env_7/video_tasks/missions/.../final_task/task_rgb_web.mp4",
  "fps": 5,
  "frame_count": 157,
  "visible_count": 1,
  "visible_intervals_sec": [{"start_sec": 0.0, "end_sec": 2.7}],
  "difficulty_band": "easy",
  "choice_options": [
    {"option_id": "A", "label": "..."},
    {"option_id": "B", "label": "..."}
  ],
  "answer_option_ids": ["C"],
  "answer_items": [
    {"option_id": "C", "label": "...", "intervals_sec": [{"start_sec": 1.2, "end_sec": 6.8}]}
  ]
}
```

### 8.4 Mission-level Stage 3 contract

`task_data/<scene>/video_tasks/missions/<mission_id>/final_task/task_data.json`

This file is the mission-level ground-truth contract behind Stage 3 tasks. It contains:

- `video`
  - media paths,
  - frame manifests,
  - fps,
  - frame counts,
  - video dimensions,
  - capture dimensions;
- `target_presence`
  - frame-level or interval-level target presence information;
- `task_tracks`
  - task-specific supervision for:
    - `environmental_awareness`,
    - `self_state_awareness`.

This file is the correct entry point when a user needs mission-level temporal supervision rather than only released sample-level manifests.

In practice:

- use `video_tasks/datasets/<scene>.latest_manifest.json` for benchmark inference and leaderboard-style evaluation;
- use `missions/<mission_id>/final_task/task_data.json` when mission-level temporal supervision or frame-level inspection is needed.

## 9. Media and Path Semantics

Image and video paths stored in manifests are release-facing references, not arbitrary internal cache paths.

For Stage 4:

- `reference_image_with_bbox` points to the released reference image with GT bbox overlay,
- `target_image` points to the released query observation.

Depending on task subtype and release path, Stage 4 media may point either to:

- released task assets under `task_data/.../image_tasks/assets/...`, or
- scene-level source views under `scene_data/.../landmarks_raw/rgb_views/...`.

For Stage 3:

- `video_path` points to the released main task video,
- `video_web_path` points to a web-playable derivative when available,
- `reference_image_with_bbox`, `overview_image`, and `keyframe_board_image` provide auxiliary evidence views,
- `task_data.json -> video.frames_manifest` and `frame_index_map` support frame-level inspection.

If `video_web_path` is empty for a given sample, downstream users should fall back to `video_path`.

## 10. Usage and Reproduction Pointers

This repository guide intentionally focuses on **release scope and data contracts**. To avoid divergence and duplicated maintenance, detailed operational steps (environment setup, stage-by-stage commands, benchmark execution, and evaluation scripts) are not repeated here.

Please use the following as the canonical operational references:

- Benchmark website (recommended reading order and Usage page):
  - https://uav-dualcog.lozumi.com/
  - https://uav-dualcog.lozumi.com/usage/
- Official code repository (latest runnable commands and config templates):
  - https://github.com/SmartDianLab/UAV-DualCog
- Simulator package used by construction-stage reproduction:
  - https://www.kaggle.com/datasets/shuboliu/aerialvln-simulators

Practical split for external users:

- Use **this dataset guide** for file contracts, task semantics, and manifest field definitions.
- Use **website + GitHub** for concrete execution instructions and reproducibility workflows.

## 11. Benchmark Provenance

The public release is produced by the four-stage UAV-DualCog construction pipeline:

- Stage 1: scene point-cloud collection and fusion,
- Stage 2: landmark mining, review, and structured annotation,
- Stage 3: behavior-driven mission generation and video task construction,
- Stage 4: landmark-centered image QA generation.

The benchmark website provides:

- task explanations,
- prompt templates,
- examples,
- leaderboard views,
- analysis pages.

Official benchmark site:

- https://uav-dualcog.lozumi.com/

Official code repository:

- https://github.com/SmartDianLab/UAV-DualCog

## 12. Practical Notes for External Users

- Field names should be consumed in their canonical JSON form.
- Task ids such as `self_where` or `env_visibility_reasoning` should be treated as stable benchmark identifiers.
- Display names on the website are reader-facing aliases; manifests retain machine-facing ids.
- Some repository paths may differ slightly across mirrors or release bundles. The canonical structure is the contract described in this guide.
- For actual loading and benchmark evaluation, prefer the official GitHub implementation instead of reimplementing parsers from scratch:
  - https://github.com/SmartDianLab/UAV-DualCog
- For detailed benchmark definitions, construction explanations, and usage walkthroughs, prefer the public benchmark website:
  - https://uav-dualcog.lozumi.com/
- For simulator-backed reproduction, use the released AerialVLN simulator package:
  - https://www.kaggle.com/datasets/shuboliu/aerialvln-simulators

## 13. Citation and License

- License: follow the repository card and platform metadata for the active release.
- Citation: cite the UAV-DualCog dataset release and record the repository version/date used in evaluation.
- Benchmark-facing supplementary explanations are maintained at:
  - https://uav-dualcog.lozumi.com/
- Official loading and evaluation code is maintained at:
  - https://github.com/SmartDianLab/UAV-DualCog
- Simulator dependency for reproduction is maintained at:
  - https://www.kaggle.com/datasets/shuboliu/aerialvln-simulators

Provider:

maas

Created:

2026-04-06
