UAV-DualCog
Source: ModelScope community. Updated 2026-04-16; indexed 2026-05-03.
Download link: https://modelscope.cn/datasets/Lozumi/UAV-DualCog
Resource description:
# UAV-DualCog Dataset Repository Guide
Last updated: 2026-04-08
This is the official dataset repository guide for **UAV-DualCog**. The corresponding paper is currently under peer review, and this dataset release is made public under a single-blind policy.
## 1. What UAV-DualCog Is
UAV-DualCog is a drone-centric multimodal reasoning benchmark for **dual cognition**: self-aware
reasoning and environment-aware reasoning under aerial observation. The release targets two
complementary goals:
- benchmark evaluation for multimodal foundation models,
- reusable structured assets for downstream dataset users.
The benchmark is organized around one primary capability axis and one observation axis:
- dual cognition:
  - self-aware reasoning,
  - environment-aware reasoning;
- media:
  - image tasks,
  - video tasks.
The key point is that **dual cognition** is the capability being evaluated, while **image and
video** are the media used to expose that capability. As a result, the benchmark tests not only
answer selection but also whether a model can align its reasoning with spatial evidence (image
tasks) or temporal evidence (video tasks).
### 1.1 Quick Start
Recommended entry points:
1. Read this dataset card to understand the release scope and file contracts.
2. Use the benchmark website to inspect task definitions, examples, and leaderboard views:
   - https://uav-dualcog.lozumi.com/
3. Use the official code repository for loading, preprocessing, and evaluation:
   - https://github.com/SmartDianLab/UAV-DualCog
4. Use the AerialVLN simulator package when reproducing simulator-backed collection or rendering:
   - https://www.kaggle.com/datasets/shuboliu/aerialvln-simulators
For detailed benchmark definitions, construction details, and usage instructions, the benchmark
website should be treated as the primary external reference.
## 2. Benchmark Scope
Current core release:
- 12 released benchmark scenes,
- 512 validated landmarks,
- 4096 image QA samples,
- 2048 video QA samples,
- 4 image task families,
- 2 video task families.
All currently released benchmark task files are test-only. The repository does not currently expose public `train` or `validation` splits for task evaluation.
The released 12-scene benchmark subset is drawn from a larger reviewed scene pool. In the public repository, the benchmark task layer and the scene asset layer do not have identical scope:
- `task_data` currently corresponds to the 12-scene benchmark release;
- `scene_data` covers the full set of 18 reviewed scenes that have public geometry and landmark-review assets.
This means the repository exposes a broader scene asset pool than the current benchmark task split. Scene-level geometric assets and reviewed landmark assets are provided so that users can inspect the benchmark context rather than treating the task files as opaque black boxes.
For clarity:
- `scene_data` is a supporting public asset release rather than a training split;
- `task_data` is a benchmark evaluation release and should be treated as test data.
## 3. Capability Definition
### 3.1 Self-aware reasoning
Self-aware reasoning evaluates whether a UAV agent can reason about itself:
- where it is relative to a landmark,
- what it will observe after a described motion,
- what behavior it is executing,
- when that behavior occurs.
### 3.2 Environment-aware reasoning
Environment-aware reasoning evaluates whether a UAV agent can reason about the external world from its current motion context:
- where the target landmark is relative to the UAV,
- which action is appropriate given the landmark-relative situation,
- how many times a landmark becomes visible in a mission,
- during which time intervals the landmark is visible.
### 3.3 Evidence-aware evaluation
UAV-DualCog explicitly separates:
- semantic correctness,
- evidence grounding.
For image tasks, a model is evaluated on both:
- selecting the correct answer option,
- localizing the landmark with a normalized bounding box.
For video tasks, a model is evaluated on both:
- predicting the correct semantic answer,
- localizing the relevant time interval(s).
This is one of the core benchmark design principles: answer-only success is not sufficient if the supporting spatial or temporal evidence is incorrect.
## 4. Task Families
### 4.1 Image tasks (Stage 4)
The image branch contains four task families. Each released landmark contributes both `4way` and `8way` difficulty variants.
1. `self_where`
   - Canonical display name: `Landmark-Relative Position Reasoning`
   - Cognition: self-aware
   - Input: one landmark-centric reference image plus one egocentric query observation
   - Output: one answer option and one landmark bounding box on the query image
   - Core question: where is the UAV relative to the landmark
2. `self_what`
   - Canonical display name: `Future Observation Prediction`
   - Cognition: self-aware
   - Input: one reference image plus a future-view multiple-choice set
   - Output: one answer option
   - Core question: which future observation matches the described motion outcome
3. `env_where`
   - Canonical display name: `Self-Relative Position Reasoning`
   - Cognition: environment-aware
   - Input: one current egocentric observation
   - Output: one answer option and one landmark bounding box on the query image
   - Core question: where is the landmark relative to the UAV
4. `env_how`
   - Canonical display name: `Landmark-Driven Action Decision`
   - Cognition: environment-aware
   - Input: one current egocentric observation
   - Output: one answer option and one landmark bounding box on the query image
   - Core question: what action decision is appropriate under the current landmark-relative situation
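As a consistency check on the release figures in Section 2: assuming every validated landmark contributes one sample per image task family and per difficulty variant (an inference from the published counts, not a statement made by the release itself), the image sample count works out as follows.

$$
512 \ \text{landmarks} \times 4 \ \text{task families} \times 2 \ \text{difficulty variants} = 4096 \ \text{image QA samples}
$$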
### 4.2 Video tasks (Stage 3)
The video branch contains two task families.
1. `self_instance_recognition_joint`
   - Canonical display name: `Flight Behavior Recognition and Temporal Localization`
   - Cognition: self-aware
   - Input: task video plus mission-conditioned context
   - Output: behavior option(s) and temporal interval(s)
   - Public reporting also derives:
     - composite-level semantic accuracy,
     - atomic-level semantic accuracy,
     - temporal localization quality.
2. `env_visibility_reasoning`
   - Canonical display name: `Landmark Visibility Counting and Interval Reasoning`
   - Cognition: environment-aware
   - Input: task video plus target landmark reference
   - Output: visibility count and visible time interval(s)
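To make the relationship between a visibility count and visible time intervals concrete, the sketch below turns a per-frame visibility sequence into intervals in seconds. The per-frame boolean list and the function name `flags_to_intervals` are hypothetical and used only for illustration; they are not the released annotation format. The output dictionaries reuse the `start_sec` / `end_sec` keys that appear in the video manifest samples.

```python
from typing import Dict, List, Sequence


def flags_to_intervals(visible: Sequence[bool], fps: float) -> List[Dict[str, float]]:
    """Group consecutive visible frames into [start_sec, end_sec) intervals.

    `visible` is a hypothetical per-frame visibility sequence (assumption);
    frame i is taken to cover the time span [i / fps, (i + 1) / fps).
    """
    intervals: List[Dict[str, float]] = []
    start = None
    for idx, flag in enumerate(visible):
        if flag and start is None:
            start = idx  # a visible run begins at this frame
        elif not flag and start is not None:
            intervals.append({"start_sec": start / fps, "end_sec": idx / fps})
            start = None
    if start is not None:
        # the run is still open when the video ends
        intervals.append({"start_sec": start / fps, "end_sec": len(visible) / fps})
    return intervals


flags = [True, True, False, False, True]
runs = flags_to_intervals(flags, fps=5)
visible_count = len(runs)  # one count per contiguous visible run
```

Under this reading, the count is simply the number of contiguous visible runs. The official benchmark may apply additional rules (for example, minimum run length), so treat this only as an intuition aid.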
### 4.3 Task summary table
| Task ID | Display name | Modality | Cognition | Main input | Main output |
| --- | --- | --- | --- | --- | --- |
| `self_where` | Landmark-Relative Position Reasoning | image | self-aware | reference image + query observation | option + bbox |
| `self_what` | Future Observation Prediction | image | self-aware | reference image + future-view options | option |
| `env_where` | Self-Relative Position Reasoning | image | environment-aware | query observation | option + bbox |
| `env_how` | Landmark-Driven Action Decision | image | environment-aware | query observation | option + bbox |
| `self_instance_recognition_joint` | Flight Behavior Recognition and Temporal Localization | video | self-aware | task video + mission context | option(s) + interval(s) |
| `env_visibility_reasoning` | Landmark Visibility Counting and Interval Reasoning | video | environment-aware | task video + landmark context | count + interval(s) |
## 5. Evaluation Objects and Metrics
### 5.1 Image tasks
Image-task prediction objects contain:
- `answer_option_id`
- optionally `bbox_xyxy_norm`
Main metrics include:
- option accuracy,
- `BBox Acc@50IoU`,
- mean IoU.
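The image metrics listed above can be sketched as follows. This is a minimal illustration, not the official scorer: the function names are hypothetical, the threshold comparison (`>=`) and the choice to score a missing predicted box as IoU 0 are assumptions, and the gold field names are taken from the image manifest contract in this guide. For reported results, use the official evaluation code.

```python
from typing import Dict, List, Sequence


def bbox_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two [x1, y1, x2, y2] boxes in normalized image coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_image_predictions(preds: List[dict], golds: List[dict],
                            iou_threshold: float = 0.5) -> Dict[str, float]:
    """Score index-aligned predictions against gold samples."""
    option_hits = 0
    ious: List[float] = []
    for pred, gold in zip(preds, golds):
        option_hits += int(pred.get("answer_option_id") == gold["answer_option_id"])
        gold_box = gold.get("answer_bbox_xyxy_norm")
        if gold_box is None:
            continue  # samples without a gold box do not enter bbox metrics
        pred_box = pred.get("bbox_xyxy_norm")
        # Assumption: a missing predicted box counts as IoU 0.
        ious.append(bbox_iou(pred_box, gold_box) if pred_box else 0.0)
    n = len(golds)
    return {
        "option_accuracy": option_hits / n if n else 0.0,
        "bbox_acc_at_50iou": sum(i >= iou_threshold for i in ious) / len(ious) if ious else 0.0,
        "mean_iou": sum(ious) / len(ious) if ious else 0.0,
    }
```

Keeping option accuracy and box metrics as separate numbers mirrors the guide's separation of semantic correctness from evidence grounding.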
### 5.2 Video tasks
Video-task prediction objects contain:
- answer option(s) or behavior label(s),
- interval(s) in seconds,
- for visibility tasks, visible count.
Main metrics include:
- semantic correctness,
- temporal IoU or interval agreement,
- count accuracy for visibility reasoning.
The public leaderboard may present aggregated summary views for readability, but the underlying task manifests and experiment outputs retain the task-level prediction structure.
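One common way to compute temporal IoU between predicted and gold interval sets is to compare the total overlapped duration with the total covered duration. The sketch below takes that set-level approach; it is an assumption about the metric, and the official scorer may instead match intervals one by one or apply other rules. The function names are hypothetical, while the `start_sec` / `end_sec` keys follow the manifest samples.

```python
from typing import Dict, List


def merge_intervals(intervals: List[Dict[str, float]]) -> List[List[float]]:
    """Sort intervals and merge any that overlap or touch; drop empty ones."""
    spans = sorted(
        (iv["start_sec"], iv["end_sec"])
        for iv in intervals
        if iv["end_sec"] > iv["start_sec"]
    )
    merged: List[List[float]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def temporal_iou(pred: List[Dict[str, float]], gold: List[Dict[str, float]]) -> float:
    """Set-level temporal IoU: overlapped seconds / covered seconds."""
    p, g = merge_intervals(pred), merge_intervals(gold)
    inter = 0.0
    i = j = 0
    while i < len(p) and j < len(g):
        lo, hi = max(p[i][0], g[j][0]), min(p[i][1], g[j][1])
        if hi > lo:
            inter += hi - lo
        # advance whichever interval ends first
        if p[i][1] < g[j][1]:
            i += 1
        else:
            j += 1
    union = sum(e - s for s, e in p) + sum(e - s for s, e in g) - inter
    return inter / union if union > 0 else 0.0
```

For example, a prediction covering 0-2 s and 4-6 s against a gold interval of 1-5 s overlaps for 2 s out of 6 s covered, giving an IoU of one third.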
## 6. Repository Scope and Boundary
The public repository is the release-facing layer of the dataset. It includes:
- scene-level geometry and reviewed landmarks,
- released benchmark task assets,
- released manifests and render requests,
- benchmark-ready media references.
The scope is asymmetric by design:
- `scene_data` contains the complete 18-scene reviewed scene release;
- `task_data` currently contains the 12-scene benchmark task release.
The task layer is also asymmetric in its splits:
- the repository currently provides public benchmark test data only;
- it does not provide public train or validation task splits.
It intentionally excludes many internal generation-time artifacts, including:
- internal logs,
- temporary caches,
- internal experiment workspaces,
- internal review-only intermediate files not needed for public reproduction.
## 7. Top-Level Layout
The public repository is conceptually split into two release layers.
```text
scene_data/
  airsim_env_*/
    pcd_map/
    landmarks_raw/
    landmarks_review/
task_data/
  airsim_env_*/
    image_tasks/
      assets/
      manifests/
      render_requests/
      selections/
    video_tasks/
      missions/
      datasets/
      selections/
```
### 7.1 `scene_data`
This layer stores scene-level assets and landmark review outputs.
Important release note:
- `scene_data` is not restricted to the 12 benchmark test scenes.
- The current public release contains all 18 reviewed scenes with available scene geometry and landmark-review outputs.
- `pcd_map/`
  - fused point-cloud assets and geometry support files.
- `landmarks_raw/`
  - pre-review landmark candidate outputs.
- `landmarks_review/`
  - reviewed landmark instances and downstream-consumable landmark metadata.
### 7.2 `task_data`
This layer stores benchmark task artifacts.
- `image_tasks/`
  - Stage 4 image QA assets, manifests, and render requests.
- `video_tasks/`
  - Stage 3 mission-level task videos, final-task metadata, and released manifests.
## 8. Data Contracts
The following files are the main public contracts that downstream users should treat as stable interfaces.
### 8.1 Scene review contract
`scene_data/<scene>/landmarks_review/<scene>.valid_instances.json`
This is the reviewed landmark handoff file used by later stages. It provides:
- stable landmark instance ids,
- reviewed category/subcategory/description fields,
- reference RGB view assets,
- geometry and instance context needed for task generation.
### 8.2 Image-task manifest contract
`task_data/<scene>/image_tasks/manifests/<scene>.latest_manifest.json`
Top-level fields include:
- generation metadata,
- scene id and engine,
- released task types and difficulty sets,
- `samples`.
Each sample contains fields such as:
- `sample_id`
- `landmark_id`
- `task_family`
- `task_group`
- `difficulty`
- `reference_image`
- `reference_image_with_bbox`
- `reference_bbox_xyxy_norm`
- `target_image`
- `answer_bbox_xyxy_norm`
- `task_type`
- `label_options`
- `answer_option_id`
- `prompt_text`
- `user_prompt`
- `system_prompt`
This contract is sufficient for benchmark inference on image tasks.
Representative sample shape:
```json
{
  "sample_id": "env_7_20_120_self_shared_4way_000001_where",
  "task_type": "self_where",
  "task_group": "self-aware",
  "difficulty": "4way",
  "landmark_id": "20_120",
  "reference_image_with_bbox": "task_data/airsim_env_7/image_tasks/assets/reference_bbox/20_120/....jpg",
  "target_image": "scene_data/airsim_env_7/landmarks_raw/rgb_views/20_120/....jpg",
  "label_options": [
    {"option_id": "A", "label": "..."},
    {"option_id": "B", "label": "..."}
  ],
  "answer_option_id": "D",
  "answer_bbox_xyxy_norm": [0.31, 0.27, 0.58, 0.76]
}
```
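A minimal sketch of reading an image-task manifest and sanity-checking its samples is shown below. It relies only on field names listed in this contract (`samples`, `task_type`, `difficulty`, `label_options`, `answer_option_id`, `answer_bbox_xyxy_norm`). The function names are hypothetical, and the checks are assumptions about well-formed data (a full option list and a box with `x1 < x2`, `y1 < y2` inside the unit square), not rules stated by the release. The official repository remains the preferred loader.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple


def load_image_samples(manifest_path: str) -> List[dict]:
    """Return the `samples` list from an image-task manifest file."""
    with open(manifest_path, encoding="utf-8") as f:
        manifest = json.load(f)
    return manifest["samples"]


def check_image_sample(sample: dict) -> List[str]:
    """Return human-readable problems found in one sample (empty if none)."""
    problems = []
    option_ids = {opt["option_id"] for opt in sample.get("label_options", [])}
    if sample.get("answer_option_id") not in option_ids:
        problems.append("answer_option_id is not among label_options")
    box = sample.get("answer_bbox_xyxy_norm")
    if box is not None:  # some samples may carry no gold box
        x1, y1, x2, y2 = box
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            problems.append("answer_bbox_xyxy_norm is not a valid normalized box")
    return problems


def count_by_task(samples: List[dict]) -> Dict[Tuple[str, str], int]:
    """Count samples per (task_type, difficulty) pair."""
    return Counter((s.get("task_type"), s.get("difficulty")) for s in samples)
```

Note that the representative sample above elides most of its options, so running `check_image_sample` on that truncated snippet would report a problem; real manifest entries are expected to list every option.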
### 8.3 Video-task manifest contract
`task_data/<scene>/video_tasks/datasets/<scene>.latest_manifest.json`
Top-level fields include:
- generation metadata,
- scene id and engine,
- released forms,
- task-group flags,
- `samples`,
- manifest-level `summary`.
Each sample contains fields such as:
- `sample_id`
- `form`
- `task_group`
- `task_name`
- `task_display_name`
- `mission_id`
- `mission_family`
- `landmark_id`
- `reference_image_with_bbox`
- `overview_image`
- `keyframe_board_image`
- `video_path`
- `video_web_path`
- `fps`
- `frame_count`
- `flight_description`
- `visible_count`
- `visible_intervals_sec`
- `difficulty_band`
- `choice_options`
- `answer_option_ids`
- `answer_items`
This contract is the benchmark-facing video task interface.
Representative sample shape:
```json
{
  "sample_id": "env_7_batch_env_7_10_55_atomic_0075_self_instance_recognition_joint_000001",
  "form": "self_instance_recognition_joint",
  "task_group": "self-state",
  "mission_id": "batch_env_7_10_55_atomic_0075",
  "landmark_id": "10_55",
  "reference_image_with_bbox": "task_data/airsim_env_7/video_tasks/cache/assets/reference_bbox/10_55/....jpg",
  "video_path": "task_data/airsim_env_7/video_tasks/missions/.../final_task/task_rgb.mp4",
  "video_web_path": "task_data/airsim_env_7/video_tasks/missions/.../final_task/task_rgb_web.mp4",
  "fps": 5,
  "frame_count": 157,
  "visible_count": 1,
  "visible_intervals_sec": [{"start_sec": 0.0, "end_sec": 2.7}],
  "difficulty_band": "easy",
  "choice_options": [
    {"option_id": "A", "label": "..."},
    {"option_id": "B", "label": "..."}
  ],
  "answer_option_ids": ["C"],
  "answer_items": [
    {"option_id": "C", "label": "...", "intervals_sec": [{"start_sec": 1.2, "end_sec": 6.8}]}
  ]
}
```
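The timing fields in a video sample can be cross-checked against each other. The sketch below derives the clip duration from `frame_count` and `fps` and verifies that every annotated interval lies inside it. Two assumptions are made here and are not stated by the release: that the video has a constant frame rate (so duration is `frame_count / fps`), and that `visible_count` equals the number of entries in `visible_intervals_sec`. The function names are hypothetical.

```python
from typing import List


def video_duration_sec(sample: dict) -> float:
    """Clip duration in seconds, assuming a constant frame rate."""
    return sample["frame_count"] / sample["fps"]


def check_video_sample(sample: dict, tol: float = 1e-6) -> List[str]:
    """Return human-readable problems found in one video sample."""
    problems = []
    duration = video_duration_sec(sample)
    visible = sample.get("visible_intervals_sec", [])
    # Assumption: one visibility count per listed visible interval.
    if sample.get("visible_count") != len(visible):
        problems.append("visible_count differs from the number of visible intervals")
    spans = list(visible)
    for item in sample.get("answer_items", []):
        spans.extend(item.get("intervals_sec", []))
    for iv in spans:
        if not (0.0 <= iv["start_sec"] < iv["end_sec"] <= duration + tol):
            problems.append(f"interval outside [0, duration]: {iv}")
    return problems
```

For the representative sample above, 157 frames at 5 fps give a duration of 31.4 seconds, and both the visibility interval (0.0 to 2.7 s) and the answer interval (1.2 to 6.8 s) fall inside it.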
### 8.4 Mission-level Stage 3 contract
`task_data/<scene>/video_tasks/missions/<mission_id>/final_task/task_data.json`
This file is the mission-level ground-truth contract behind Stage 3 tasks. It contains:
- `video`
  - media paths,
  - frame manifests,
  - fps,
  - frame counts,
  - video dimensions,
  - capture dimensions;
- `target_presence`
  - frame-level or interval-level target presence information;
- `task_tracks`
  - task-specific supervision for:
    - `environmental_awareness`,
    - `self_state_awareness`.
This file is the correct entry point when a user needs mission-level temporal supervision rather than only released sample-level manifests.
In practice:
- use `video_tasks/datasets/<scene>.latest_manifest.json` for benchmark inference and leaderboard-style evaluation;
- use `missions/<mission_id>/final_task/task_data.json` when mission-level temporal supervision or frame-level inspection is needed.
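The four contract locations described in this section can be assembled from a scene name and an optional mission id, as in the sketch below. The path templates come from this guide; the helper name `contract_paths` is hypothetical, and the code assumes that the `<scene>` placeholder is the same string in the directory name and in the file name (for example `airsim_env_7`), which should be verified against the actual release.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def contract_paths(scene: str, mission_id: Optional[str] = None) -> Dict[str, str]:
    """Build repository-relative paths for the public contract files.

    Assumption: `scene` is used both as the directory name and as the
    file-name prefix.
    """
    scene_root = PurePosixPath("scene_data") / scene
    task_root = PurePosixPath("task_data") / scene
    paths = {
        "valid_instances": scene_root / "landmarks_review" / f"{scene}.valid_instances.json",
        "image_manifest": task_root / "image_tasks" / "manifests" / f"{scene}.latest_manifest.json",
        "video_manifest": task_root / "video_tasks" / "datasets" / f"{scene}.latest_manifest.json",
    }
    if mission_id is not None:
        # Mission-level ground truth behind the Stage 3 tasks.
        paths["mission_task_data"] = (
            task_root / "video_tasks" / "missions" / mission_id / "final_task" / "task_data.json"
        )
    return {name: str(path) for name, path in paths.items()}
```

A typical flow would read the video manifest for inference, then use a sample's `mission_id` to locate the matching `task_data.json` when frame-level inspection is needed.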
## 9. Media and Path Semantics
Image and video paths stored in manifests are release-facing references, not arbitrary internal cache paths.
For Stage 4:
- `reference_image_with_bbox` points to the released reference image with GT bbox overlay,
- `target_image` points to the released query observation.
Depending on task subtype and release path, Stage 4 media may point either to:
- released task assets under `task_data/.../image_tasks/assets/...`, or
- scene-level source views under `scene_data/.../landmarks_raw/rgb_views/...`.
For Stage 3:
- `video_path` points to the released main task video,
- `video_web_path` points to a web-playable derivative when available,
- `reference_image_with_bbox`, `overview_image`, and `keyframe_board_image` provide auxiliary evidence views,
- `task_data.json -> video.frames_manifest` and `frame_index_map` support frame-level inspection.
If `video_web_path` is empty for a given sample, downstream users should fall back to `video_path`.
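The fallback rule can be expressed in a few lines. In this sketch the helper name `resolve_video_path` and the `root` parameter (the local directory where the repository was downloaded) are hypothetical; the two field names come from the video manifest contract. A missing `video_web_path` key is treated the same as an empty one, which is an assumption.

```python
from pathlib import Path


def resolve_video_path(sample: dict, root: str = ".", prefer_web: bool = True) -> Path:
    """Pick the video file to open for a sample.

    Uses `video_web_path` when requested and non-empty, otherwise falls
    back to `video_path`.
    """
    web = sample.get("video_web_path") or ""
    rel = web if (prefer_web and web) else sample["video_path"]
    return Path(root) / rel
```

Setting `prefer_web=False` always returns the main task video, which may be preferable for evaluation if the web derivative is re-encoded.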
## 10. Usage and Reproduction Pointers
This repository guide intentionally focuses on **release scope and data contracts**.
To avoid divergence and duplicated maintenance, detailed operational steps (environment setup, stage-by-stage commands, benchmark execution, and evaluation scripts) are not repeated here.
Please use the following as the canonical operational references:
- Benchmark website (recommended reading order and Usage page):
  - https://uav-dualcog.lozumi.com/
  - https://uav-dualcog.lozumi.com/usage/
- Official code repository (latest runnable commands and config templates):
  - https://github.com/SmartDianLab/UAV-DualCog
- Simulator package used by construction-stage reproduction:
  - https://www.kaggle.com/datasets/shuboliu/aerialvln-simulators
Practical split for external users:
- Use **this dataset guide** for file contracts, task semantics, and manifest field definitions.
- Use **website + GitHub** for concrete execution instructions and reproducibility workflows.
## 11. Benchmark Provenance
The public release is produced by the four-stage UAV-DualCog construction pipeline:
- Stage 1: scene point-cloud collection and fusion,
- Stage 2: landmark mining, review, and structured annotation,
- Stage 3: behavior-driven mission generation and video task construction,
- Stage 4: landmark-centered image QA generation.
The benchmark website provides:
- task explanations,
- prompt templates,
- examples,
- leaderboard views,
- analysis pages.
Official benchmark site:
- https://uav-dualcog.lozumi.com/
Official code repository:
- https://github.com/SmartDianLab/UAV-DualCog
## 12. Practical Notes for External Users
- Field names should be consumed in their canonical JSON form.
- Task ids such as `self_where` or `env_visibility_reasoning` should be treated as stable benchmark identifiers.
- Display names on the website are reader-facing aliases; manifests retain machine-facing ids.
- Some repository paths may differ slightly across mirrors or release bundles. The canonical structure is the contract described in this guide.
- For actual loading and benchmark evaluation, prefer the official GitHub implementation instead of reimplementing parsers from scratch:
  - https://github.com/SmartDianLab/UAV-DualCog
- For detailed benchmark definitions, construction explanations, and usage walkthroughs, prefer the public benchmark website:
  - https://uav-dualcog.lozumi.com/
- For simulator-backed reproduction, use the released AerialVLN simulator package:
  - https://www.kaggle.com/datasets/shuboliu/aerialvln-simulators
## 13. Citation and License
- License: follow the repository card and platform metadata for the active release.
- Citation: cite the UAV-DualCog dataset release and record the repository version/date used in evaluation.
- Benchmark-facing supplementary explanations are maintained at:
  - https://uav-dualcog.lozumi.com/
- Official loading and evaluation code is maintained at:
  - https://github.com/SmartDianLab/UAV-DualCog
- Simulator dependency for reproduction is maintained at:
  - https://www.kaggle.com/datasets/shuboliu/aerialvln-simulators
Provider: maas
Created: 2026-04-06



