
robotflowlabs/nighthawk-mega

Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/robotflowlabs/nighthawk-mega
Description:
---
language:
- en
license: apache-2.0
task_categories:
- image-classification
- object-detection
- image-to-image
- image-to-text
tags:
- uav
- drone
- aerial-imagery
- surveillance
- thermal-imaging
- multi-condition
- domain-adaptation
- night-augmentation
- adverse-weather
- captioned
- yolo
- gemma-4
- vlm-captions
- webdataset
size_categories:
- 1M<n<10M
pretty_name: "Nighthawk Mega: 1.72M Multi-Condition UAV Imagery with VLM Captions"
configs:
- config_name: source
  default: true
  data_files:
  - split: train
    path: webdataset/source-*.tar
- config_name: day2night
  data_files:
  - split: train
    path: webdataset/day2night-*.tar
- config_name: day2dusk
  data_files:
  - split: train
    path: webdataset/day2dusk-*.tar
- config_name: day2fog
  data_files:
  - split: train
    path: webdataset/day2fog-*.tar
- config_name: day2rain
  data_files:
  - split: train
    path: webdataset/day2rain-*.tar
- config_name: rgb2thermal
  data_files:
  - split: train
    path: webdataset/rgb2thermal-*.tar
- config_name: metadata
  data_files:
  - split: train
    path: metadata/all.parquet
---

# Nighthawk Mega: 1.72M Captioned UAV Aerial Images Across 6 Conditions

> **Every drone image, in every condition, fully captioned. The first dataset of its kind.**
>
> 1,718,541 captioned images · 5 synthesized adverse conditions · 1.4M YOLO labels · 5 trained translation models · One reproducible pipeline.

![Nighthawk Mega: 1.72M captioned UAV images across 6 conditions](assets/hero.png)

_Same drone, six modalities: original RGB, night, dusk, fog, rain, thermal, with every image captioned._

## Why this exists

UAV computer vision has a deployment problem. **Models are trained on daytime RGB. Real surveillance happens at night, in fog, in rain, with thermal sensors.** When models meet reality, they fail, sometimes catastrophically.

Collecting and labeling drone footage in every adverse condition would take years and cost millions per dataset. So nobody does it.

We did it anyway. Differently.

We took **318,941 real aerial source images** aggregated from 13 public datasets, then used **trained CUT translation models** to synthesize the UAV subset across **5 adverse conditions** (night, dusk, fog, rain, thermal). Then we ran **Google Gemma-4 multimodal models** (31B-it for the source images, E4B-it for the synthesized conditions) over **every single image**, all 1.72 million of them, to generate dense, condition-aware natural language captions. Plus paired YOLO detections on every translated image. Plus the trained translation models. Plus the entire pipeline, reproducible end-to-end.

This is the largest publicly-released, fully-captioned, multi-condition UAV aerial dataset that exists.

## TL;DR by the numbers

| What | Count |
|---|---|
| Total captioned images | **1,718,541** |
| Source aerial images (13 datasets) | 318,941 |
| UAV source images translated to 5 conditions | 279,920 |
| Conditions synthesized per UAV image | 5 (night, dusk, fog, rain, thermal) |
| Caption corpus size | 1.49 GB of natural language text (6.9 GB on disk) |
| Mean caption length | ~134 words (median 123) |
| YOLO bounding box labels | 1,399,600 (paired with every translated image) |
| Translation models included | 5 (CUT-based, PyTorch + Safetensors) |
| Total WebDataset size | ~172 GB (tar shards) |
| GPU-hours to generate captions | ~25 (8× L4) |
| License | Apache 2.0 |

## What's in the box

| Subset          |   Images |   Captions | YOLO Labels | Description |
|-----------------|---------:|-----------:|------------:|---|
| **source**      |  318,941 |    318,941 |           - | Original aerial imagery (10 UAV + LLVIP + MineInsight) |
| **day2night**   |  279,920 |    279,920 |     279,920 | Synthesized nighttime via CUT |
| **day2dusk**    |  279,920 |    279,920 |     279,920 | Synthesized dusk/twilight |
| **day2fog**     |  279,920 |    279,920 |     279,920 | Synthesized fog (CUDA atmospheric scattering kernel + CUT polish) |
| **day2rain**    |  279,920 |    279,920 |     279,920 | Synthesized rainy conditions |
| **rgb2thermal** |  279,920 |    279,920 |     279,920 | Synthesized thermal infrared |
| **TOTAL**       | **1,718,541** | **1,718,541** | **1,399,600** | |

**Storage:** ~172 GB WebDataset tar shards · 1.49 GB caption text · ~150 MB parquet metadata index.

_Note: earlier plans referenced SAM3 segmentation masks. Those were deferred to the v2 release and are **not** shipped in this corpus._

## Source Datasets

The 318,941 source images aggregate the following public aerial datasets. The 10 UAV subsets (totalling 279,920 images) are the ones translated across all 5 adverse conditions. LLVIP + MineInsight ship in `source` only, giving researchers real night-RGB and ground-LWIR imagery in the same repo.

### UAV subsets (translated across all 5 adverse conditions)

| Dataset                    |   Images | Domain |
|----------------------------|---------:|---|
| BirdDrone (drone_full)     |  110,781 | Drone vs bird classification, full sequences |
| Seraphim                   |   75,138 | Annotated UAV detection bounding boxes |
| BirdDrone (bird_full)      |   30,225 | Bird sequences for negative samples |
| DroneVehicle               |   17,238 | Aerial vehicle detection |
| Baidu UAV                  |   14,713 | Cloudy / clear UAV scenes |
| DUT-Anti-UAV (full)        |   10,000 | Anti-UAV surveillance |
| VisDrone                   |    8,629 | Multi-class aerial detection |
| BirdDrone (bird)           |    6,500 | Curated bird samples |
| DUT-Anti-UAV (curated)     |    5,200 | Validation subset |
| BirdDrone (drone)          |    1,500 | Curated drone samples |
| **UAV Subtotal**           | **279,920** | |

### Source-only extras (present in `source` config, not translated)

| Dataset                    |   Images | Domain |
|----------------------------|---------:|---|
| MineInsight (RGB)          |   21,741 | Underground mining, RGB surface imagery |
| LLVIP (visible)            |   15,488 | Paired RGB+IR street scenes (night-RGB reference) |
| MineInsight (LWIR)         |    1,792 | Underground mining, long-wave thermal |
| **Extras Subtotal**        | **39,021** | |

**Source grand total: 318,941 images across 13 subsets.** Each subset retains its original split identity as the `subset` field inside every WebDataset sample.

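
As a quick sanity check, the headline totals can be re-derived from the subtotals quoted above. The sketch below only redoes that arithmetic; the counts are copied from the tables, and the variable names are illustrative rather than part of the dataset or its tooling.

```python
# Counts copied from the tables above; names are illustrative only.
UAV_SOURCE_IMAGES = 279_920    # UAV subtotal, translated into every condition
SOURCE_ONLY_EXTRAS = 39_021    # LLVIP + MineInsight, shipped in `source` only
CONDITIONS = ("day2night", "day2dusk", "day2fog", "day2rain", "rgb2thermal")

# Every UAV source image appears once in `source` and once per condition.
source_total = UAV_SOURCE_IMAGES + SOURCE_ONLY_EXTRAS
translated_total = UAV_SOURCE_IMAGES * len(CONDITIONS)
grand_total = source_total + translated_total

print(f"source images:     {source_total:,}")      # 318,941
print(f"translated images: {translated_total:,}")  # 1,399,600
print(f"captioned images:  {grand_total:,}")       # 1,718,541
```

The translated total equals the YOLO label count because each translated image carries one paired label file, while the `source` config carries none.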
## Why this dataset matters

Most UAV detection and segmentation models are trained on daytime RGB only. Deployment reality is harsher: surveillance happens at night, in fog, in rain, with thermal cameras, often all at once. Collecting and labeling real footage in every condition would take years and cost millions.

We took a different path: **synthesize the conditions, validate the synthesis, then caption everything densely** so downstream models can learn condition-aware representations from natural language supervision.

The result is the first UAV dataset that:

1. **Covers all 5 adverse modalities** at scale (night, dusk, fog, rain, thermal), with ~280K image variants per condition
2. **Has dense natural-language captions** (mean ~134 words per image) on every single image, generated by the Gemma-4 family of multimodal VLMs
3. **Includes the trained CUT translation models** so you can synthesize new conditions for your own data
4. **Provides paired YOLO bounding-box annotations** on every translated image for multi-task learning (detection + captioning + condition classification)
5. **Documents the entire pipeline reproducibly**, from raw datasets to final captions

## How the conditions were synthesized

Each adverse condition was generated by a **Contrastive Unpaired Translation (CUT)** model trained on real reference data:

| Condition    | Method                                            | Reference data used |
|--------------|---------------------------------------------------|---|
| day2night    | CUT (NCE=0.62, CLIP=0.72)                         | DroneVehicle-night (34K) + TIR-RGB-UAV (400) |
| day2dusk     | CUT (NCE=0.54, CLIP=0.75, early stop)             | Curated dusk sequences |
| day2rain     | CUT (NCE=0.13, CLIP=0.81)                         | nuScenes rain scenes (39K) |
| day2fog      | CUDA atmospheric scattering kernel (sm_89) + CUT  | Physical model + learned polish |
| rgb2thermal  | CUT (NCE=0.45, CLIP=0.74)                         | VT5000 + VT1000 + VT821 + scidb_satvideoirsdt |

All translation models are released separately in the [Project Nighthawk model collection](https://huggingface.co/ilessio-aiflowlab/nighthawk-models) under Apache 2.0, in both PyTorch (`.pth`) and Safetensors formats.

## Caption generation

Captions were generated using **Google Gemma-4 multimodal models** running on vLLM 0.19, configured for data-parallel inference on 8× NVIDIA L4 (23GB) GPUs.

- **Source-image captioning**: `google/gemma-4-31B-it` (TP=4), fast pass on the 318K source images.
- **Condition-aware captioning**: `google/gemma-4-E4B-it` with per-condition prompts (a different prompt per modality: night/dusk/fog/rain prompts include explicit lighting/atmosphere cues; thermal reuses the source caption with a thermal-modality suffix).
- **Sampling:** temperature=0 (deterministic), max_tokens=160–256.
- **Throughput:** 15–23 img/s aggregate across 8 GPUs, depending on model size.
- **Total compute:** ~25 GPU-hours for 1.72M captions.
- **Quality:** mean caption length ~134 words (median 123); rich object/scene/lighting descriptions.

### Caption examples

**Source (drone close-up):**

> A close-up, low-angle shot of a white quadcopter drone captured against a plain, white background. The drone is positioned centrally in the frame, appearing to be in flight with its four propellers blurred to indicate motion. The drone features a white plastic chassis with a central body and four extending arms. Small green LED lights are visible on the front arms, and a red LED light is visible on the rear right arm. A camera gimbal system is mounted underneath the main body, pointing downwards...

**day2night:**

> This is a low-light, nighttime aerial image, likely translated from a daytime drone photograph. The scene is very dark, with reduced color saturation. The image is composed of several panels, suggesting a sequence or composite view. Visible elements include dark terrain or ground features. There are several distinct, illuminated objects that appear to be drones or aerial vehicles, visible in various poses across the panels...

**day2fog:**

> The image is a heavily fogged aerial view, characterized by low visibility and desaturated colors due to atmospheric scattering. The scene is dominated by a dense, white fog layer that obscures most terrain features. No distinct terrain, vehicles, people, buildings, or birds are clearly visible. The spatial layout is indistinct due to the fog density, which appears thick and pervasive across the entire visible area.

**day2rain:**

> A dark, low-visibility aerial image, likely taken from a drone in rainy conditions. The scene is dominated by a large, multi-rotor drone in the center, which appears to be equipped with various sensors or payloads hanging beneath it. The surrounding environment is indistinct due to heavy overcast and rain, showing muted, dark tones typical of a wet, gloomy day.

## Auxiliary annotations

- **YOLOv8 detection labels**: auto-generated at 320×320, validated, paired with every translated image (1,399,600 total). Included inside each WebDataset sample as the `.cls` / `.json` sidecars.

This makes Nighthawk Mega usable for multi-task learning: detection + captioning + cross-condition domain adaptation, all from a single corpus.

## Repository layout

This dataset is distributed as **WebDataset tar shards** plus a small **Parquet metadata index**, both optimised for streaming.

```
robotflowlabs/nighthawk-mega/
├── README.md                       # this file
├── LICENSE                         # Apache 2.0
├── LICENSES_SOURCES.md             # per-source-dataset attribution
├── QUICKSTART_TRAINING.md          # 60-second training recipes
├── assets/
│   └── hero.png
├── metadata/
│   └── all.parquet                 # ~170 MB index of every sample (image_path, caption, subset, condition, stems, flags)
└── webdataset/
    ├── source-{0000..0053}.tar        # 318,941 source samples
    ├── day2night-{0000..0032}.tar     # 279,920 translated samples
    ├── day2dusk-{0000..0031}.tar
    ├── day2fog-{0000..0008}.tar
    ├── day2rain-{0000..0020}.tar
    └── rgb2thermal-{0000..0022}.tar
```

Each WebDataset sample is keyed by `<subset>__<stem>[_<condition>]` and contains:

- `.jpg`: image bytes
- `.txt`: caption (plain text, UTF-8)
- `.json`: metadata blob (subset, condition, original filename, etc.)
- `.cls`: condition class index (0=source, 1=day2night, 2=day2dusk, 3=day2fog, 4=day2rain, 5=rgb2thermal)

## Loading the dataset

### HuggingFace `datasets` (recommended for most users)

```python
from datasets import load_dataset

# Source RGB only (default config)
ds_source = load_dataset("robotflowlabs/nighthawk-mega", "source", split="train", streaming=True)

# Specific synthesized condition
ds_night = load_dataset("robotflowlabs/nighthawk-mega", "day2night", split="train", streaming=True)

for sample in ds_night:
    image = sample["jpg"]    # PIL.Image
    caption = sample["txt"]  # str
    meta = sample["json"]    # dict with subset / condition / stem
```

### Direct WebDataset (fastest for large-scale training)

```python
import webdataset as wds

URL = "https://huggingface.co/datasets/robotflowlabs/nighthawk-mega/resolve/main/webdataset/day2night-{0000..0032}.tar"

ds = (
    wds.WebDataset(URL, resampled=True)
    .shuffle(1000)
    .decode("pil")
    .to_tuple("jpg", "txt")
)
```

See [QUICKSTART_TRAINING.md](./QUICKSTART_TRAINING.md) for complete recipes (streaming, mixing conditions, parquet-based filtering).

## Use cases

1. **Domain-adaptive UAV detection**: train detectors that generalise across day/night/weather without expensive real-world collection.
2. **Vision-language model fine-tuning**: 1.72M dense aerial captions for VLM domain adaptation.
3. **Conditional image generation evaluation**: paired source + 5 conditions = ground truth for any image-to-image model.
4. **Robust feature learning**: contrastive losses across condition pairs of the same scene.
5. **Thermal modality research**: paired RGB↔thermal data for cross-modal alignment, plus real ground-LWIR from MineInsight.
6. **Multi-task learning benchmarks**: single dataset spans detection, captioning, and condition classification.

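
Several of the use cases above (contrastive learning across condition pairs, image-to-image evaluation) need the variants of one scene grouped together. A minimal sketch follows, based on the `<subset>__<stem>[_<condition>]` key convention from the repository layout. It assumes that subset names never contain a double underscore and that the condition suffix is one of the five config names. The example keys in the usage note are invented for illustration and are not real sample keys.

```python
from collections import defaultdict

# Condition suffixes, taken from the config names listed in this README.
CONDITIONS = ("day2night", "day2dusk", "day2fog", "day2rain", "rgb2thermal")


def parse_key(key):
    """Split '<subset>__<stem>[_<condition>]' into (subset, stem, condition).

    Assumption: the first '__' separates subset from stem, and a key with no
    recognised condition suffix belongs to the untranslated `source` config.
    """
    subset, sep, rest = key.partition("__")
    if not sep:
        raise ValueError(f"key has no '__' separator: {key!r}")
    for condition in CONDITIONS:
        suffix = "_" + condition
        if rest.endswith(suffix):
            return subset, rest[: -len(suffix)], condition
    return subset, rest, "source"


def group_by_scene(keys):
    """Map (subset, stem) -> {condition: key} so variants of a scene sit together."""
    scenes = defaultdict(dict)
    for key in keys:
        subset, stem, condition = parse_key(key)
        scenes[(subset, stem)][condition] = key
    return dict(scenes)
```

For example, `group_by_scene(["visdrone__0001", "visdrone__0001_day2night"])` would return one scene holding a `source` entry and a `day2night` entry. In WebDataset-style samples the key is typically exposed under `__key__`. For corpus-scale pairing, the `subset`, `condition`, and stem columns of `metadata/all.parquet` are likely a more robust join key than string parsing.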
## Reproducibility

All code is open source:

- **Translation models + pipeline:** [github.com/RobotFlow-Labs/project_nighthawk](https://github.com/RobotFlow-Labs/project_nighthawk)
- **Caption pipeline:** `scripts/caption_gemma4_fast.py` (source) + `scripts/caption_gemma4_e4b_translated.py` (conditions) in the repo
- **Trained model checkpoints:** [ilessio-aiflowlab/nighthawk-models](https://huggingface.co/ilessio-aiflowlab/nighthawk-models)
- **CUDA kernels for fog scattering:** included in repo (`nighthawk_kernels.cu`, sm_89, ~967 img/s)

## Hardware used

Generation pipeline ran on **8× NVIDIA L4 (23GB each)** for roughly 25 GPU-hours of caption inference plus CUT translation time:

- Source captioning (318K, Gemma-4-31B): ~18 hours aggregate
- Condition-aware captioning (5 × 280K, Gemma-4-E4B): ~7 hours aggregate
- Translation passes (CUT × 4 conditions + fog kernel): several hours per condition

Inference on the released translation models is realistic on a single 8GB consumer GPU.

## Limitations and biases

- **Synthesized conditions are not real conditions.** day2night was trained on real night UAV references, but it's still a learned approximation. Models trained purely on Nighthawk should be validated on real adverse-condition footage before production use.
- **Captions are model-generated.** Gemma-4 is strong but not perfect; captions occasionally hallucinate fine details, especially in foggy/dark scenes. Sample 50 captions before assuming exact factuality.
- **Mixed caption models**: source captions come from Gemma-4-31B, condition captions come from Gemma-4-E4B. Distribution of caption style/length is not perfectly uniform across conditions.
- **Source data biases are inherited.** The 10 UAV source datasets skew toward Asian and European drone footage. LLVIP adds night street-scene RGB. MineInsight adds underground RGB + LWIR. Broader geographic coverage is planned for v2.
- **SAM3 segmentation masks are NOT included** in this release; they were deferred to v2.

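
The advice above to sample 50 captions is awkward to follow on a streaming dataset, which cannot be indexed at random. One option is reservoir sampling, which draws a uniform sample in a single pass. This is a generic sketch of that technique, not part of the released pipeline; the function name, the default seed, and the stand-in caption stream in the usage note are all hypothetical.

```python
import random


def reservoir_sample(items, k, seed=0):
    """Return k items drawn uniformly from an iterable of unknown length.

    Classic single-pass reservoir sampling: the first k items fill the
    reservoir, then item i (0-based) replaces a random slot with
    probability k / (i + 1).
    """
    rng = random.Random(seed)  # fixed seed so the spot check is repeatable
    reservoir = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir
```

With a streamed config this might be called as `reservoir_sample((s["txt"] for s in ds), k=50)`, where `ds` is a dataset loaded as in the loading section. A full pass reads every sample in the config, so wrapping the stream in `itertools.islice` to cap the pass is a reasonable compromise when bandwidth matters, at the cost of sampling only from the first shards.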
## Citation

If you use Nighthawk Mega, please cite:

```bibtex
@dataset{nighthawk_mega_2026,
  title  = {Nighthawk Mega: Multi-Condition UAV Aerial Imagery with Dense VLM Captions},
  author = {AIFlow Labs / RobotFlow Labs},
  year   = {2026},
  url    = {https://huggingface.co/datasets/robotflowlabs/nighthawk-mega},
  note   = {1.72M captioned images across 5 synthesized adverse conditions}
}
```

Please also cite the underlying source datasets you use. See the **Source Datasets** section above for the full list, and `LICENSES_SOURCES.md` for attribution details.

## License

The synthesized images, captions, translation models, and pipeline code are released under **Apache 2.0**. The original source images are redistributed under their respective original licenses; see `LICENSES_SOURCES.md` for per-source attribution and terms. Commercial users must independently verify source dataset licenses before redistribution.

## Acknowledgements

- **Google DeepMind** for releasing the Gemma-4 family with strong multimodal capabilities
- **vLLM project** for the inference engine that made 8-way data-parallel captioning fast
- **The 13 source dataset authors**, whose original collection efforts made this work possible
- **Anthropic Claude Code** for orchestrating the multi-day captioning pipeline

## Status

- [x] Source captioning (318,941 images, Gemma-4-31B)
- [x] day2night captioning (279,920 images, Gemma-4-E4B)
- [x] day2dusk captioning (279,920 images, Gemma-4-E4B)
- [x] day2fog captioning (279,920 images, Gemma-4-E4B)
- [x] day2rain captioning (279,920 images, Gemma-4-E4B)
- [x] rgb2thermal captioning (279,920 images, Gemma-4-E4B)
- [x] WebDataset shards built (172 GB, 172 tar files)
- [x] Parquet metadata index built (~170 MB)
- [x] Upload to HuggingFace (in progress, resumable)
- [ ] v2: SAM3 segmentation masks
- [ ] v2: broader geographic coverage (Shenzhen + Taiwan collections)

## For researchers in a hurry

If you only have 5 minutes, do this:

```python
from datasets import load_dataset

ds = load_dataset("robotflowlabs/nighthawk-mega", "rgb2thermal", split="train", streaming=True)
sample = next(iter(ds))
print(sample["txt"])   # ~134 words describing the thermal aerial scene
sample["jpg"].show()   # the synthesized thermal image
```

That's it: a few lines to start streaming one of the largest paired-modality aerial corpora ever released.

## Comparison with other UAV datasets

| Dataset | Images | Captioned? | Multi-condition? | Thermal? | Year |
|---|---:|:---:|:---:|:---:|:---:|
| VisDrone (2018–2021) | ~10K | No | No | No | 2018 |
| UAVDT | 80K | No | No | No | 2018 |
| LLVIP | 30K | No | No | RGB+IR pairs | 2021 |
| AntiUAV | 318K | No | No | RGB+IR sequences | 2023 |
| BirdDrone | 145K | No | No | No | 2024 |
| **Nighthawk Mega** | **1.72M** | **Yes, every image** | **5 conditions** | **Yes: full set + real LWIR** | **2026** |

Nighthawk Mega isn't competing with these datasets. It's built **on top of them**: re-aggregated, re-rendered across conditions, and densely captioned.

## Share this work

If Nighthawk Mega helps your research:

- Star the repo: [github.com/RobotFlow-Labs/project_nighthawk](https://github.com/RobotFlow-Labs/project_nighthawk)
- Cite the dataset (BibTeX above)
- Tag us: **@AIFlowLabs**, **@RobotFlowLabs**

### One-line tweet (steal this)

> 1,718,541 fully-captioned UAV aerial images across 6 conditions (day/night/dusk/fog/rain/thermal). 5 trained translation models. YOLO labels included. Apache 2.0. Built in 25 GPU-hours on 8× L4. https://huggingface.co/datasets/robotflowlabs/nighthawk-mega

---

**Built by AIFlow Labs · RobotFlow Labs · 2026**

_Want to use this in production? Need a custom variant? Want collaboration on v2? Open an issue on the GitHub repo or reach out via the HF Discussions tab._
Provided by:
robotflowlabs