allenai/WildDet3D-Data

Name: allenai/WildDet3D-Data
Creator: allenai
Published: 2026-04-20 01:20:35
License: no description provided

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/allenai/WildDet3D-Data


Description:

---
license: cc-by-4.0
task_categories:
- object-detection
tags:
- 3d-object-detection
- 3d-bounding-box
- monocular-3d
- in-the-wild
- depth-estimation
pretty_name: WildDet3D-Data
size_categories:
- 1M<n<10M
---

# WildDet3D-Data: Dataset Preparation Guide

## Overview

WildDet3D-Data consists of 3D bounding box annotations for in-the-wild images from COCO, LVIS, Objects365, and V3Det. The dataset is split into:

| Split | Description | Annotation Source | Images | Annotations | Categories |
|-------|-------------|-------------------|--------|-------------|------------|
| **Train (Human)** | Human-reviewed annotations only | Human | 102,979 | 229,934 | 11,879 |
| **Train (Essential)** | Human + VLM-qualified small objects | Human + VLM | 102,979 | 412,711 | 12,064 |
| **Train (Synthetic)** | VLM auto-selected annotations | VLM | 896,004 | 3,483,292 | 11,896 |

For val/test benchmarks, see [WildDet3D-Bench](https://huggingface.co/datasets/allenai/WildDet3D-Bench).

## Directory Structure

After downloading and extracting, the dataset should be organized as:

```
WildDet3D-Data/
├── README.md
├── annotations/
│   ├── InTheWild_v3_train_human_only.json      # Train (Human): COCO, LVIS, Obj365
│   ├── InTheWild_v3_train_human.json           # Train (Essential): COCO, LVIS, Obj365
│   ├── InTheWild_v3_train_synthetic.json       # Train (Synthetic): COCO, LVIS, Obj365
│   ├── InTheWild_v3_v3det_human_only.json      # Train (Human): V3Det
│   ├── InTheWild_v3_v3det_human.json           # Train (Essential): V3Det
│   ├── InTheWild_v3_v3det_synthetic.json       # Train (Synthetic): V3Det
│   └── InTheWild_v3_*_class_map.json           # Category mappings
├── depth/{split}/                              # Monocular depth maps (extract from .tar.gz)
│   └── {source}_{formatted_id}.npz             # float32 .npz at original resolution
├── camera/{split}/                             # Camera parameters (extract from .tar.gz)
│   └── {source}_{formatted_id}.json            # Camera intrinsics (K)
├── masks/                                      # SAM2 instance masks (optional, Stage 3 training)
│   ├── obj365/
│   │   ├── obj365_train_with_masks.json        # ~5.4 GB
│   │   └── obj365_val_with_masks.json          # ~268 MB
│   └── v3det/
│       └── v3det_2023_v1_masks_all.json        # ~2.0 GB
└── images/                                     # Downloaded separately (see Step 2)
    ├── coco_train/
    ├── obj365_train/
    └── v3det_train/
```

## Depth and Camera File Naming

Depth maps and camera parameters are named as `{source}_{formatted_id}`, where `{source}` is derived from the image's `file_path` field in the annotation JSON:

| file_path | Depth / Camera filename |
|-----------|------------------------|
| `images/coco_val/000000000724.jpg` | `coco_val_000000000724.npz/.json` |
| `images/coco_train/000000262686.jpg` | `coco_train_000000262686.npz/.json` |
| `images/obj365_train/obj365_train_000000628903.jpg` | `obj365_train_000000628903.npz/.json` |
| `images/v3det_train/Q100507578/28_284_....jpg` | `v3det_train_000000000915.npz/.json` |

**Note:** Some images from COCO and LVIS share the same underlying image file (LVIS uses COCO images). These appear as separate entries in the annotation JSON (with different annotations) but map to the same depth/camera file.

To load the depth/camera for an image entry, extract the source prefix from `file_path.split("/")[1]` and combine it with `formatted_id`.

```python
import json

import numpy as np

# Example: load depth and camera for an image
# Assumption: this annotation file pairs with the train_human depth/camera split
split = "train_human"
with open("annotations/InTheWild_v3_train_human_only.json") as f:
    data = json.load(f)

img = data["images"][0]
source = img["file_path"].split("/")[1]  # e.g., "coco_train"
fid = img["formatted_id"]                # e.g., "000000262686"
depth_mm = np.load(f"depth/{split}/{source}_{fid}.npz")["depth"]  # float32, (H, W), in mm
depth_m = depth_mm / 1000.0              # convert to meters
with open(f"camera/{split}/{source}_{fid}.json") as f:
    camera = json.load(f)
```

### Depth Format

Each `.npz` file contains a single key `"depth"` with a float32 2D array at original image resolution. **Values are in millimeters (mm).** To convert to meters: `depth_m = depth_mm / 1000.0`.
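As a minimal sketch of how a depth map and the intrinsic matrix `K` from the matching camera JSON can be combined, the snippet below lifts each pixel to a 3D point in meters. The function name `backproject_depth` is hypothetical, and the code assumes a plain pinhole model (no skew or distortion) with depth measured along the optical axis; the source does not state either, so treat both as assumptions.

```python
import numpy as np


def backproject_depth(depth_mm, K):
    """Lift a depth map in millimeters to an (H, W, 3) array of points in meters.

    Assumes a pinhole camera and z-depth (distance along the optical axis).
    """
    depth_m = np.asarray(depth_mm, dtype=np.float32) / 1000.0  # mm -> m
    K = np.asarray(K, dtype=np.float32)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    h, w = depth_m.shape
    # u indexes columns (x direction), v indexes rows (y direction)
    u, v = np.meshgrid(
        np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32)
    )
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return np.stack([x, y, depth_m], axis=-1)
```

With the variables from the loading example, a call would look like `points = backproject_depth(depth_mm, camera["K"])`.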
### Camera Format

Each `.json` file contains:

```json
{
  "K": [[fx, 0, cx], [0, fy, cy], [0, 0, 1]],
  "image_size": [height, width]
}
```

- **`K`**: Camera intrinsic matrix (3x3), at original image resolution
- **`image_size`**: `[height, width]` of the original image

## Mask Annotations (optional, Stage 3 training only)

Stage 3 training samples positive points from instance masks. COCO and LVIS already ship masks in their official annotations, so we only provide SAM2-generated masks for Objects365 and V3Det (neither ships masks upstream). Stages 1 and 2 do not use masks at all.

### What we host

| File | Size | Used by |
|---|---|---|
| `masks/obj365/obj365_train_with_masks.json` | ~5.4 GB | Stage 3, Obj365 train |
| `masks/obj365/obj365_val_with_masks.json` | ~268 MB | Stage 3, Obj365 val |
| `masks/v3det/v3det_2023_v1_masks_all.json` | ~2.0 GB | Stage 3, V3Det train |

Each JSON is COCO-format with SAM2-predicted `segmentation` masks added alongside the existing detection annotations. Load as a normal COCO JSON; use `pycocotools` to decode the RLE masks.
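To illustrate what "sampling positive points from instance masks" can look like once a mask has been decoded to a binary array (for example with `pycocotools.mask.decode`), here is a small NumPy sketch. The function `sample_positive_points` and its uniform sampling strategy are illustrative assumptions, not the sampler used in Stage 3 training.

```python
import numpy as np


def sample_positive_points(mask, num_points, seed=None):
    """Sample up to `num_points` (x, y) pixel coordinates from the mask foreground.

    `mask` is a 2D array whose nonzero entries mark the object.
    Returns an array of shape (k, 2) with k <= num_points.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(np.asarray(mask))
    if ys.size == 0:
        # Empty mask: nothing to sample
        return np.empty((0, 2), dtype=np.int64)
    k = min(num_points, ys.size)
    idx = rng.choice(ys.size, size=k, replace=False)  # uniform, no repeats
    return np.stack([xs[idx], ys[idx]], axis=1)
```

Points are returned as `(x, y)` rather than `(row, column)` so they line up with the `[x1, y1, x2, y2]` box convention used in the annotations.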
### What you need to download yourself

For COCO and LVIS, grab the official annotation JSONs (they already contain `segmentation` for every annotation) and place them next to the SAM2 files:

```bash
# COCO train/val 2017 annotations (segmentation included)
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip annotations_trainval2017.zip  # yields annotations/instances_{train,val}2017.json

# LVIS v1 train/val annotations (segmentation included)
wget https://dl.fbaipublicfiles.com/LVIS/lvis_v1_train.json.zip
wget https://dl.fbaipublicfiles.com/LVIS/lvis_v1_val.json.zip
unzip lvis_v1_train.json.zip
unzip lvis_v1_val.json.zip
```

Expected layout under `data/masks/` once all four sources are in place:

```
data/masks/
├── coco/annotations/instances_{train,val}2017.json   # COCO official
├── lvis/lvis_v1_{train,val}.json                     # LVIS official
├── obj365/obj365_{train,val}_with_masks.json         # our SAM2 (this HF)
└── v3det/v3det_2023_v1_masks_all.json                # our SAM2 (this HF)
```

## Step 1: Download and Extract

```bash
pip install huggingface_hub

# Download only annotations
huggingface-cli download weikaih/WildDet3D-Data --repo-type dataset --include "annotations/*" --local-dir WildDet3D-Data

# Download specific splits (e.g., val only)
huggingface-cli download weikaih/WildDet3D-Data --repo-type dataset --include "packed/depth_val.tar.gz" "packed/camera_val.tar.gz" --local-dir WildDet3D-Data

# Download everything
huggingface-cli download weikaih/WildDet3D-Data --repo-type dataset --local-dir WildDet3D-Data
```

### Extract Depth Maps

Depth maps are provided as compressed archives. Large splits are split into multiple parts.

The `train_human` / `train_synthetic` archives have their `.npz` files at the tar root (no wrapping directory), so extract each of them into a subdir that matches `{split}` in the code. The `v3det_*` archives already wrap their contents in a `v3det_{human,synthetic}/` dir and extract directly.
```bash
# Train Human (2 parts, extract into depth/train_human/)
mkdir -p depth/train_human
tar xzf packed/depth_train_human_part000.tar.gz -C depth/train_human
tar xzf packed/depth_train_human_part001.tar.gz -C depth/train_human

# V3Det Human (single file, tar already contains v3det_human/ prefix)
tar xzf packed/depth_v3det_human.tar.gz -C depth

# Train Synthetic (16 parts, extract into depth/train_synthetic/)
mkdir -p depth/train_synthetic
# seq -w 0 15 yields 00..15; one extra leading zero gives the three-digit
# suffix used by the other archives (part000 ... part015, assumed here)
for i in $(seq -w 0 15); do
  tar xzf packed/depth_train_synthetic_part0${i}.tar.gz -C depth/train_synthetic
done

# V3Det Synthetic (7 parts, tar already contains v3det_synthetic/ prefix)
for i in $(seq -w 0 6); do
  tar xzf packed/depth_v3det_synthetic_part00${i}.tar.gz -C depth
done
```

After extraction you should have `depth/train_human/<source>_<formatted_id>.npz`, `depth/v3det_human/v3det_train_<formatted_id>.npz`, and so on.

### Extract Camera Parameters

```bash
mkdir -p camera && cd camera
for f in ../packed/camera_*.tar.gz; do tar xzf "$f"; done
cd ..
```

After extraction, you should have `depth/{split}/` and `camera/{split}/` directories with individual files per image.
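A quick way to confirm that extraction produced one depth file and one camera file per image is to walk the annotation's `images` list and check for the expected paths. The sketch below follows the `{source}_{formatted_id}` naming rule from this guide; the helper names `file_stem` and `find_missing` are hypothetical, and the dataset root and split name are whatever you used during extraction.

```python
from pathlib import Path


def file_stem(image):
    """Build the `{source}_{formatted_id}` stem for one image entry."""
    source = image["file_path"].split("/")[1]  # e.g., "coco_train"
    return f"{source}_{image['formatted_id']}"


def find_missing(images, root, split):
    """Return the expected depth/camera paths that do not exist under `root`."""
    root = Path(root)
    missing = []
    for image in images:
        stem = file_stem(image)
        expected = (
            root / "depth" / split / f"{stem}.npz",
            root / "camera" / split / f"{stem}.json",
        )
        for path in expected:
            if not path.is_file():
                missing.append(path)
    return missing
```

For example, `find_missing(data["images"], "WildDet3D-Data", "train_human")` returns an empty list when every image in the loaded annotation file has both files in place.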
## Step 2: Download Source Images

Images must be downloaded from their original sources and organized into the following structure:

```
images/
├── coco_train/      # COCO train2017 (includes LVIS images)
├── obj365_train/    # Objects365 training
└── v3det_train/     # V3Det training
```

### COCO train2017

```bash
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
mkdir -p images/coco_train
mv train2017/* images/coco_train/
```

### Objects365

```bash
# Objects365: download from https://www.objects365.org/
mkdir -p images/obj365_train
# Images should be named: obj365_train_000000XXXXXX.jpg
```

### V3Det

Used by: Train V3Det splits only

```bash
# V3Det: download from https://v3det.openxlab.org.cn/
mkdir -p images/v3det_train
# Directory structure: images/v3det_train/{category_folder}/{image}.jpg
# e.g., images/v3det_train/Q100507578/28_284_50119550013_7d06ded882_c.jpg
```

| Source | Directory |
|--------|-----------|
| COCO train2017 | `images/coco_train/` |
| Objects365 train | `images/obj365_train/` |
| V3Det train | `images/v3det_train/` |

## Annotation Format (COCO3D)

Each annotation JSON follows the COCO3D format:

```json
{
  "info": {"name": "InTheWild_v3_val"},
  "images": [{
    "id": 0,
    "width": 375,
    "height": 500,
    "file_path": "images/coco_val/000000000724.jpg",
    "K": [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
  }],
  "categories": [{"id": 0, "name": "stop sign"}],
  "annotations": [{
    "id": 0,
    "image_id": 0,
    "category_id": 0,
    "category_name": "stop sign",
    "bbox2D_proj": [x1, y1, x2, y2],
    "center_cam": [cx, cy, cz],
    "dimensions": [width, height, length],
    "R_cam": [[r00, r01, r02], [r10, r11, r12], [r20, r21, r22]],
    "bbox3D_cam": [[x, y, z], ...],
    "valid3D": true
  }]
}
```

**Image fields:**

- **`K`**: Camera intrinsic matrix (3x3), at original image resolution
- **`file_path`**: Relative path to the source image

**Annotation fields:**

- **`valid3D`**: `true` = valid 3D annotation, `false` = 3D box is filtered out (see note below)
- **`center_cam`**: 3D box center in camera coordinates (meters)
- **`dimensions`**: `[width, height, length]` in meters (Omni3D convention)
- **`R_cam`**: 3x3 rotation matrix in camera coordinates (gravity-aligned, local Y = up)
- **`bbox3D_cam`**: 8 corner points of the 3D bounding box in camera coordinates
- **`bbox2D_proj`**: 2D bounding box `[x1, y1, x2, y2]` at original image resolution

**Important: `valid3D` filtering.** Each annotation always has a valid 2D bounding box (`bbox2D_proj`), but the 3D box fields (`center_cam`, `dimensions`, `R_cam`, `bbox3D_cam`) should only be used when `valid3D=true`. Annotations with `valid3D=false` have 3D boxes that were filtered out due to quality checks (human rejection, size/geometry filtering, or depiction filtering); their 3D fields contain placeholder values and should be ignored. The annotation counts in the overview table refer to `valid3D=true` annotations only.

For training, filter annotations by `valid3D`:

```python
for ann in data["annotations"]:
    if ann["valid3D"]:
        # Use both 2D and 3D annotations
        ...
    else:
        # 2D box is still valid, but skip 3D box
        ...
```

## Which Files to Use

| Use Case | Annotation Files |
|----------|-----------------|
| Train (Human only) | `InTheWild_v3_train_human_only.json` + `InTheWild_v3_v3det_human_only.json` |
| Train (Essential) | `InTheWild_v3_train_human.json` + `InTheWild_v3_v3det_human.json` |
| Train (Synthetic) | `InTheWild_v3_train_synthetic.json` + `InTheWild_v3_v3det_synthetic.json` |
| Train (All) | Essential + Synthetic (all 4 files) |

## License

- **Annotations**: CC BY 4.0

## Paper

[WildDet3D: Scaling Promptable 3D Detection in the Wild](https://arxiv.org/abs/2604.08626)
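Returning to the annotation format, the following sketch shows how the `valid3D` flag, `center_cam`, and the per-image `K` fit together: it keeps only annotations with a valid 3D box and projects each box center into pixel coordinates. The function name `project_valid_centers` is hypothetical, and the code assumes a pinhole camera with +z pointing forward (the Omni3D-style convention); the source does not spell out the axis convention, so verify it against your data before relying on the result.

```python
import numpy as np


def project_valid_centers(data):
    """Map annotation id -> (u, v) pixel position of the projected 3D box center.

    Annotations with valid3D=false are skipped, as are centers with z <= 0.
    """
    K_by_image = {
        img["id"]: np.asarray(img["K"], dtype=np.float64)
        for img in data["images"]
    }
    centers = {}
    for ann in data["annotations"]:
        if not ann["valid3D"]:
            continue  # 3D fields hold placeholder values
        K = K_by_image[ann["image_id"]]
        x, y, z = ann["center_cam"]
        if z <= 0:
            continue  # behind the camera under the assumed convention
        u = K[0, 0] * x / z + K[0, 2]
        v = K[1, 1] * y / z + K[1, 2]
        centers[ann["id"]] = (float(u), float(v))
    return centers
```

A projected center that falls far outside the annotation's `bbox2D_proj` is a useful hint that the assumed axis convention or the image/annotation pairing needs a second look.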

Provider:

allenai
