sirbastiano94/routerset
Source: Hugging Face. Updated 2026-04-02. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/sirbastiano94/routerset
Resource description:
---
pretty_name: routerset
license: mit
size_categories:
- 10K<n<100K
task_categories:
- image-classification
- zero-shot-image-classification
language:
- en
tags:
- remote-sensing
- earth-observation
- multispectral
- multilabel
- satellite-imagery
viewer: false
---
# routerset
`routerset` is a materialized multi-label remote sensing dataset assembled from the current `phi2FM` downstream sources.
## What is kept in this repo
- `manifest.jsonl`: canonical record manifest
- `images/`: materialized `.npy` samples
- `label_vocab.json`: label vocabulary
- `summary.json`: dataset build summary
- `materialization_summary.json`: materialization results
- `plots/`: dataset analysis graphs
The repo is intentionally minimal. There is one canonical dataset layout at the root.
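
The layout above can be read with a few lines of Python. This is a minimal sketch, not the project's own loader: it assumes only that `manifest.jsonl` holds one JSON object per line and that each sample is a standard `.npy` file. The helper names `read_manifest` and `load_sample` are hypothetical, and the manifest's field names are not documented here, so the caller supplies the relative path.

```python
import json
from pathlib import Path

import numpy as np


def read_manifest(manifest_path):
    """Read a JSON Lines manifest into a list of dicts, skipping blank lines."""
    records = []
    with open(manifest_path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_sample(repo_root, relative_path):
    """Load one materialized array given its path relative to the repo root.

    Which manifest field holds this path is an assumption left to the caller.
    """
    return np.load(Path(repo_root) / relative_path)
```

A local copy of the repo can be fetched with `huggingface_hub.snapshot_download(repo_id="sirbastiano94/routerset", repo_type="dataset")`. Inspect the keys of the first record returned by `read_manifest` before relying on any particular field name.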
## Current snapshot
- Records: `15,731`
- Materialized samples: `15,731` `.npy` files
- Included datasets: `fire`, `burned_area`, `anomaly_detection`, `worldfloods`, `lc`, `roads`
- Missing dataset: `building`
- Label cardinality: min `0`, max `4`, mean `1.2793`
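
Label cardinality is the number of labels attached to a record. The sketch below shows how the min, max, and mean figures could be recomputed from manifest records. The key name `labels` is an assumption for illustration. Check the actual manifest schema before using it.

```python
def label_cardinality(records, labels_key="labels"):
    """Return (min, max, mean) of the number of labels per record.

    `labels_key` is a hypothetical field name; records without it count as 0.
    """
    counts = [len(rec.get(labels_key) or []) for rec in records]
    if not counts:
        raise ValueError("no records to summarize")
    return min(counts), max(counts), sum(counts) / len(counts)
```

A minimum of `0` means that at least one record carries no labels at all.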
### Per-dataset totals
- `fire`: `1,600`
- `burned_area`: `1,299`
- `anomaly_detection`: `9,216`
- `worldfloods`: `2,353`
- `lc`: `63`
- `roads`: `1,200`
### Status counts
- `positive`: `14,251`
- `explicit_negative`: `949`
- `below_threshold`: `531`
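
As a quick consistency check, both the per-dataset totals and the status counts should add up to the record count in the snapshot. The snippet below uses only the numbers listed in this card. The helper name `totals_match` is illustrative.

```python
TOTAL_RECORDS = 15_731

PER_DATASET = {
    "fire": 1_600,
    "burned_area": 1_299,
    "anomaly_detection": 9_216,
    "worldfloods": 2_353,
    "lc": 63,
    "roads": 1_200,
}

STATUS_COUNTS = {
    "positive": 14_251,
    "explicit_negative": 949,
    "below_threshold": 531,
}


def totals_match(groups, expected_total):
    """Return True when the group counts sum to the expected record count."""
    return sum(groups.values()) == expected_total
```

Both `totals_match(PER_DATASET, TOTAL_RECORDS)` and `totals_match(STATUS_COUNTS, TOTAL_RECORDS)` hold for the figures above.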
### Most frequent labels
- `water`: `7,570`
- `land`: `4,789`
- `cloud`: `3,123`
- `turbid_water`: `1,623`
- `road_present`: `1,031`
- `burned_area`: `688`
## Analysis plots
The dataset graphs are stored in `plots/`:
- `coverage_distributions.png`
- `dataset_label_heatmap.png`
- `dataset_split_counts.png`
- `label_cooccurrence.png`
- `label_frequency.png`
- `status_and_cardinality.png`
- `plot_summary.json`
## Notes
- Arrays are stored as `.npy` materializations.
- The dataset was built from the current local `phi2FM` downstream sources and then uploaded to this Hugging Face dataset repository.
- `burned_area` was rebuilt from the original OEOBench `256x256` source scenes. The raw routerset artifact now stores one native `7x256x256` burned-area record per selected source scene instead of eight derived `64x128` subpatches.
- `roads` raw rows are rebuilt as `256x256x10` `uint16` 2x2 mosaics from the published `500_shot_roads` archive. They keep native `road_present` coverage and add heuristic `cloud` / `land` / `water` weak labels under `label_source = native+heuristic_weak`.
- `anomaly_detection` raw rows are rebuilt as deterministic `8x256x256` tiles from the original source zarr, with aligned edge tiles when needed, instead of `4096x4096` full-scene rows.
- The local rebuild helpers are `scripts/rebuild_routerset_burned_area_from_source.py` and `scripts/rebuild_routerset_roads_anomaly_from_source.py`. They back up the existing metadata snapshot, rewrite the raw `.npy` files and manifests from source, refresh summary/audit metadata, and can publish the corrected snapshot to Hugging Face.
- A corrected canonical `8x256x256` export for local training and audit can be generated with `make routerset-materialize`, which writes to `outputs/routerset/materialized_256/` by default. Raw `roads`/`lc` records are mapped to the student Sentinel-2 layout and scaled by `1/10000`; float-domain student experts (`fire`, `burned_area`, `worldfloods`, `anomaly_detection`) are channel-adapted and then min-max normalized per image before padding. The canonical raw routerset snapshot now stores native `256x256` burned-area scenes, `256x256` anomaly tiles, and `256x256` roads mosaics.
- Add `--selected-only` to `scripts/materialize_routerset_dataset.py` when exporting only a subset of experts and you need a self-contained manifest without passthrough rows from the other tasks.
- A clean variant can be generated with `make routerset-materialize-clean`, which writes `manifest_256.jsonl`, `fault_rows_256.jsonl`, and `fault_report.json` under `outputs/routerset/materialized_256_clean/` by default.
- A full tile-by-tile audit can be generated with `make routerset-audit MATERIALIZED_DATASET_DIR=outputs/routerset/fix27March`, which writes `audit/tile_audit.jsonl`, `audit/audit_summary.json`, and RGB / false-RGB plot previews under that dataset root.
- A full file-by-file audit over the rebuilt raw routerset snapshot can be generated with `make routerset-raw-audit ROUTERSET_DIR=routerset`, which writes `summary.json`, `file_audit.jsonl`, `sample_rows.json`, and summary plots under `routerset/audit_raw/`.
- An executed raw-sample gallery notebook can be generated with `env PYTHONPATH=src .venv/bin/python scripts/generate_routerset_raw_audit_notebook.py --execute`, which writes `notebooks/routerset_raw_audit.ipynb`.
- The clean export only removes objective row-level artifact faults such as all-zero materialized tiles. Split-level problems, for example `fire` validation having no positive rows, stay reported as blockers instead of being rewritten silently.
- The dedicated notebook for the rebuilt `fix27March` artifact is `notebooks/routerset_fix27March_audit.ipynb`.
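
The value handling described for the canonical `8x256x256` export (scaling raw integer rows by `1/10000`, per-image min-max normalization for float-domain experts, then padding) can be sketched with NumPy. This is an illustration under stated assumptions, not the code in `scripts/materialize_routerset_dataset.py`: the function names are hypothetical, the channel adaptation step is omitted, and zero fill with top-left placement is an assumed padding rule.

```python
import numpy as np


def scale_reflectance(raw):
    """Scale raw integer values by 1/10000 and return float32."""
    return raw.astype(np.float32) / 10000.0


def minmax_per_image(img, eps=1e-12):
    """Min-max normalize using one global min and max for the whole image."""
    img = img.astype(np.float32)
    lo, hi = float(img.min()), float(img.max())
    if hi - lo < eps:
        # Constant image: avoid dividing by zero and return all zeros.
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


def pad_to_canonical(img, channels=8, size=256):
    """Zero-pad a (C, H, W) array to (channels, size, size).

    Assumption: the data sits in the top-left corner and the first channels.
    """
    c, h, w = img.shape
    if c > channels or h > size or w > size:
        raise ValueError("input is larger than the canonical shape")
    out = np.zeros((channels, size, size), dtype=img.dtype)
    out[:c, :h, :w] = img
    return out
```

For example, a native `7x256x256` burned-area array passed through `minmax_per_image` and then `pad_to_canonical` comes out as `8x256x256`, with values in `[0, 1]` and an all-zero eighth channel.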
Provider:
sirbastiano94



