NickWright/OmniCloudMask-Combined-Training-Dataset

Name: NickWright/OmniCloudMask-Combined-Training-Dataset
Creator: NickWright
Published: 2026-03-05 00:28:32
License: No description available

Source: Hugging Face. Updated 2026-03-05; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/NickWright/OmniCloudMask-Combined-Training-Dataset


Description:

---
license: cc-by-4.0
task_categories:
- image-segmentation
tags:
- remote-sensing
- semantic-segmentation
- sentinel-2
- cloud-detection
- earth-observation
- omnicloudmask
pretty_name: OmniCloudMask Combined Training Dataset
size_categories:
- 100K<n<1M
---

# OmniCloudMask Combined Training Dataset

A combined multi-source dataset for training cloud and cloud shadow segmentation models on Sentinel-2 satellite imagery.

The dataset contains **103,548 image-label pairs** (100,528 training + 1,070 validation + 1,950 test) drawn from 4 source datasets: CloudSEN12, Kappaset, OCM hard negative, and OCM scribble. CloudSEN12 is represented in several variants (different processing levels, super-resolution, and re-downloaded imagery) to improve model generalisation.

This dataset was used to train the v4 weights of OmniCloudMask.

| | |
|---|---|
| **Project** | [github.com/DPIRD-DMA/OmniCloudMask](https://github.com/DPIRD-DMA/OmniCloudMask) |
| **Paper** | [Training sensor-agnostic deep learning models for remote sensing (RSE, 2025)](https://doi.org/10.1016/j.rse.2025.114694) |
| **Documentation** | [omnicloudmask.readthedocs.io](https://omnicloudmask.readthedocs.io/en/latest/) |
| **Spatial Distribution** | [dpird-dma.github.io/OCM-training-data-map](https://dpird-dma.github.io/OCM-training-data-map/) |

## Sentinel-2 Bands

Each image contains 3 spectral bands stored as a 3-channel GeoTIFF:

| Channel | Sentinel-2 Band | Description | Native GSD |
|---------|-----------------|-------------|------------|
| 0 | B04 | Red | 10 m |
| 1 | B03 | Green | 10 m |
| 2 | B8A | NIR Narrow | 20 m (upsampled to 10 m) |

## Label Classes

4 semantic classes plus an ignore value:

| Value | Class | Description |
|-------|-------|-------------|
| 0 | Clear | No cloud or shadow |
| 1 | Thick Cloud | Opaque cloud |
| 2 | Thin Cloud | Semi-transparent cloud |
| 3 | Cloud Shadow | Shadow cast by clouds |
| 99 | No-data | Ignore during training |

## Storage Format

The dataset is stored as Parquet shards. Each row contains one image-label pair with the following columns:

| Column | Type | Description |
|--------|------|-------------|
| `subset` | string | Source sub-dataset name (e.g. `"CloudSEN12 high"`) |
| `processing_level` | string | `"L1C"`, `"L2A"`, or `""` |
| `image_filename` | string | Original filename for traceability |
| `label_filename` | string | Original label filename |
| `image` | binary | Raw GeoTIFF bytes (3-band, uint16, LZW compressed) |
| `label` | binary | Raw GeoTIFF bytes (1-band, uint8, LZW compressed) |

### GeoTIFF Details

- **Image dtype:** `uint16`, standard Sentinel-2 encoding (reflectance × 10,000, e.g. values in the 0–10,000+ range)
- **Label dtype:** `uint8`
- **Geolocation:** CRS and affine transform preserved inside each GeoTIFF (UTM projections, WGS84 datum). Exception: Kappaset images are not georeferenced.

**Kappaset note:** The original Kappaset NetCDF files store band values normalised by 65,535 (uint16 max). During conversion to GeoTIFF, values are multiplied by 65,535 to restore standard Sentinel-2 DN scale.

### Usage Example

```python
import io
import rasterio
from datasets import load_dataset

ds = load_dataset("NickWright/OmniCloudMask-Combined-Training-Dataset", split="train")
row = ds[0]

with rasterio.open(io.BytesIO(row["image"])) as src:
    image = src.read()  # shape: (3, H, W), dtype: uint16
    crs = src.crs  # e.g. EPSG:32719

with rasterio.open(io.BytesIO(row["label"])) as src:
    label = src.read(1)  # shape: (H, W), dtype: uint8
```

## Image Sizes

| Pixel Dimensions | Approximate Ground Coverage | Datasets |
|------------------|----------------------------|----------|
| 509 x 509 px | 5.09 x 5.09 km | CloudSEN12 high, scribble, validation, test, Planetary Computer, super res tiles, Kappaset, OCM hard negative, OCM scribble |
| 1018 x 1018 px | 5.09 x 5.09 km (5 m) | CloudSEN12 super res raw |
| 2000 x 2000 px | 20 x 20 km | CloudSEN12 2k |

**Why 509 instead of 512?** The CloudSEN12 dataset (the largest and highest-quality source in this collection) uses 509 x 509 px tiles. To maintain consistency, all other datasets adopt the same dimensions. Kappaset images (originally 512 x 512 px) are clipped to 509 x 509 px to match.

## Dataset Sources

### CloudSEN12 High (16,980 images: 8,490 L1C + 8,490 L2A)

High-quality dense pixel-wise labels from the CloudSEN12 dataset. Includes both L1C (top-of-atmosphere) and L2A (surface reflectance) processing levels for each scene, sharing the same label mask.

- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12) (`cloudsen12-l1c`, `cloudsen12-l2a`)
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated
- **Split used:** Train only

### CloudSEN12 Scribble (20,000 images: 10,000 L1C + 10,000 L2A)

Sparse scribble annotations covering all splits. Original 7 classes remapped to 4:

| Original | Remapped | Meaning |
|----------|----------|---------|
| 0 | 0 | Clear |
| 1, 2 | 1 | Thick Cloud |
| 3, 4 | 2 | Thin Cloud |
| 5, 6 | 3 | Cloud Shadow |
| 99 | 99 | No-data |

- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Size:** 509 x 509 px
- **Label type:** Sparse scribble annotations (most pixels are 99/no-data)
- **Splits used:** Train + Val + Test

### CloudSEN12 2k (1,694 images: 847 L1C + 847 L2A)

Larger tiles from CloudSEN12 with dense labels. Both L1C and L2A processing levels.

- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Size:** 2000 x 2000 px
- **Label type:** Dense, human-annotated
- **Splits used:** Train + Val + Test

### CloudSEN12 Planetary Computer (8,403 images, L2A only)

The same scenes as CloudSEN12 high, but the L2A imagery was re-downloaded from Microsoft Planetary Computer. Labels are identical to CloudSEN12 high. This provides imagery processed through a different atmospheric correction pipeline, improving model generalisation.

- **Source:** [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/) STAC API, `sentinel-2-l2a` collection
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated (same labels as CloudSEN12 high)
- **Processing level:** L2A only
- **Note:** ~87 scenes could not be matched on Planetary Computer and were skipped

### CloudSEN12 Super Resolution Tiles (33,960 images, L1C only)

Derived from CloudSEN12 high L1C train images using a 2x ESRGAN super-resolution model. Each 509x509 source image is upscaled to 1018x1018 px (~5 m effective resolution), then split into a 2x2 grid of 509x509 tiles. Labels are pixel-repeated to match. Colour statistics (mean, std) are transferred from the original image back to the super-resolved output to preserve radiometric consistency.

- **Super-resolution model:** [`Phips/2xNomosUni_esrgan_multijpg`](https://huggingface.co/Phips/2xNomosUni_esrgan_multijpg)
- **Size:** 509 x 509 px (4 tiles per source image)
- **Label type:** Dense (pixel-repeated from original)
- **Processing level:** L1C only

### CloudSEN12 Super Resolution Raw (8,490 images, L1C only)

Same super-resolution pipeline as above, but stored as full 1018x1018 px images (not tiled).

- **Size:** 1018 x 1018 px
- **Label type:** Dense (pixel-repeated from original)
- **Processing level:** L1C only

### Kappaset (9,250 images, L1C only)

An independent cloud labelling dataset converted from NetCDF to GeoTIFF. Original 6 classes remapped to 4:

| Original | Remapped | Meaning |
|----------|----------|---------|
| 0 | 99 | No-data |
| 1 | 0 | Clear |
| 2 | 3 | Cloud Shadow |
| 3 | 2 | Thin Cloud |
| 4 | 1 | Thick Cloud |
| 5 | 99 | No-data |

- **Source:** [Zenodo record 7100327](https://zenodo.org/records/7100327)
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated
- **Processing level:** L1C only

### OCM hard negative (920 images, L2A only)

Cloud-free scenes that the model previously misclassified as cloudy. All labels are entirely class 0 (clear). These scenes were specifically curated to include cloud-like surfaces (snow, sand, haze, bright surfaces).

- **Source:** [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/) (`sentinel-2-l2a`), custom curated
- **Size:** 509 x 509 px
- **Label type:** All-zero masks (every pixel = clear)
- **Processing level:** L2A
- **Scene dates:** 2018–2024, global coverage

### OCM scribble (831 images, L2A only)

Custom scribble-annotated scenes downloaded from Planetary Computer, targeting scenarios underrepresented in CloudSEN12 and Kappaset.

- **Source:** [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/) (`sentinel-2-l2a`), custom curated
- **Size:** 509 x 509 px
- **Label type:** Sparse scribble annotations
- **Processing level:** L2A

### CloudSEN12 Validation (1,070 images: 535 L1C + 535 L2A)

Held-out validation set with dense labels. Used only for evaluation, never for training.

- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated
- **Processing levels:** L1C and L2A

### CloudSEN12 Test (1,950 images: 975 L1C + 975 L2A)

Held-out test set with dense labels. Used only for final evaluation, never for training or validation.

- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated
- **Processing levels:** L1C and L2A

## Image Count Summary

| Dataset | Images | L1C | L2A | Role |
|---------|-------:|----:|----:|------|
| CloudSEN12 high | 16,980 | 8,490 | 8,490 | Train |
| CloudSEN12 scribble | 20,000 | 10,000 | 10,000 | Train |
| CloudSEN12 2k | 1,694 | 847 | 847 | Train |
| CloudSEN12 high planetary computer | 8,403 | - | 8,403 | Train |
| CloudSEN12 high super res tiles | 33,960 | 33,960 | - | Train |
| CloudSEN12 high super res raw | 8,490 | 8,490 | - | Train |
| Kappaset | 9,250 | 9,250 | - | Train |
| OCM Hard negative | 920 | - | 920 | Train |
| OCM scribble | 831 | - | 831 | Train |
| CloudSEN12 validation | 1,070 | 535 | 535 | Val |
| CloudSEN12 test | 1,950 | 975 | 975 | Test |
| **Total** | **103,548** | **72,547** | **31,001** | |

## Dataset Weights

Each sub-dataset is assigned a loss weight during training to reflect label quality and reliability:

| Dataset | Weight |
|---------|-------:|
| CloudSEN12 high | 1.0 |
| CloudSEN12 scribble | 1.0 |
| CloudSEN12 2k | 0.8 |
| CloudSEN12 high super res tiles | 1.1 |
| CloudSEN12 high super res raw | 1.0 |
| CloudSEN12 high planetary computer | 1.0 |
| Kappaset | 0.2 |
| OCM Hard negative | 0.7 |
| OCM scribble | 1.1 |

## Citations

If you use this dataset, please cite the original sources:

- **CloudSEN12:** Aybar, C., et al. "CloudSEN12, a global dataset for semantic understanding of cloud and cloud shadow in Sentinel-2." *Sci Data*, 2022. [Paper](https://doi.org/10.1038/s41597-022-01878-2) | [Project page](https://cloudsen12.github.io/) | [Dataset](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Kappaset:** Domnich, M., et al. [Paper](https://doi.org/10.3390/rs13204100) | [Dataset](https://zenodo.org/records/7100327).
- **OCM hard negative & OCM scribble:** Custom datasets created for this work.
- **Sentinel-2 imagery:** Copernicus Sentinel data, processed by ESA and Microsoft Planetary Computer.

## License

Please refer to the individual source dataset licenses:

- CloudSEN12: [Creative Commons Zero v1.0 Universal](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- Kappaset: See [Creative Commons Attribution 4.0 International](https://zenodo.org/records/7100327)
- OCM hard negative & OCM scribble: [Creative Commons Zero v1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/)
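As a small illustration of the encoding described under GeoTIFF Details and Label Classes, the sketch below converts DN values to reflectance and measures how much of a label is actually annotated. The helper names `to_reflectance` and `labelled_fraction` are hypothetical, and the plain division by 10,000 assumes that no additional radiometric offset is present in the stored values, which this card does not state.

```python
import numpy as np

IGNORE_VALUE = 99  # no-data label value listed under Label Classes


def to_reflectance(image_dn):
    """Convert uint16 DN values to float reflectance (DN / 10,000).

    Assumption: no additional radiometric offset needs to be removed.
    """
    return image_dn.astype(np.float32) / 10_000.0


def labelled_fraction(label, ignore_value=IGNORE_VALUE):
    """Return the share of pixels that carry a real class (not no-data)."""
    return float((label != ignore_value).mean())


# Synthetic stand-ins for the `image` and `label` arrays read with rasterio.
image = np.full((3, 4, 4), 2500, dtype=np.uint16)
label = np.full((4, 4), IGNORE_VALUE, dtype=np.uint8)
label[0, :] = 1  # one scribbled row of thick cloud

reflectance = to_reflectance(image)
print(reflectance.max(), labelled_fraction(label))
```

For scribble subsets, where most pixels are 99, the same `label != 99` mask can be used to exclude unlabelled pixels from a loss or a metric.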
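The class remapping tables given for CloudSEN12 Scribble and Kappaset can be expressed as a lookup table. The labels in this dataset are already remapped, so the sketch below only illustrates the mapping; it is not the conversion code used to build the dataset. The function names are hypothetical, and sending any unlisted value to 99 is an assumption.

```python
import numpy as np

IGNORE_VALUE = 99

# Original value -> remapped value, copied from the tables in this card.
KAPPASET_MAP = {0: 99, 1: 0, 2: 3, 3: 2, 4: 1, 5: 99}
CLOUDSEN12_SCRIBBLE_MAP = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 99: 99}


def build_lut(mapping, default=IGNORE_VALUE):
    """Build a 256-entry lookup table; unlisted values fall back to `default`."""
    lut = np.full(256, default, dtype=np.uint8)
    for original, remapped in mapping.items():
        lut[original] = remapped
    return lut


def remap_labels(label, mapping):
    """Remap a uint8 label array by indexing into the lookup table."""
    return build_lut(mapping)[label]


raw_kappaset = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.uint8)
raw_scribble = np.array([0, 1, 2, 3, 4, 5, 6, 99], dtype=np.uint8)

kappaset_remapped = remap_labels(raw_kappaset, KAPPASET_MAP)
scribble_remapped = remap_labels(raw_scribble, CLOUDSEN12_SCRIBBLE_MAP)
```

Indexing a lookup table with the label array remaps every pixel in one vectorised step and avoids the ordering problems of chained in-place replacements (for example, Kappaset swaps the roles of values 1 to 4).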
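The label handling and statistics transfer described for the super-resolution subsets can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the ESRGAN step itself is omitted, the function names are hypothetical, and matching mean and standard deviation per band is one plausible reading of "colour statistics (mean, std) are transferred".

```python
import numpy as np


def upscale_label(label, factor=2):
    """Pixel-repeat a (H, W) label so it matches a `factor`x upscaled image."""
    return np.repeat(np.repeat(label, factor, axis=0), factor, axis=1)


def split_grid(array, tile):
    """Split the last two axes into non-overlapping tile x tile squares (row-major)."""
    height, width = array.shape[-2:]
    return [
        array[..., row:row + tile, col:col + tile]
        for row in range(0, height, tile)
        for col in range(0, width, tile)
    ]


def match_mean_std(upscaled, reference, eps=1e-6):
    """Shift and scale each band of `upscaled` to the mean/std of `reference`."""
    axes = (-2, -1)
    up = upscaled.astype(np.float64)
    ref = reference.astype(np.float64)
    up_mean = up.mean(axis=axes, keepdims=True)
    up_std = up.std(axis=axes, keepdims=True)
    ref_mean = ref.mean(axis=axes, keepdims=True)
    ref_std = ref.std(axis=axes, keepdims=True)
    return (up - up_mean) / (up_std + eps) * ref_std + ref_mean


# A 509 x 509 label becomes 1018 x 1018, then four 509 x 509 tiles.
label = np.zeros((509, 509), dtype=np.uint8)
label[:, 255:] = 1
label_2x = upscale_label(label)
label_tiles = split_grid(label_2x, 509)

# Synthetic imagery standing in for an original and its upscaled version.
rng = np.random.default_rng(0)
original = rng.integers(0, 10_000, size=(3, 8, 8))
upscaled = rng.integers(0, 10_000, size=(3, 16, 16))
matched = match_mean_std(upscaled, original)
```

Pixel repetition keeps labels as valid class values, whereas interpolating them would create meaningless intermediate values at class boundaries.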
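One way the per-dataset weights from the Dataset Weights table could enter a training loop is sketched below. The card does not say how the weights are applied, so scaling each sample's loss by its sub-dataset weight and averaging over the batch is an assumption, as is the function name `weighted_batch_loss`. The dictionary keys follow the table's spelling and are assumed to match the values of the `subset` column.

```python
# Loss weights copied from the Dataset Weights table.
DATASET_WEIGHTS = {
    "CloudSEN12 high": 1.0,
    "CloudSEN12 scribble": 1.0,
    "CloudSEN12 2k": 0.8,
    "CloudSEN12 high super res tiles": 1.1,
    "CloudSEN12 high super res raw": 1.0,
    "CloudSEN12 high planetary computer": 1.0,
    "Kappaset": 0.2,
    "OCM Hard negative": 0.7,
    "OCM scribble": 1.1,
}


def weighted_batch_loss(per_sample_losses, subsets, weights=DATASET_WEIGHTS):
    """Scale each sample's loss by its sub-dataset weight, then average.

    Assumption: weights act as per-sample multipliers on the loss.
    """
    if len(per_sample_losses) != len(subsets):
        raise ValueError("need one subset name per loss value")
    scaled = [weights[name] * loss for loss, name in zip(per_sample_losses, subsets)]
    return sum(scaled) / len(scaled)


# Two samples with equal raw loss: the Kappaset sample contributes less.
batch_loss = weighted_batch_loss([1.0, 1.0], ["CloudSEN12 high", "Kappaset"])
print(batch_loss)
```

An unknown subset name raises `KeyError` here, which surfaces any mismatch between the table spelling and the stored `subset` strings early.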

Provider:

NickWright
