NickWright/OmniCloudMask-Combined-Training-Dataset
Source: Hugging Face. Updated 2026-03-05; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/NickWright/OmniCloudMask-Combined-Training-Dataset
Resource description:
---
license: cc-by-4.0
task_categories:
- image-segmentation
tags:
- remote-sensing
- semantic-segmentation
- sentinel-2
- cloud-detection
- earth-observation
- omnicloudmask
pretty_name: OmniCloudMask Combined Training Dataset
size_categories:
- 100K<n<1M
---
# OmniCloudMask Combined Training Dataset
A combined multi-source dataset for training cloud and cloud shadow segmentation models on Sentinel-2 satellite imagery. The dataset contains **103,548 image-label pairs** (100,528 training + 1,070 validation + 1,950 test) drawn from 4 source datasets: CloudSEN12, Kappaset, OCM hard negative, and OCM scribble. CloudSEN12 is represented in several variants (different processing levels, super-resolution, and re-downloaded imagery) to improve model generalisation.
This dataset was used to train the v4 weights of OmniCloudMask.
| | |
|---|---|
| **Project** | [github.com/DPIRD-DMA/OmniCloudMask](https://github.com/DPIRD-DMA/OmniCloudMask) |
| **Paper** | [Training sensor-agnostic deep learning models for remote sensing (RSE, 2025)](https://doi.org/10.1016/j.rse.2025.114694) |
| **Documentation** | [omnicloudmask.readthedocs.io](https://omnicloudmask.readthedocs.io/en/latest/) |
| **Spatial Distribution** | [dpird-dma.github.io/OCM-training-data-map](https://dpird-dma.github.io/OCM-training-data-map/) |
## Sentinel-2 Bands
Each image contains 3 spectral bands stored as a 3-channel GeoTIFF:
| Channel | Sentinel-2 Band | Description | Native GSD |
|---------|-----------------|-------------|------------|
| 0 | B04 | Red | 10 m |
| 1 | B03 | Green | 10 m |
| 2 | B8A | NIR Narrow | 20 m (upsampled to 10 m) |
## Label Classes
4 semantic classes plus an ignore value:
| Value | Class | Description |
|-------|-------|-------------|
| 0 | Clear | No cloud or shadow |
| 1 | Thick Cloud | Opaque cloud |
| 2 | Thin Cloud | Semi-transparent cloud |
| 3 | Cloud Shadow | Shadow cast by clouds |
| 99 | No-data | Ignore during training |
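
Because 99 marks pixels to ignore, any per-class statistic should be computed over labelled pixels only. The following is a minimal sketch of that idea; the function name `class_fractions` and the constant names are illustrative and not part of the dataset or the OmniCloudMask library.

```python
import numpy as np

IGNORE_VALUE = 99  # no-data value from the class table
CLASS_NAMES = {0: "Clear", 1: "Thick Cloud", 2: "Thin Cloud", 3: "Cloud Shadow"}


def class_fractions(label: np.ndarray) -> dict:
    """Return the share of each class among labelled (non-99) pixels."""
    valid = label != IGNORE_VALUE
    n_valid = int(valid.sum())
    if n_valid == 0:
        # Nothing is labelled, so every class share is reported as zero.
        return {name: 0.0 for name in CLASS_NAMES.values()}
    labelled = label[valid]
    return {
        name: float((labelled == value).sum()) / n_valid
        for value, name in CLASS_NAMES.items()
    }


# Tiny hand-made label raster: four labelled pixels and two no-data pixels.
demo = np.array([[0, 1, 99], [3, 0, 99]], dtype=np.uint8)
fractions = class_fractions(demo)
```

The same mask (`label != 99`) is what a training loop would typically use to exclude unlabelled pixels from the loss.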
## Storage Format
The dataset is stored as Parquet shards. Each row contains one image-label pair with the following columns:
| Column | Type | Description |
|--------|------|-------------|
| `subset` | string | Source sub-dataset name (e.g. `"CloudSEN12 high"`) |
| `processing_level` | string | `"L1C"`, `"L2A"`, or `""` |
| `image_filename` | string | Original filename for traceability |
| `label_filename` | string | Original label filename |
| `image` | binary | Raw GeoTIFF bytes (3-band, uint16, LZW compressed) |
| `label` | binary | Raw GeoTIFF bytes (1-band, uint8, LZW compressed) |
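
Since `subset` and `processing_level` are plain string columns, rows can be tallied or filtered without opening any GeoTIFF. The sketch below works on any iterable of dict-like rows; `count_by_subset` is a hypothetical helper, and the demo rows are stand-ins that reuse the one subset name shown in the table.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_subset(rows: Iterable[Mapping]) -> Counter:
    """Tally rows per (subset, processing_level) using only the string columns."""
    counts = Counter()
    for row in rows:
        counts[(row["subset"], row["processing_level"])] += 1
    return counts


# Stand-in rows; with the real dataset, pass the loaded split instead.
demo_rows = [
    {"subset": "CloudSEN12 high", "processing_level": "L1C"},
    {"subset": "CloudSEN12 high", "processing_level": "L2A"},
    {"subset": "CloudSEN12 high", "processing_level": "L1C"},
]
counts = count_by_subset(demo_rows)
```

On the full dataset it is probably worth restricting iteration to the two string columns first, so the image and label bytes are not read for every row.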
### GeoTIFF Details
- **Image dtype:** `uint16` — standard Sentinel-2 encoding (reflectance × 10,000, e.g. values in the 0–10,000+ range)
- **Label dtype:** `uint8`
- **Geolocation:** CRS and affine transform preserved inside each GeoTIFF (UTM projections, WGS84 datum). Exception: Kappaset images are not georeferenced.
**Kappaset note:** The original Kappaset NetCDF files store band values normalised by 65,535 (uint16 max). During conversion to GeoTIFF, values are multiplied by 65,535 to restore standard Sentinel-2 DN scale.
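
Given the reflectance × 10,000 encoding described above, digital numbers can be converted back to reflectance with a single division. This is a minimal sketch that assumes no additive radiometric offset needs to be removed; whether OmniCloudMask normalises its inputs this way is not stated here, and `dn_to_reflectance` is an illustrative name.

```python
import numpy as np

REFLECTANCE_SCALE = 10_000.0  # from the "reflectance x 10,000" encoding above


def dn_to_reflectance(image_dn: np.ndarray) -> np.ndarray:
    """Convert uint16 digital numbers to float32 reflectance (no offset applied)."""
    return image_dn.astype(np.float32) / REFLECTANCE_SCALE


# One band, 2 x 2 pixels; bright targets can exceed 10,000.
demo_dn = np.array([[[0, 2500], [10000, 12000]]], dtype=np.uint16)
demo_reflectance = dn_to_reflectance(demo_dn)
```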
### Usage Example
```python
import io

import rasterio
from datasets import load_dataset

ds = load_dataset("NickWright/OmniCloudMask-Combined-Training-Dataset", split="train")
row = ds[0]

# Decode the raw GeoTIFF bytes in memory
with rasterio.open(io.BytesIO(row["image"])) as src:
    image = src.read()  # shape: (3, H, W), dtype: uint16
    crs = src.crs  # e.g. EPSG:32719

with rasterio.open(io.BytesIO(row["label"])) as src:
    label = src.read(1)  # shape: (H, W), dtype: uint8
```
## Image Sizes
| Pixel Dimensions | Approximate Ground Coverage | Datasets |
|------------------|----------------------------|----------|
| 509 x 509 px | 5.09 x 5.09 km (10 m) | CloudSEN12 (high, scribble, validation, test, Planetary Computer), Kappaset, OCM hard negative, OCM scribble |
| 509 x 509 px | ~2.5 x 2.5 km (5 m) | CloudSEN12 super res tiles |
| 1018 x 1018 px | 5.09 x 5.09 km (5 m) | CloudSEN12 super res raw |
| 2000 x 2000 px | 20 x 20 km | CloudSEN12 2k |
**Why 509 instead of 512?** The CloudSEN12 dataset — the largest and highest-quality source in this collection — uses 509 x 509 px tiles. To maintain consistency, all other datasets adopt the same dimensions. Kappaset images (originally 512 x 512 px) are clipped to 509 x 509 px to match.
## Dataset Sources
### CloudSEN12 High (16,980 images — 8,490 L1C + 8,490 L2A)
High-quality dense pixel-wise labels from the CloudSEN12 dataset. Includes both L1C (top-of-atmosphere) and L2A (surface reflectance) processing levels for each scene, sharing the same label mask.
- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12) (`cloudsen12-l1c`, `cloudsen12-l2a`)
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated
- **Split used:** Train only
### CloudSEN12 Scribble (20,000 images — 10,000 L1C + 10,000 L2A)
Sparse scribble annotations covering all splits. Original 7 classes remapped to 4:
| Original | Remapped | Meaning |
|----------|----------|---------|
| 0 | 0 | Clear |
| 1, 2 | 1 | Thick Cloud |
| 3, 4 | 2 | Thin Cloud |
| 5, 6 | 3 | Cloud Shadow |
| 99 | 99 | No-data |
- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Size:** 509 x 509 px
- **Label type:** Sparse scribble annotations (most pixels are 99/no-data)
- **Splits used:** Train + Val + Test
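
The remapping table above can be applied with a 256-entry lookup table, which is a common way to remap `uint8` rasters in one vectorised step. This is a sketch rather than the conversion code actually used; `build_lut` and `remap_labels` are hypothetical names, and sending unmapped values to 99 is an assumption.

```python
import numpy as np

# Original CloudSEN12 scribble value -> combined-dataset class, per the table above.
SCRIBBLE_REMAP = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 99: 99}


def build_lut(mapping: dict, fill: int = 99) -> np.ndarray:
    """Build a 256-entry lookup table; unmapped values fall back to `fill` (assumption)."""
    lut = np.full(256, fill, dtype=np.uint8)
    for original, remapped in mapping.items():
        lut[original] = remapped
    return lut


def remap_labels(label: np.ndarray, mapping: dict) -> np.ndarray:
    """Remap a uint8 label raster with a single table lookup."""
    return build_lut(mapping)[label]


demo = np.array([[0, 1, 2, 3], [4, 5, 6, 99]], dtype=np.uint8)
remapped = remap_labels(demo, SCRIBBLE_REMAP)
```

The same helper accepts any other value mapping expressed as a dict, such as the Kappaset table later in this card.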
### CloudSEN12 2k (1,694 images — 847 L1C + 847 L2A)
Larger tiles from CloudSEN12 with dense labels. Both L1C and L2A processing levels.
- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Size:** 2000 x 2000 px
- **Label type:** Dense, human-annotated
- **Splits used:** Train + Val + Test
### CloudSEN12 Planetary Computer (8,403 images — L2A only)
The same scenes as CloudSEN12 high, but the L2A imagery was re-downloaded from Microsoft Planetary Computer. Labels are identical to CloudSEN12 high. This provides imagery processed through a different atmospheric correction pipeline, improving model generalisation.
- **Source:** [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/) STAC API, `sentinel-2-l2a` collection
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated (same labels as CloudSEN12 high)
- **Processing level:** L2A only
- **Note:** 87 scenes (8,490 − 8,403) could not be matched on Planetary Computer and were skipped
### CloudSEN12 Super Resolution Tiles (33,960 images — L1C only)
Derived from CloudSEN12 high L1C train images using a 2x ESRGAN super-resolution model. Each 509x509 source image is upscaled to 1018x1018 px (~5 m effective resolution), then split into a 2x2 grid of 509x509 tiles. Labels are pixel-repeated to match.
Colour statistics (mean, std) are transferred from the original image back to the super-resolved output to preserve radiometric consistency.
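
One plausible reading of the mean and std transfer is a per-band standardise-then-rescale step. The sketch below assumes the statistics are computed per band over the whole image, which the text does not confirm; `transfer_band_statistics` is an illustrative name, and random arrays stand in for real imagery.

```python
import numpy as np


def transfer_band_statistics(
    sr: np.ndarray, reference: np.ndarray, eps: float = 1e-6
) -> np.ndarray:
    """Rescale each band of `sr` so its mean and std match the same band of `reference`.

    Both arrays are (bands, H, W); their spatial sizes may differ.
    """
    sr = sr.astype(np.float64)
    reference = reference.astype(np.float64)
    sr_mean = sr.mean(axis=(1, 2), keepdims=True)
    sr_std = sr.std(axis=(1, 2), keepdims=True)
    ref_mean = reference.mean(axis=(1, 2), keepdims=True)
    ref_std = reference.std(axis=(1, 2), keepdims=True)
    # Standardise the super-resolved band, then apply the reference statistics.
    return (sr - sr_mean) / (sr_std + eps) * ref_std + ref_mean


rng = np.random.default_rng(0)
original = rng.uniform(500, 3000, size=(3, 8, 8))       # stand-in for a source image
super_resolved = rng.uniform(0, 1, size=(3, 16, 16))    # stand-in for the 2x output
matched = transfer_band_statistics(super_resolved, original)
```

Before writing back to `uint16`, the result would still need rounding and clipping to the valid range.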
- **Super-resolution model:** [`Phips/2xNomosUni_esrgan_multijpg`](https://huggingface.co/Phips/2xNomosUni_esrgan_multijpg)
- **Size:** 509 x 509 px (4 tiles per source image)
- **Label type:** Dense (pixel-repeated from original)
- **Processing level:** L1C only
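
The label handling and tiling described in this section can be sketched with plain NumPy. The super-resolution model itself is not reproduced here: a zero array stands in for its 1018 x 1018 output, and the function names are illustrative.

```python
import numpy as np


def upsample_label_2x(label: np.ndarray) -> np.ndarray:
    """Pixel-repeat a (H, W) label so each pixel becomes a 2 x 2 block."""
    return np.repeat(np.repeat(label, 2, axis=0), 2, axis=1)


def split_2x2(array: np.ndarray) -> list:
    """Split the last two axes into four equal tiles, in row-major order."""
    h, w = array.shape[-2] // 2, array.shape[-1] // 2
    return [
        array[..., r * h:(r + 1) * h, c * w:(c + 1) * w]
        for r in range(2)
        for c in range(2)
    ]


label = np.zeros((509, 509), dtype=np.uint8)
label[0, 0] = 1                                  # one thick-cloud pixel, top-left
label_2x = upsample_label_2x(label)              # (1018, 1018)
label_tiles = split_2x2(label_2x)                # four (509, 509) tiles

image_2x = np.zeros((3, 1018, 1018), dtype=np.uint16)  # stand-in for the ESRGAN output
image_tiles = split_2x2(image_2x)                # four (3, 509, 509) tiles
```

Four tiles per source image is consistent with the counts in this card: 8,490 L1C train images yield 33,960 tiles.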
### CloudSEN12 Super Resolution Raw (8,490 images — L1C only)
Same super-resolution pipeline as above, but stored as full 1018x1018 px images (not tiled).
- **Size:** 1018 x 1018 px
- **Label type:** Dense (pixel-repeated from original)
- **Processing level:** L1C only
### Kappaset (9,250 images — L1C only)
An independent cloud labelling dataset converted from NetCDF to GeoTIFF. Original 6 classes remapped to 4:
| Original | Remapped | Meaning |
|----------|----------|---------|
| 0 | 99 | No-data |
| 1 | 0 | Clear |
| 2 | 3 | Cloud Shadow |
| 3 | 2 | Thin Cloud |
| 4 | 1 | Thick Cloud |
| 5 | 99 | No-data |
- **Source:** [Zenodo record 7100327](https://zenodo.org/records/7100327)
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated
- **Processing level:** L1C only
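
Two Kappaset conversion steps mentioned elsewhere in this card, restoring the DN scale by multiplying by 65,535 and clipping 512 x 512 arrays to 509 x 509, might look like the sketch below. The card does not say which rows and columns are dropped, so keeping the top-left window is purely an assumption; `restore_dn` and `clip_to_target` are hypothetical names.

```python
import numpy as np

TARGET_SIZE = 509  # tile size shared by the other sources


def restore_dn(normalised: np.ndarray) -> np.ndarray:
    """Undo the division by 65,535 and return uint16 digital numbers."""
    scaled = np.rint(normalised.astype(np.float64) * 65535.0)
    return np.clip(scaled, 0, 65535).astype(np.uint16)


def clip_to_target(array: np.ndarray, size: int = TARGET_SIZE) -> np.ndarray:
    """Keep a size x size window of the last two axes (top-left corner: an assumption)."""
    return array[..., :size, :size]


# Stand-in for a normalised 3-band Kappaset array and its label raster.
normalised = np.full((3, 512, 512), 2000 / 65535)
image = clip_to_target(restore_dn(normalised))
label = clip_to_target(np.zeros((512, 512), dtype=np.uint8))
```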
### OCM hard negative (920 images — L2A only)
Cloud-free scenes that the model previously misclassified as cloudy. All labels are entirely class 0 (clear). These scenes were specifically curated to include cloud-like surfaces (snow, sand, haze, bright surfaces).
- **Source:** [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/) (`sentinel-2-l2a`), custom curated
- **Size:** 509 x 509 px
- **Label type:** All-zero masks (every pixel = clear)
- **Processing level:** L2A
- **Scene dates:** 2018–2024, global coverage
### OCM scribble (831 images — L2A only)
Custom scribble-annotated scenes downloaded from Planetary Computer, targeting scenarios underrepresented in CloudSEN12 and Kappaset.
- **Source:** [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/) (`sentinel-2-l2a`), custom curated
- **Size:** 509 x 509 px
- **Label type:** Sparse scribble annotations
- **Processing level:** L2A
### CloudSEN12 Validation (1,070 images — 535 L1C + 535 L2A)
Held-out validation set with dense labels. Used only for evaluation, never for training.
- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated
- **Processing levels:** L1C and L2A
### CloudSEN12 Test (1,950 images — 975 L1C + 975 L2A)
Held-out test set with dense labels. Used only for final evaluation, never for training or validation.
- **Source:** [TACO Foundation on HuggingFace](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Size:** 509 x 509 px
- **Label type:** Dense, human-annotated
- **Processing levels:** L1C and L2A
## Image Count Summary
| Dataset | Images | L1C | L2A | Role |
|---------|-------:|----:|----:|------|
| CloudSEN12 high | 16,980 | 8,490 | 8,490 | Train |
| CloudSEN12 scribble | 20,000 | 10,000 | 10,000 | Train |
| CloudSEN12 2k | 1,694 | 847 | 847 | Train |
| CloudSEN12 high planetary computer | 8,403 | — | 8,403 | Train |
| CloudSEN12 high super res tiles | 33,960 | 33,960 | — | Train |
| CloudSEN12 high super res raw | 8,490 | 8,490 | — | Train |
| Kappaset | 9,250 | 9,250 | — | Train |
| OCM Hard negative | 920 | — | 920 | Train |
| OCM scribble | 831 | — | 831 | Train |
| CloudSEN12 validation | 1,070 | 535 | 535 | Val |
| CloudSEN12 test | 1,950 | 975 | 975 | Test |
| **Total** | **103,548** | **72,547** | **31,001** | |
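
The totals in the table can be cross-checked with a few lines of arithmetic. The numbers below are copied from the table; the variable and function names are illustrative.

```python
# (images, L1C, L2A, role) copied from the Image Count Summary table.
COUNTS = {
    "CloudSEN12 high": (16_980, 8_490, 8_490, "Train"),
    "CloudSEN12 scribble": (20_000, 10_000, 10_000, "Train"),
    "CloudSEN12 2k": (1_694, 847, 847, "Train"),
    "CloudSEN12 high planetary computer": (8_403, 0, 8_403, "Train"),
    "CloudSEN12 high super res tiles": (33_960, 33_960, 0, "Train"),
    "CloudSEN12 high super res raw": (8_490, 8_490, 0, "Train"),
    "Kappaset": (9_250, 9_250, 0, "Train"),
    "OCM Hard negative": (920, 0, 920, "Train"),
    "OCM scribble": (831, 0, 831, "Train"),
    "CloudSEN12 validation": (1_070, 535, 535, "Val"),
    "CloudSEN12 test": (1_950, 975, 975, "Test"),
}


def totals_by_role(counts: dict) -> dict:
    """Sum the image counts per role (Train, Val, Test)."""
    totals = {}
    for images, _l1c, _l2a, role in counts.values():
        totals[role] = totals.get(role, 0) + images
    return totals


role_totals = totals_by_role(COUNTS)
total_images = sum(images for images, _, _, _ in COUNTS.values())
total_l1c = sum(l1c for _, l1c, _, _ in COUNTS.values())
total_l2a = sum(l2a for _, _, l2a, _ in COUNTS.values())

# Each row should split cleanly into its L1C and L2A parts.
rows_consistent = all(images == l1c + l2a for images, l1c, l2a, _ in COUNTS.values())
```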
## Dataset Weights
Each sub-dataset is assigned a loss weight during training to reflect label quality and reliability:
| Dataset | Weight |
|---------|-------:|
| CloudSEN12 high | 1.0 |
| CloudSEN12 scribble | 1.0 |
| CloudSEN12 2k | 0.8 |
| CloudSEN12 high super res tiles | 1.1 |
| CloudSEN12 high super res raw | 1.0 |
| CloudSEN12 high planetary computer | 1.0 |
| Kappaset | 0.2 |
| OCM Hard negative | 0.7 |
| OCM scribble | 1.1 |
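
How the weights enter the loss is not spelled out here. One simple interpretation is to multiply each sample's loss by the weight of its sub-dataset, after excluding no-data pixels. The NumPy sketch below illustrates that interpretation only; it is not the OmniCloudMask training code, `weighted_cross_entropy` is a hypothetical name, and it is an assumption that the `subset` column uses exactly the names from the table.

```python
import numpy as np

IGNORE_VALUE = 99  # no-data label value, excluded from the loss

# Weights copied from the table above; the key strings are an assumption.
DATASET_WEIGHTS = {
    "CloudSEN12 high": 1.0,
    "CloudSEN12 scribble": 1.0,
    "CloudSEN12 2k": 0.8,
    "CloudSEN12 high super res tiles": 1.1,
    "CloudSEN12 high super res raw": 1.0,
    "CloudSEN12 high planetary computer": 1.0,
    "Kappaset": 0.2,
    "OCM Hard negative": 0.7,
    "OCM scribble": 1.1,
}


def weighted_cross_entropy(probs: np.ndarray, label: np.ndarray, subset: str) -> float:
    """Mean cross-entropy over labelled pixels, scaled by the sub-dataset weight.

    probs: (4, H, W) class probabilities; label: (H, W) uint8 class values.
    """
    valid = label != IGNORE_VALUE
    if not valid.any():
        return 0.0
    # Replace no-data pixels with a dummy class so indexing stays in range.
    safe_label = np.where(valid, label, 0).astype(np.int64)
    true_class_prob = np.take_along_axis(probs, safe_label[None], axis=0)[0]
    per_pixel = -np.log(np.clip(true_class_prob[valid], 1e-12, 1.0))
    return DATASET_WEIGHTS[subset] * float(per_pixel.mean())


probs = np.full((4, 2, 2), 0.25)                      # uniform prediction
label = np.array([[0, 99], [1, 2]], dtype=np.uint8)   # one no-data pixel
loss_high = weighted_cross_entropy(probs, label, "CloudSEN12 high")
loss_kappaset = weighted_cross_entropy(probs, label, "Kappaset")
```

With identical predictions, the Kappaset sample contributes one fifth of the loss of a CloudSEN12 high sample, which is the effect the weights are meant to have.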
## Citations
If you use this dataset, please cite the original sources:
- **CloudSEN12:** Aybar, C., et al. "CloudSEN12, a global dataset for semantic understanding of cloud and cloud shadow in Sentinel-2." *Sci Data*, 2022. [Paper](https://doi.org/10.1038/s41597-022-01878-2) | [Project page](https://cloudsen12.github.io/) | [Dataset](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- **Kappaset:** Domnich, M., et al. "KappaMask: AI-Based Cloudmask Processor for Sentinel-2." *Remote Sensing*, 2021. [Paper](https://doi.org/10.3390/rs13204100) | [Dataset](https://zenodo.org/records/7100327)
- **OCM hard negative & OCM scribble:** Custom datasets created for this work.
- **Sentinel-2 imagery:** Copernicus Sentinel data, processed by ESA and Microsoft Planetary Computer.
## License
Please refer to the individual source dataset licenses:
- CloudSEN12: [Creative Commons Zero v1.0 Universal](https://huggingface.co/datasets/tacofoundation/cloudsen12)
- Kappaset: [Creative Commons Attribution 4.0 International](https://zenodo.org/records/7100327)
- OCM hard negative & OCM scribble: [Creative Commons Zero v1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/)
Provided by: NickWright



