
falafel-hockey/sentinel2-lejepa-global-diverse-256

Hugging Face | Updated 2026-04-05 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/falafel-hockey/sentinel2-lejepa-global-diverse-256
Description:
---
language:
- en
license: cc-by-sa-4.0
pretty_name: Sentinel-2 LeJEPA Preset-Biased (Small)
tags:
- earth-observation
- sentinel-2
- self-supervised-learning
- satellite-imagery
- pretraining
- remote-sensing
size_categories:
- 1K<n<10K
task_categories:
- image-feature-extraction
---

# Sentinel-2 LeJEPA Preset-Biased (Small)

A small, preset-biased Sentinel-2 L2A chip dataset curated for self-supervised pretraining of a [LeJEPA](https://arxiv.org/abs/2511.08544) ResNet-18 encoder. It was built as a reproducibility artifact for the proof-of-concept foundation-model change-detection feature in [Sentinel Change Explorer](https://github.com/alexw0/sentinel-change-explorer).

**This is a proof of concept, not a general-purpose EO pretraining corpus.** It is intentionally tiny (thousands of chips) and biased toward the five demo AOIs that the Sentinel Change Explorer app highlights. Use it to reproduce that specific PoC, not as a substitute for SSL4EO-S12, Clay, or Prithvi.

## Dataset snapshot

| Field               | Value                                     |
|---------------------|-------------------------------------------|
| Build date          | 2026-04-04                                |
| Total chips         | 5000                                      |
| Preset chips (~70%) | 0                                         |
| Global chips (~30%) | 5000                                      |
| Train split         | 4500                                      |
| Validation split    | 500                                       |
| Chip size           | 128 x 128 px @ 10 m/px (1.28 km per side) |
| Bands               | red, green, blue, nir, swir16             |
| Dtype               | uint16 (raw L2A reflectance)              |

## Sampling methodology

Chips are drawn from two sources in roughly a 70/30 mix:

1. **Preset AOIs (~70%).** For each of the 5 demo presets in the Sentinel Change Explorer app, the builder expands the tight demo bbox into a 10 km square centered on the preset's centroid, searches STAC (Element84 Earth Search v1) for Sentinel-2 L2A scenes in both `before_range` and `after_range`, loads the 5 reflectance bands plus SCL via the same `src.sentinel.load_bands` function the app uses, and tile-crops the result into non-overlapping 128x128 chips.
2. **Global diversity points (~30%).** A hand-curated list of 30 globally diverse points (deserts, forests, croplands, urban cores, coasts, ice, wetlands) across every inhabited continent, each sampled at 2-3 dates spread across seasons. These use the same fetch-and-tile flow with a 5.12 km AOI.

### Rejection filters

Every candidate chip is tested against two filters and dropped if it fails either:

- **Cloud/shadow fraction > 25%**, computed from the Sentinel-2 Scene Classification Layer (SCL classes 3, 8, 9, 10: cloud shadow, medium-probability cloud, high-probability cloud, and thin cirrus).
- **Fill fraction > 10%**, defined as the share of pixels where all 5 reflectance bands equal zero (true no-data, not just a single dark band).

### Preset AOIs

| Preset | Center (lon, lat) | Before range | After range |
|--------|-------------------|--------------|-------------|
| Lahaina Wildfire, Maui | (-156.678, 20.877) | 2023-05-01 → 2023-07-31 | 2023-09-01 → 2023-11-30 |
| Pakistan Mega-Flood, Sindh | (67.750, 26.735) | 2022-05-01 → 2022-06-30 | 2022-08-20 → 2022-09-30 |
| Gigafactory Berlin | (13.800, 52.400) | 2019-05-01 → 2019-07-31 | 2023-05-01 → 2023-07-31 |
| Black Summer Bushfires, Australia | (150.125, -33.485) | 2019-08-01 → 2019-10-31 | 2020-02-01 → 2020-04-30 |
| Egypt's New Capital | (31.820, 30.030) | 2018-01-01 → 2018-03-31 | 2023-10-01 → 2023-12-31 |

### Global diversity points

Coordinates are (lon, lat).

- `sahara_algeria`: (2.00, 25.00)
- `gobi_mongolia`: (104.00, 43.50)
- `atacama_chile`: (-69.30, -23.80)
- `namib_namibia`: (15.00, -23.50)
- `simpson_australia`: (137.50, -25.50)
- `amazon_brazil`: (-60.00, -3.50)
- `congo_drc`: (21.00, -1.00)
- `boreal_canada`: (-95.00, 54.00)
- `siberia_taiga`: (105.00, 62.00)
- `pnw_usa`: (-123.50, 47.50)
- `iowa_corn_belt`: (-93.50, 42.00)
- `pampas_argentina`: (-62.00, -35.00)
- `po_valley_italy`: (10.50, 45.00)
- `punjab_india`: (75.50, 30.70)
- `tokyo_japan`: (139.75, 35.70)
- `nyc_usa`: (-73.95, 40.75)
- `lagos_nigeria`: (3.40, 6.50)
- `sao_paulo_brazil`: (-46.63, -23.55)
- `cairo_egypt`: (31.25, 30.05)
- `shanghai_china`: (121.47, 31.23)
- `chesapeake_bay`: (-76.20, 38.50)
- `dutch_coast`: (4.50, 52.50)
- `normandy_france`: (-0.50, 49.30)
- `greenland_glacier`: (-49.70, 69.20)
- `alps_switzerland`: (8.00, 46.50)
- `andes_peru`: (-72.00, -13.50)
- `himalaya_nepal`: (86.50, 27.80)
- `everglades_usa`: (-80.80, 25.80)
- `pantanal_brazil`: (-56.00, -17.50)
- `okavango_botswana`: (22.80, -19.30)

## Schema

Each row is:

```
{
    "bands": Array3D(shape=(5, 128, 128), dtype=uint16),
    "bbox": Sequence(float32, length=4),          # (west, south, east, north) WGS84
    "acquisition_date": Value(string),            # ISO date of the source scene
    "scene_id": Value(string),                    # STAC item id
    "source": ClassLabel(names=["preset", "global"]),
    "preset_name": Value(string),                 # "" for global chips
}
```

## Normalization stats

Per-band mean and standard deviation computed over the **training split** (uint16 reflectance, before any scaling):

| Band   | Mean    | Std     |
|--------|---------|---------|
| red    | 1298.91 | 1192.39 |
| green  | 1086.62 | 908.00  |
| blue   | 830.22  | 846.53  |
| nir    | 2467.29 | 1264.88 |
| swir16 | 2357.63 | 1504.00 |

These are also shipped as `norm_stats.json` in the dataset bundle. The matching LeJEPA model repo embeds a copy so inference doesn't need to pull the dataset.

## Usage

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("falafel-hockey/sentinel2-lejepa-global-diverse-256")
print(ds)  # DatasetDict with "train" and "validation" splits

sample = ds["train"][0]
# Array features decode to nested Python lists by default, so convert first.
bands = np.asarray(sample["bands"], dtype=np.uint16)
print(bands.shape)       # (5, 128, 128)
print(sample["source"])  # 0 = preset, 1 = global
```

The companion pretrained LeJEPA ResNet-18 (5-band) is published separately and consumes these chips at native resolution without further resizing.

## Limitations

- **Tiny scale.** Thousands of chips, not millions. A real SSL corpus for remote sensing is 2-3 orders of magnitude larger. Expect the resulting features to overfit to the sampled AOIs and date windows.
- **Preset bias by design.** 70% of chips come from 5 specific locations chosen because they are the demo AOIs in the companion app. This is intentional for the PoC but makes the features a poor fit for general-purpose EO tasks.
- **Single sensor, single level.** Sentinel-2 L2A only. No Sentinel-1, no Landsat, no other modalities.
- **5 bands only.** B02, B03, B04, B08, B11. The red-edge, cirrus, and SWIR22 bands are intentionally excluded to keep the model compact for M1 inference.
- **No deduplication across dates.** Chips from the same AOI on different acquisition dates are all kept. This is a feature for temporal-invariance pretraining, but it means chips are not i.i.d.

## License and attribution

- Chips are released under **CC-BY-SA-4.0**, matching Copernicus Sentinel data's terms for derived products.
- **Contains modified Copernicus Sentinel data [2023-2026], ESA.** Source imagery: Sentinel-2 L2A via [Element84 Earth Search v1](https://registry.opendata.aws/sentinel-2-l2a-cogs/).

## Citation

```bibtex
@misc{sentinel2_lejepa_preset_biased_small,
  title  = {Sentinel-2 LeJEPA Preset-Biased (Small)},
  author = {Wheelis, Alex},
  year   = {2026},
  url    = {https://huggingface.co/datasets/falafel-hockey/sentinel2-lejepa-global-diverse-256}
}

@misc{balestriero2025lejepa,
  title         = {LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics},
  author        = {Balestriero, Randall and LeCun, Yann},
  year          = {2025},
  eprint        = {2511.08544},
  archivePrefix = {arXiv}
}
```
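
The tiling step, the two rejection filters, and the per-band normalization described in this card can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the builder's actual code: the function and constant names (`tile_chips`, `cloud_fraction`, `fill_fraction`, `keep_chip`, `normalize`) are hypothetical, the thresholds are read as strict "greater than" limits because the card says "> 25%" and "> 10%", and the mean/std values are copied from the normalization table in the card's band order (red, green, blue, nir, swir16).

```python
import numpy as np

# Values taken from the dataset card; names are hypothetical.
CLOUD_SHADOW_CLASSES = (3, 8, 9, 10)  # SCL: shadow, cloud (medium/high), thin cirrus
MAX_CLOUD_FRACTION = 0.25
MAX_FILL_FRACTION = 0.10
CHIP_SIZE = 128

# Training-split stats in band order: red, green, blue, nir, swir16.
BAND_MEAN = np.array([1298.91, 1086.62, 830.22, 2467.29, 2357.63], dtype=np.float32)
BAND_STD = np.array([1192.39, 908.00, 846.53, 1264.88, 1504.00], dtype=np.float32)


def tile_chips(array, size=CHIP_SIZE):
    """Yield non-overlapping (C, size, size) tiles; partial edge tiles are skipped."""
    _, height, width = array.shape
    for row in range(0, height - size + 1, size):
        for col in range(0, width - size + 1, size):
            yield array[:, row:row + size, col:col + size]


def cloud_fraction(scl):
    """Share of pixels whose SCL class is cloud, cirrus, or cloud shadow."""
    return float(np.isin(scl, CLOUD_SHADOW_CLASSES).mean())


def fill_fraction(bands):
    """Share of pixels where every reflectance band is zero (true no-data)."""
    return float((bands == 0).all(axis=0).mean())


def keep_chip(bands, scl):
    """Apply both rejection filters; a chip is dropped if it exceeds either limit."""
    return (cloud_fraction(scl) <= MAX_CLOUD_FRACTION
            and fill_fraction(bands) <= MAX_FILL_FRACTION)


def normalize(bands):
    """Standardize raw uint16 reflectance per band using the training-split stats."""
    scaled = bands.astype(np.float32)
    return (scaled - BAND_MEAN[:, None, None]) / BAND_STD[:, None, None]
```

Under this sketch a 5.12 km AOI at 10 m/px (512 x 512 px) tiles into 16 chips, and a 10 km AOI (1000 x 1000 px) into 49, with the leftover edge pixels discarded. Whether the real builder discards or pads the remainder is not stated in the card.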
Provider:
falafel-hockey