Major-TOM/COP-GEN-Benchmark

Name: Major-TOM/COP-GEN-Benchmark
Creator: Major-TOM
Published: 2026-04-19 14:03:33
License: no description provided

Source: Hugging Face. Updated 2026-04-19. Indexed 2026-02-07.

Download link:

https://hf-mirror.com/datasets/Major-TOM/COP-GEN-Benchmark


Description:

---
license: cc-by-4.0
task_categories:
- image-to-image
tags:
- earth-observation
- sentinel-2
- major-tom
- generative-benchmark
- remote-sensing
pretty_name: COP-GEN Benchmark
size_categories:
- 10K<n<100K
configs:
- config_name: real
  data_files:
  - split: train
    path: real/data/*.parquet
- config_name: copgen
  data_files:
  - split: train
    path: copgen/data/*.parquet
- config_name: terramind
  data_files:
  - split: train
    path: terramind/data/*.parquet
dataset_info:
  features:
  - name: grid_cell
    dtype: string
  - name: thumbnail
    dtype: image
  - name: sample_idx
    dtype: int32
  - name: sample_id
    dtype: string
  - name: date
    dtype: string
  - name: crs
    dtype: string
  - name: ul_x
    dtype: float64
  - name: ul_y
    dtype: float64
  - name: B01
    dtype: binary
  - name: B02
    dtype: binary
  - name: B03
    dtype: binary
  - name: B04
    dtype: binary
  - name: B05
    dtype: binary
  - name: B06
    dtype: binary
  - name: B07
    dtype: binary
  - name: B08
    dtype: binary
  - name: B8A
    dtype: binary
  - name: B09
    dtype: binary
  - name: B11
    dtype: binary
  - name: B12
    dtype: binary
---

# COP-GEN Benchmark

Evaluation dataset for the COP-GEN paper ([arXiv:2603.03239](https://arxiv.org/abs/2603.03239)). Enables stochastic-benchmark evaluation of generative EO models by comparing generated sample sets against real multi-temporal Sentinel-2 observations at 495 geographically diverse locations.

## Subsets

Three parallel subsets with **identical schema** and **shared georeferencing** (same Major TOM v2 1056x1056 grid):

| Subset      | Source                                     | Samples/cell |
|-------------|--------------------------------------------|--------------|
| `real`      | Sentinel-2 L2A (GEE, cloud-free)           | 16           |
| `copgen`    | COP-GEN outputs (subsampled from 33 seeds) | 16           |
| `terramind` | TerraMind outputs                          | 16           |

Each sample is a 1056x1056 tile with 12 Sentinel-2 bands (B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12) at their native resolution (10 / 20 / 60 m), stored as per-band `uint16` GeoTIFF byte blobs inside each parquet row.
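Because bands are stored at their native resolution, the decoded array size may differ from band to band. The following is a minimal sketch of the expected pixel dimensions, assuming the 1056-pixel tile size is quoted on the 10 m grid (consistent with the Quick start example, where B02 decodes to 1056x1056) and that the 20 m and 60 m bands cover the same ground extent. The band-to-resolution mapping is the standard Sentinel-2 one; `BAND_RESOLUTION_M` and `expected_shape` are illustrative names, not part of the dataset's tooling.

```python
# Standard Sentinel-2 ground sampling distance per band, in metres.
BAND_RESOLUTION_M = {
    "B02": 10, "B03": 10, "B04": 10, "B08": 10,
    "B05": 20, "B06": 20, "B07": 20, "B8A": 20, "B11": 20, "B12": 20,
    "B01": 60, "B09": 60,
}

TILE_SIZE_10M = 1056  # assumption: the tile size is quoted on the 10 m grid


def expected_shape(band: str, tile_size_10m: int = TILE_SIZE_10M) -> tuple[int, int]:
    """Return the (rows, cols) a band would have if it spans the same extent."""
    extent_m = tile_size_10m * 10  # ground extent of one tile side, in metres
    side = extent_m // BAND_RESOLUTION_M[band]
    return (side, side)


for band in ("B02", "B05", "B01"):
    print(band, expected_shape(band))
```

Comparing `src.shape` of a decoded band against these values is a quick way to check the assumption before stacking bands of different resolutions.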
## Quick start

```python
import io

import rasterio
from datasets import load_dataset

# Load each of the three configs
real = load_dataset("Major-TOM/COP-GEN-Benchmark", "real", split="train")
copgen = load_dataset("Major-TOM/COP-GEN-Benchmark", "copgen", split="train")
terramind = load_dataset("Major-TOM/COP-GEN-Benchmark", "terramind", split="train")

# Decode a single band from the first row
row = real[0]
with rasterio.open(io.BytesIO(row["B02"])) as src:
    b02 = src.read(1)  # (1056, 1056) uint16

print(row["grid_cell"], row["date"], b02.shape)
```

## Schema

All three subsets share the following columns:

| Column       | Type  | Notes                                       |
|--------------|-------|---------------------------------------------|
| `grid_cell`  | str   | Major TOM cell ID (e.g. `106D_246R`)        |
| `sample_idx` | int   | 0..15 within cell                           |
| `sample_id`  | str   | STAC product ID (real) or `seed_N` (models) |
| `date`       | str   | ISO date for real; empty for synthetic      |
| `crs`        | str   | UTM CRS (e.g. `EPSG:32734`)                 |
| `ul_x, ul_y` | float | v2 grid upper-left in CRS metres            |
| `B01..B12`   | bytes | per-band GeoTIFF blob, uint16, native res   |
| `thumbnail`  | bytes | 256x256 JPEG RGB composite                  |

## Reproducing the benchmark evaluation

COP-GEN and TerraMind outputs natively cover the centre 192x192 pixels (1.92 km) of each 1056 tile. To extract this evaluation footprint identically across all three subsets, use the provided `benchmark_footprint.py` utility:

```python
import io

import rasterio
# Requires the repository's metadata/ directory to be available locally and importable
from metadata.benchmark_footprint import crop_benchmark_footprint, load_grid

grid = load_grid("metadata/benchmark_grid.json")

row = real[0]  # `real` is the dataset loaded in the Quick start
with rasterio.open(io.BytesIO(row["B02"])) as src:
    window = crop_benchmark_footprint(src, row["grid_cell"], grid)
    # (1, 192, 192) uint16; same geographic footprint for all three subsets
```

See `metadata/benchmark_footprint.py` for full documentation of the cropping convention. The function handles CRS mismatches and is the exact method used to evaluate the results reported in the paper.
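For intuition, the centre footprint described above can be approximated in pixel space. This is only a sketch of the geometry, under two assumptions: that the footprint is exactly centred on the 10 m grid, and that its size scales with each band's native resolution. The provided `crop_benchmark_footprint` relies on the per-cell origins in `metadata/benchmark_grid.json` and remains the method to use when reproducing the paper's numbers. `centre_window` is a hypothetical helper and is not part of the repository.

```python
def centre_window(tile_size: int = 1056, crop_size: int = 192, resolution_m: int = 10):
    """Return (row_off, col_off, height, width) of the centred crop for one band.

    tile_size and crop_size are given in 10 m pixels; resolution_m is the
    band's native resolution (10, 20 or 60).
    """
    scale = resolution_m // 10  # how many 10 m pixels fit along one band pixel
    if tile_size % scale or crop_size % scale:
        raise ValueError("sizes must be divisible by the resolution factor")
    side = crop_size // scale  # crop side length in band pixels
    offset = (tile_size // scale - side) // 2  # equal margin on both sides
    return (offset, offset, side, side)


for res in (10, 20, 60):
    print(res, centre_window(resolution_m=res))
```

For a 10 m band this gives an offset of 432 pixels, so an array `b02` decoded as in the Quick start could be sliced as `b02[432:624, 432:624]` to obtain an approximate 192x192 centre crop.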
## Metadata

- `metadata/benchmark_grid.json`: cell origins for the 192x192 evaluation footprint (used by `crop_benchmark_footprint`)
- `metadata/cells.parquet`: per-cell summary (grid_cell, ul_x, ul_y, crs, mgrs_tile)
- `metadata/benchmark_footprint.py`: the crop utility

## Citation

```bibtex
@article{copgen2026,
  title={COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design},
  author={Espinosa, Miguel and Gmelich Meijling, Eva and Marsocci, Valerio and Crowley, Elliot J. and Czerkawski, Mikolaj},
  year={2026},
  journal={arXiv preprint arXiv:2603.03239},
  url={https://arxiv.org/abs/2603.03239},
}
```

## Licensing

- Sentinel-2 data: CC-BY 4.0 (Copernicus).
- COP-GEN outputs: released under CC-BY 4.0 by the authors.
- TerraMind outputs: please check the TerraMind licensing terms before redistribution.
