Core-S2L2A-249k
ModelScope community · Updated 2026-05-02 · Indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-249k
Resource description:
# Core-S2L2A-249k
**Core-S2L2A-249k** is a curated subset of the [Major-TOM](https://github.com/ESA-PhiLab/MajorTOM) Core-S2L2A dataset, containing 248,719 Sentinel-2 L2A image patches uniformly sampled across the globe. It serves as the **source imagery** for pre-computed embedding datasets used by the [EarthEmbeddingExplorer](https://modelscope.ai/studios/Major-TOM/EarthEmbeddingExplorer) cross-modal retrieval web application.
## Overview
| Property | Value |
|----------|-------|
| Source | Major-TOM Core-S2L2A |
| Sensor | Sentinel-2 MSI (Level-2A) |
| Number of patches | 248,719 |
| Patch size | 384 × 384 pixels |
| Spectral bands | 12 (B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12) |
| Spatial resolution | 10 m / 20 m / 60 m (depending on band) |
| Output format | GeoParquet |
| License | CC-BY-SA-4.0 |
## Sampling Strategy
This dataset was created by **uniform grid sampling** from the full Major-TOM Core-S2L2A archive:
1. A global grid is overlaid on the Major-TOM tile index.
2. **One in every nine grid cells** is sampled, and the central bounding box of each sampled cell is selected as the crop region.
3. Each crop is extracted at a fixed size of **384 × 384 pixels** to ensure consistency across all downstream embedding models.
This strategy guarantees spatial diversity while keeping the dataset size manageable for large-scale embedding generation and interactive web retrieval.
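
The sampling steps above can be sketched as follows. This is a minimal illustration, not the pipeline actually used to build the dataset: the source only states that one in nine cells is sampled, so keeping every third row and every third column is an assumption, as are the helper names `sample_one_in_nine` and `centre_crop_bbox` and the 1068 × 1068 source tile size used in the demo.

```python
import numpy as np

def sample_one_in_nine(rows, cols, stride=3):
    """Boolean mask keeping cells whose row and column indices are both
    multiples of `stride` (stride=3 keeps 1 cell in 9). Assumed pattern."""
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    return (rows % stride == 0) & (cols % stride == 0)

def centre_crop_bbox(width, height, size=384):
    """Return [x_min, y_min, x_max, y_max] of a centred size x size window."""
    if width < size or height < size:
        raise ValueError("tile is smaller than the requested crop")
    x_min = (width - size) // 2
    y_min = (height - size) // 2
    return [x_min, y_min, x_min + size, y_min + size]

# Toy 9 x 9 grid of (row, col) indices standing in for the tile index
rr, cc = np.meshgrid(np.arange(9), np.arange(9), indexing="ij")
mask = sample_one_in_nine(rr.ravel(), cc.ravel())
print(int(mask.sum()), "of", mask.size, "cells kept")

# Hypothetical 1068 x 1068 pixel source tile
print(centre_crop_bbox(1068, 1068))
```

The returned list uses the same `[x_min, y_min, x_max, y_max]` ordering as the `pixel_bbox` metadata column.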
## File Layout
```
.
├── metadata_249k.parquet        # Metadata and geospatial index (248,719 rows)
└── images_249k/
    ├── part_00001.parquet       # Image patches shard 1
    ├── part_00002.parquet       # Image patches shard 2
    └── ...
```
### Metadata Schema (`metadata_249k.parquet`)
| Column | Type | Description |
|--------|------|-------------|
| `product_id` | string | Original Sentinel-2 product identifier |
| `timestamp` | datetime | Acquisition timestamp |
| `grid_cell` | string | Major-TOM grid cell identifier |
| `grid_row_u` | int16 | Grid row index |
| `grid_col_r` | int16 | Grid column index |
| `geometry` | geometry | WGS-84 polygon (footprint) |
| `centre_lat` | float32 | Latitude of patch centre |
| `centre_lon` | float32 | Longitude of patch centre |
| `utm_footprint` | string | Original UTM footprint as WKT |
| `utm_crs` | string | UTM CRS (e.g. EPSG:32633) |
| `pixel_bbox` | list<int> | Pixel bounding box [x_min, y_min, x_max, y_max] |
| `parquet_url` | string | Path to the image shard containing this patch |
| `parquet_row` | int64 | Row index within the image shard |
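
As an illustration of how these columns fit together, the sketch below finds the patch whose centre is closest to a query coordinate and reads off its `parquet_url` and `parquet_row`. It runs on a small synthetic table rather than the real file; the function name `nearest_patch`, the toy coordinates, and the spherical-Earth haversine distance are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def nearest_patch(meta, lat, lon):
    """Return (row, distance_km) for the patch centre closest to (lat, lon)."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2 = np.radians(meta["centre_lat"].to_numpy(dtype="float64"))
    lon2 = np.radians(meta["centre_lon"].to_numpy(dtype="float64"))
    # Haversine great-circle distance on a sphere of radius 6371 km
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * 6371.0 * np.arcsin(np.sqrt(a))
    idx = int(np.argmin(dist_km))
    return meta.iloc[idx], float(dist_km[idx])

# Synthetic stand-in for metadata_249k.parquet (values are made up)
meta = pd.DataFrame({
    "grid_cell": ["A", "B", "C"],
    "centre_lat": np.array([48.85, 51.50, 40.71], dtype="float32"),
    "centre_lon": np.array([2.35, -0.12, -74.00], dtype="float32"),
    "parquet_url": ["images_249k/part_00001.parquet",
                    "images_249k/part_00001.parquet",
                    "images_249k/part_00002.parquet"],
    "parquet_row": np.array([0, 1, 0], dtype="int64"),
})

row, dist_km = nearest_patch(meta, lat=51.0, lon=0.0)
print(row["grid_cell"], row["parquet_url"], row["parquet_row"])
```

With the real metadata, the matching patch would then be read from the shard named in `parquet_url` at position `parquet_row`, for example with `pd.read_parquet(...).iloc[...]`.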
### Image Patch Schema (`images_249k/part_*.parquet`)
Each row contains a single 384 × 384 Sentinel-2 L2A patch stored as a 3-D array with shape `(384, 384, 12)` in uint16 format. The 12 bands follow the order:
`[B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12]`
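
Given that band order, an RGB preview can be assembled by picking B04, B03, and B02 from the last axis. The sketch below is one possible way to do this, shown on a synthetic array; the helper name `to_rgb` and the brightness stretch (`max_value=3000`) are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

# Band order as documented for the image shards
BANDS = ["B01", "B02", "B03", "B04", "B05", "B06",
         "B07", "B08", "B8A", "B09", "B11", "B12"]

def to_rgb(patch, max_value=3000.0):
    """Build an 8-bit RGB preview from a (H, W, 12) uint16 patch.
    `max_value` is a hypothetical display stretch, not a dataset constant."""
    idx = [BANDS.index(b) for b in ("B04", "B03", "B02")]  # red, green, blue
    rgb = patch[..., idx].astype(np.float32) / max_value
    return (np.clip(rgb, 0.0, 1.0) * 255).astype(np.uint8)

# Synthetic stand-in for one patch
rng = np.random.default_rng(0)
patch = rng.integers(0, 10000, size=(384, 384, 12), dtype=np.uint16)
rgb = to_rgb(patch)
print(rgb.shape, rgb.dtype)
```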
## Pre-computed Embedding Datasets
The following embedding datasets were computed from **Core-S2L2A-249k** using various foundation models. All embeddings share the same 248,719 samples and geospatial metadata, enabling fair cross-model comparison.
| Filename | Embedding Model | Crop Size | Model Input Size | Embedding Dim | Source |
|----------|-----------------|-----------|------------------|---------------|--------|
| `SigLIP_crop_384x384.parquet` | [SigLIP (ViT-SO400M-14-SigLIP-384)](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) | 384×384 | 384×384 | 1152 | [Major-TOM/Core-S2RGB-249k-SigLIP](https://modelscope.cn/datasets/Major-TOM/Core-S2RGB-249k-SigLIP) |
| `FarSLIP_crop_384x384.parquet` | [FarSLIP (ViT-B-16)](https://huggingface.co/ZhenShiL/FarSLIP) | 384×384 | 224×224 | 512 | [Major-TOM/Core-S2RGB-249k-FarSLIP](https://modelscope.cn/datasets/Major-TOM/Core-S2RGB-249k-FarSLIP) |
| `DINOv2_crop_384x384.parquet` | [DINOv2-large](https://huggingface.co/facebook/dinov2-large) | 384×384 | 224×224 | 1024 | [Major-TOM/Core-S2RGB-249k-DINOv2](https://modelscope.cn/datasets/Major-TOM/Core-S2RGB-249k-DINOv2) |
| `SatCLIP_crop_384x384.parquet` | [SatCLIP (ViT16-L40)](https://github.com/microsoft/satclip) | 384×384 | 224×224 | 256 | [Major-TOM/Core-S2L2A-249k-SatCLIP](https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-249k-SatCLIP) |
| `Clay_crop_384x384.parquet` | [Clay v1.5](https://github.com/Clay-foundation/model) | 384×384 | 384×384 | 1024 | [Major-TOM/Core-S2L2A-249k-Clay-v1_5](https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-249k-Clay-v1_5) |
| `OLMoEarth_Base_crop_384x384.parquet` | [OLMoEarth-Base](https://huggingface.co/allenai/OLMoEarth-Base-WS) | 384×384 | 128×128 | 768 | [Major-TOM/Core-S2L2A-249k-OlmoEarth](https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-249k-OlmoEarth) |
> **Note**: SigLIP, FarSLIP, and DINOv2 use RGB-only inputs (B04, B03, B02). SatCLIP, Clay, and OLMoEarth use multi-spectral inputs.
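
Because every embedding file covers the same samples, a retrieval query reduces to a nearest-neighbour search in the embedding space. The sketch below shows cosine-similarity ranking on random vectors; the column layout of the embedding files is not described here, so the arrays are synthetic, and cosine similarity is an assumed metric rather than a confirmed detail of EarthEmbeddingExplorer. `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np

def top_k_cosine(query, embeddings, k=5):
    """Return (indices, scores) of the k rows most similar to `query`."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                      # cosine similarity per row
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

# Synthetic embeddings with the 512-dim size listed for FarSLIP
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512)).astype(np.float32)

# Query: a slightly perturbed copy of row 42
query = embeddings[42] + 0.01 * rng.normal(size=512).astype(np.float32)

idx, scores = top_k_cosine(query, embeddings, k=3)
print(idx, scores)
```

The returned indices line up with rows of the metadata table only if the embedding file preserves the same row order, which should be checked against the actual files.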
## Usage
```python
import pandas as pd

# Load the metadata table (one row per patch)
meta = pd.read_parquet("metadata_249k.parquet")
print(len(meta), "patches")

# Load one shard of image patches and list its columns before indexing
images = pd.read_parquet("images_249k/part_00001.parquet")
print(images.columns.tolist())
print(images.iloc[0]["B04"].shape)  # (384, 384)
```
## Web Application
Explore these embeddings interactively with the [EarthEmbeddingExplorer](https://modelscope.ai/studios/Major-TOM/EarthEmbeddingExplorer) web application, which supports text, image, and geolocation queries.
## Citation
If you use this dataset, please cite both the EarthEmbeddingExplorer tutorial paper and the original Major-TOM paper:
```bibtex
@article{zheng2026earthembeddingexplorer,
title={EarthEmbeddingExplorer: A Web Application for Cross-Modal Retrieval of Global Satellite Images},
author={Zheng, Yijie and Wu, Weijie and Wu, Bingyue and Zhao, Long and Li, Guoqing and Czerkawski, Mikolaj and Klemmer, Konstantin},
journal={arXiv preprint arXiv:2603.29441},
year={2026},
note={ICLR 2026 Workshop ML4RS Tutorial Track (oral)}
}
```
```bibtex
@inproceedings{francis2024majortom,
title={Major TOM: Expandable Datasets for Earth Observation},
author={Francis, Alistair and Czerkawski, Mikolaj and others},
year={2024},
booktitle={IGARSS 2024},
eprint={2402.12095},
archivePrefix={arXiv}
}
```
## License
This dataset is released under the [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
Provider:
maas
Created:
2026-04-17



