
Core-S2L2A-249k

ModelScope community · Updated 2026-05-02 · Indexed 2026-05-03

Download link: https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-249k
# Core-S2L2A-249k

**Core-S2L2A-249k** is a curated subset of the [Major-TOM](https://github.com/ESA-PhiLab/MajorTOM) Core-S2L2A dataset, containing 248,719 Sentinel-2 L2A image patches uniformly sampled across the globe. It serves as the **source imagery** for pre-computed embedding datasets used by the [EarthEmbeddingExplorer](https://modelscope.ai/studios/Major-TOM/EarthEmbeddingExplorer) cross-modal retrieval web application.

## Overview

| Property | Value |
|----------|-------|
| Source | Major-TOM Core-S2L2A |
| Sensor | Sentinel-2 MSI (Level-2A) |
| Number of patches | 248,719 |
| Patch size | 384 × 384 pixels |
| Spectral bands | 12 (B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12) |
| Spatial resolution | 10 m / 20 m / 60 m (depending on band) |
| Output format | GeoParquet |
| License | CC-BY-SA-4.0 |

## Sampling Strategy

This dataset was created by **uniform grid sampling** from the full Major-TOM Core-S2L2A archive:

1. A global grid is overlaid on the Major-TOM tile index.
2. For each sampled grid cell (**1 in 9** cells), the central bounding box is selected as the crop region.
3. Each crop is extracted at a fixed size of **384 × 384 pixels** to ensure consistency across all downstream embedding models.

This strategy preserves spatial diversity while keeping the dataset size manageable for large-scale embedding generation and interactive web retrieval.

## File Layout

```
.
├── metadata_249k.parquet        # Metadata and geospatial index (248,719 rows)
└── images_249k/
    ├── part_00001.parquet       # Image patches shard 1
    ├── part_00002.parquet       # Image patches shard 2
    └── ...
```

### Metadata Schema (`metadata_249k.parquet`)

| Column | Type | Description |
|--------|------|-------------|
| `product_id` | string | Original Sentinel-2 product identifier |
| `timestamp` | datetime | Acquisition timestamp |
| `grid_cell` | string | Major-TOM grid cell identifier |
| `grid_row_u` | int16 | Grid row index |
| `grid_col_r` | int16 | Grid column index |
| `geometry` | geometry | WGS-84 polygon (footprint) |
| `centre_lat` | float32 | Latitude of patch centre |
| `centre_lon` | float32 | Longitude of patch centre |
| `utm_footprint` | string | Original UTM footprint as WKT |
| `utm_crs` | string | UTM CRS (e.g. EPSG:32633) |
| `pixel_bbox` | `list<int>` | Pixel bounding box [x_min, y_min, x_max, y_max] |
| `parquet_url` | string | Path to the image shard containing this patch |
| `parquet_row` | int64 | Row index within the image shard |

### Image Patch Schema (`images_249k/part_*.parquet`)

Each row contains a single 384 × 384 Sentinel-2 L2A patch stored as a 3-D array with shape `(384, 384, 12)` in uint16 format. The 12 bands follow the order:

`[B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12]`

## Pre-computed Embedding Datasets

The following embedding datasets were computed from **Core-S2L2A-249k** using various foundation models. All embeddings share the same 248,719 samples and geospatial metadata, enabling fair cross-model comparison.

| Filename | Embedding Model | Crop Size | Model Input Size | Embedding Dim | Source |
|----------|-----------------|-----------|------------------|---------------|--------|
| `SigLIP_crop_384x384.parquet` | [SigLIP (ViT-SO400M-14-SigLIP-384)](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) | 384×384 | 384×384 | 1152 | [Major-TOM/Core-S2RGB-249k-SigLIP](https://modelscope.cn/datasets/Major-TOM/Core-S2RGB-249k-SigLIP) |
| `FarSLIP_crop_384x384.parquet` | [FarSLIP (ViT-B-16)](https://huggingface.co/ZhenShiL/FarSLIP) | 384×384 | 224×224 | 512 | [Major-TOM/Core-S2RGB-249k-FarSLIP](https://modelscope.cn/datasets/Major-TOM/Core-S2RGB-249k-FarSLIP) |
| `DINOv2_crop_384x384.parquet` | [DINOv2-large](https://huggingface.co/facebook/dinov2-large) | 384×384 | 224×224 | 1024 | [Major-TOM/Core-S2RGB-249k-DINOv2](https://modelscope.cn/datasets/Major-TOM/Core-S2RGB-249k-DINOv2) |
| `SatCLIP_crop_384x384.parquet` | [SatCLIP (ViT16-L40)](https://github.com/microsoft/satclip) | 384×384 | 224×224 | 256 | [Major-TOM/Core-S2L2A-249k-SatCLIP](https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-249k-SatCLIP) |
| `Clay_crop_384x384.parquet` | [Clay v1.5](https://github.com/Clay-foundation/model) | 384×384 | 384×384 | 1024 | [Major-TOM/Core-S2L2A-249k-Clay-v1_5](https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-249k-Clay-v1_5) |
| `OLMoEarth_Base_crop_384x384.parquet` | [OLMoEarth-Base](https://huggingface.co/allenai/OLMoEarth-Base-WS) | 384×384 | 128×128 | 768 | [Major-TOM/Core-S2L2A-249k-OlmoEarth](https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-249k-OlmoEarth) |

> **Note**: SigLIP, FarSLIP, and DINOv2 use RGB-only inputs (B04, B03, B02). SatCLIP, Clay, and OLMoEarth use multi-spectral inputs.

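To make the RGB-only note concrete, here is a minimal sketch of how the B04, B03, and B02 channels could be picked out of a 12-band patch. It assumes the `(384, 384, 12)` array layout and band order given in the image patch schema. The helper `to_rgb`, its `scale` display stretch, and the synthetic patch are illustrative assumptions; they are not part of the dataset or of any model's actual preprocessing.

```python
import numpy as np

# Band order as documented in the image patch schema.
BANDS = ["B01", "B02", "B03", "B04", "B05", "B06",
         "B07", "B08", "B8A", "B09", "B11", "B12"]


def to_rgb(patch: np.ndarray, scale: float = 3000.0) -> np.ndarray:
    """Select B04/B03/B02 from an (H, W, 12) patch and return a uint8 RGB image.

    `scale` is an assumed display stretch, not a value from the dataset card.
    """
    if patch.ndim != 3 or patch.shape[-1] != len(BANDS):
        raise ValueError(f"expected (H, W, {len(BANDS)}), got {patch.shape}")
    # Channel indices for red, green, blue in the documented band order.
    idx = [BANDS.index(b) for b in ("B04", "B03", "B02")]
    rgb = patch[..., idx].astype(np.float32) / scale
    return (np.clip(rgb, 0.0, 1.0) * 255).astype(np.uint8)


# Synthetic stand-in for one patch (random values, not real imagery).
rng = np.random.default_rng(0)
patch = rng.integers(0, 10000, size=(384, 384, 12), dtype=np.uint16)
rgb = to_rgb(patch)
print(rgb.shape, rgb.dtype)  # (384, 384, 3) uint8
```

Looking the indices up by band name rather than hard-coding `[3, 2, 1]` keeps the selection correct if the band list is ever reordered.
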
## Usage

```python
import pandas as pd

# Load metadata
meta = pd.read_parquet("metadata_249k.parquet")
print(len(meta), "patches")

# Load image patches from a shard
images = pd.read_parquet("images_249k/part_00001.parquet")
# Per-band column access as given in the original example; the image patch schema
# describes a single (384, 384, 12) array per row, so check images.columns first.
print(images.iloc[0]["B04"].shape)  # (384, 384)
```

## Web Application

Explore these embeddings interactively with the [EarthEmbeddingExplorer](https://modelscope.ai/studios/Major-TOM/EarthEmbeddingExplorer) web application, which supports text, image, and geolocation queries.

## Citation

If you use this dataset, please cite both the EarthEmbeddingExplorer tutorial paper and the original Major-TOM paper:

```bibtex
@article{zheng2026earthembeddingexplorer,
  title={EarthEmbeddingExplorer: A Web Application for Cross-Modal Retrieval of Global Satellite Images},
  author={Zheng, Yijie and Wu, Weijie and Wu, Bingyue and Zhao, Long and Li, Guoqing and Czerkawski, Mikolaj and Klemmer, Konstantin},
  journal={arXiv preprint arXiv:2603.29441},
  year={2026},
  note={ICLR 2026 Workshop ML4RS Tutorial Track (oral)}
}
```

```bibtex
@inproceedings{francis2024majortom,
  title={Major TOM: Expandable Datasets for Earth Observation},
  author={Francis, Alistair and Czerkawski, Mikolaj and others},
  year={2024},
  booktitle={IGARSS 2024},
  eprint={2402.12095},
  archivePrefix={arXiv}
}
```

## License

This dataset is released under the [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
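
## Example: Locating Patches via the Metadata Index

The `parquet_url` and `parquet_row` columns of the metadata schema suggest a two-step lookup: filter the metadata first, then read only the shards that contain matching patches. The sketch below illustrates the filtering step on a tiny synthetic table. The helper `rows_by_shard`, all values in the table, and the relative shard paths are assumptions for illustration; the real `parquet_url` format may differ.

```python
import pandas as pd

# Tiny synthetic stand-in for metadata_249k.parquet; every value here is made up.
meta = pd.DataFrame({
    "grid_cell": ["A", "B", "C", "D"],
    "centre_lat": [48.1, 48.3, -12.0, 48.2],
    "centre_lon": [11.5, 11.7, 30.0, 11.6],
    "parquet_url": [
        "images_249k/part_00001.parquet",
        "images_249k/part_00002.parquet",
        "images_249k/part_00001.parquet",
        "images_249k/part_00001.parquet",
    ],
    "parquet_row": [0, 5, 7, 2],
})


def rows_by_shard(meta, lat_min, lat_max, lon_min, lon_max):
    """Return {shard path: sorted row indices} for patches centred inside the box."""
    inside = meta[
        meta["centre_lat"].between(lat_min, lat_max)
        & meta["centre_lon"].between(lon_min, lon_max)
    ]
    return {
        shard: sorted(group["parquet_row"].tolist())
        for shard, group in inside.groupby("parquet_url")
    }


wanted = rows_by_shard(meta, 48.0, 48.5, 11.0, 12.0)
print(wanted)
# {'images_249k/part_00001.parquet': [0, 2], 'images_249k/part_00002.parquet': [5]}
```

With the real files, each shard could then be opened once with `pd.read_parquet(shard)` and indexed with `.iloc[rows]`, assuming `parquet_row` is a zero-based positional index within the shard.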

Provider: maas

Created: 2026-04-17