Core-S2RGB-SigLIP
Source: ModelScope community (updated 2026-01-06, indexed 2025-12-06)
Download link:
https://modelscope.cn/datasets/Major-TOM/Core-S2RGB-SigLIP

# Core-S2RGB-SigLIP 🔴🟢🔵
| Modality | Number of Embeddings | Sensing Type | Comments | Source Dataset | Source Model | Size |
|:---------------------:|:------------------:|:--------------:|:----------:|:--------------:|:----------:|:--------------:|
| Sentinel-2 Level 2A (RGB) | 20,212,974 | True Colour | Vision-Language Global | [Core-S2L2A](https://huggingface.co/datasets/Major-TOM/Core-S2L2A) | [SigLIP-SO400M-384](https://huggingface.co/docs/transformers/en/model_doc/siglip) | 41.3 GB|
## Content
| Field | Type | Description |
|:-----------------:|:--------:|-----------------------------------------------------------------------------|
| unique_id | string | hash generated from geometry, time, product_id, and embedding model |
| embedding | array | raw embedding array |
| grid_cell | string | Major TOM cell |
| grid_row_u | int | Major TOM cell row |
| grid_col_r | int | Major TOM cell col |
| product_id | string | ID of the original product |
| timestamp | string | Timestamp of the sample |
| centre_lat | float | Centre of the fragment latitude |
| centre_lon | float | Centre of the fragment longitude |
| geometry | geometry | Polygon footprint (WGS84) of the fragment |
| utm_footprint | string | Polygon footprint (image UTM) of the fragment |
| utm_crs | string | CRS of the original product |
| pixel_bbox | bbox | Boundary box of the fragment (pixels) |
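The `centre_lat` and `centre_lon` fields make it straightforward to select embeddings for a region of interest. The following is a minimal sketch, assuming each row is available as a dictionary keyed by the field names in the table above; the helper `rows_in_bbox` and the sample values are hypothetical and not part of the Major TOM tooling.

```python
from typing import Any, Iterable, Iterator, Mapping


def rows_in_bbox(
    rows: Iterable[Mapping[str, Any]],
    min_lat: float,
    max_lat: float,
    min_lon: float,
    max_lon: float,
) -> Iterator[Mapping[str, Any]]:
    """Yield rows whose fragment centre lies inside the lat/lon box (inclusive)."""
    for row in rows:
        lat_ok = min_lat <= row["centre_lat"] <= max_lat
        lon_ok = min_lon <= row["centre_lon"] <= max_lon
        if lat_ok and lon_ok:
            yield row


# Synthetic rows for illustration only (IDs and coordinates are made up).
sample_rows = [
    {"unique_id": "a", "centre_lat": 52.2, "centre_lon": 21.0},
    {"unique_id": "b", "centre_lat": 41.9, "centre_lon": 12.5},
    {"unique_id": "c", "centre_lat": -33.9, "centre_lon": 151.2},
]

selected = list(rows_in_bbox(sample_rows, 35.0, 60.0, 0.0, 30.0))
print([row["unique_id"] for row in selected])  # ['a', 'b']
```

This simple comparison does not handle boxes that cross the antimeridian (longitude ±180°); such a region would need to be split into two boxes.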
## Input Data
* Sentinel-2 (Level 2A) RGB reflectance multiplied by 2.5 and clipped between 0 and 1 to resemble images in the training data
* All samples from [**MajorTOM Core-S2L2A**](https://huggingface.co/datasets/Major-TOM/Core-S2L2A)
* Image input size: **384 x 384** pixels, target overlap: 10%, border_shift: True
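The reflectance scaling described in the first bullet can be sketched as follows. This is an illustration, not the project's own preprocessing code: the function name `to_true_colour` is hypothetical, and the input is assumed to already be expressed as surface reflectance (roughly in the 0 to 1 range) rather than raw digital numbers.

```python
import numpy as np


def to_true_colour(reflectance: np.ndarray, gain: float = 2.5) -> np.ndarray:
    """Multiply reflectance by `gain` and clip the result to [0, 1]."""
    return np.clip(reflectance * gain, 0.0, 1.0)


# Four example reflectance values: dark, typical, bright, saturated.
rgb = np.array([0.0, 0.1, 0.4, 0.8])
scaled = to_true_colour(rgb)
print(scaled)
```

With a gain of 2.5, any reflectance of 0.4 or higher saturates at 1, which brightens typical land surfaces at the cost of detail in very bright targets such as clouds or snow.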
## Model
The image encoder of the [**SigLIP**](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) vision-language model was used to extract the embeddings.
As a result, these embeddings can be analysed together with the output of the text encoder, as is often done with natural images.
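One common way to combine the two encoders is to rank image embeddings by cosine similarity to a text embedding. The sketch below assumes the text embedding was produced separately by the text encoder of the same SigLIP checkpoint; here it is replaced by a toy vector, and the function names `l2_normalise` and `rank_by_text` are hypothetical.

```python
import numpy as np


def l2_normalise(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def rank_by_text(image_embeddings: np.ndarray, text_embedding: np.ndarray, top_k: int = 3):
    """Return indices and cosine similarities of the `top_k` best-matching images."""
    sims = l2_normalise(image_embeddings) @ l2_normalise(text_embedding)
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]


# Toy 4-dimensional vectors standing in for real SigLIP embeddings.
images = np.array(
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.6, 0.8, 0.0, 0.0]],
    dtype=np.float32,
)
text = np.array([0.0, 2.0, 0.0, 0.0], dtype=np.float32)

indices, scores = rank_by_text(images, text, top_k=2)
print(indices, scores)
```

Cosine similarity only gives a relative ranking. Turning it into a calibrated match probability would additionally require the scale and bias learned by the model, which this sketch does not use.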
## Example Use
Interface scripts are available at
```python
from datasets import load_dataset
dataset = load_dataset("Major-TOM/Core-S2RGB-SigLIP")
```
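Since the full dataset is 41.3 GB, it can be convenient to stream rows and stack only a subset of embeddings into a matrix. The sketch below is one possible approach: `stack_embeddings` and `stream_rows` are hypothetical helpers, and the split name `"train"` is an assumption that should be checked against the dataset repository. The demo at the bottom runs offline on synthetic rows.

```python
from itertools import islice
from typing import Any, Iterable, Mapping

import numpy as np


def stack_embeddings(rows: Iterable[Mapping[str, Any]], limit: int) -> np.ndarray:
    """Collect the `embedding` field of the first `limit` rows into a 2-D float32 array."""
    vectors = [np.asarray(row["embedding"], dtype=np.float32) for row in islice(rows, limit)]
    return np.stack(vectors)


def stream_rows(split: str = "train"):
    """Open the dataset in streaming mode (the split name is an assumption)."""
    # Imported lazily so the offline demo below runs without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("Major-TOM/Core-S2RGB-SigLIP", split=split, streaming=True)


# Offline demo with synthetic rows; real rows would come from stream_rows().
fake_rows = ({"embedding": [i, i + 1, i + 2]} for i in range(10))
matrix = stack_embeddings(fake_rows, limit=4)
print(matrix.shape)  # (4, 3)
```

With network access, `stack_embeddings(stream_rows(), limit=1000)` would fetch only as many rows as requested instead of downloading the whole dataset.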
## Generate Your Own Major TOM Embeddings
The [**embedder**](https://github.com/ESA-PhiLab/Major-TOM/tree/main/src/embedder) subpackage of Major TOM provides tools for generating embeddings like these. A dedicated notebook shows an example: https://github.com/ESA-PhiLab/Major-TOM/blob/main/05-Generate-Major-TOM-Embeddings.ipynb.
---
## Major TOM Global Embeddings Project 🏭
This dataset is the result of a collaboration between [**CloudFerro**](https://cloudferro.com/) 🔶 and [**Φ-lab, European Space Agency (ESA)**](https://philab.esa.int/) 🛰️, set up to provide open and free vectorised expansions of Major TOM datasets and to define a standardised way of releasing Major TOM embedding expansions.
The embeddings extracted from common AI models make it possible to browse and navigate large datasets like Major TOM with reduced storage and computational demand.
The datasets were computed on the [**GPU-accelerated instances**](https://cloudferro.com/ai/ai-computing-services/)⚡ provided by [**CloudFerro**](https://cloudferro.com/) 🔶 on the [**CREODIAS**](https://creodias.eu/) cloud service platform 💻☁️.
Discover more at [**CloudFerro AI services**](https://cloudferro.com/ai/).
## Authors
[**Mikolaj Czerkawski**](https://mikonvergence.github.io) (Φ-lab, European Space Agency), [**Marcin Kluczek**](https://www.linkedin.com/in/marcin-kluczek-03852a1a8/) (CloudFerro), [**Jędrzej S. Bojanowski**](https://www.linkedin.com/in/j%C4%99drzej-s-bojanowski-a5059872/) (CloudFerro)
## Open Access Manuscript
This dataset is an output from the embedding expansion project outlined in: [https://arxiv.org/abs/2412.05600/](https://arxiv.org/abs/2412.05600/).
DOI: [10.48550/arXiv.2412.05600](https://doi.org/10.48550/arXiv.2412.05600)
<details>
<summary>Read Abstract</summary>
> With the ever-increasing volumes of the Earth observation data present in the archives of large programmes such as Copernicus, there is a growing need for efficient vector representations of the underlying raw data. The approach of extracting feature representations from pretrained deep neural networks is a powerful approach that can provide semantic abstractions of the input data. However, the way this is done for imagery archives containing geospatial data has not yet been defined. In this work, an extension is proposed to an existing community project, Major TOM, focused on the provision and standardization of open and free AI-ready datasets for Earth observation. Furthermore, four global and dense embedding datasets are released openly and for free along with the publication of this manuscript, resulting in the most comprehensive global open dataset of geospatial visual embeddings in terms of covered Earth's surface.
</details>
If this dataset was useful for your work, it can be cited as:
```latex
@misc{EmbeddedMajorTOM,
title={Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space},
author={Mikolaj Czerkawski and Marcin Kluczek and Jędrzej S. Bojanowski},
year={2024},
eprint={2412.05600},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.05600},
}
```
Powered by [Φ-lab, European Space Agency (ESA) 🛰️](https://philab.esa.int/) in collaboration with [CloudFerro 🔶](https://cloudferro.com/)

Provider: maas
Created: 2025-08-26



