
Core-S2L2A-MMEarth

ModelScope Community, updated 2025-12-05, indexed 2025-09-20
Download link:
https://modelscope.cn/datasets/Major-TOM/Core-S2L2A-MMEarth
Description:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6304c06eeb6d777a838eab63/sVrY-9IqX5W5W5Aj9yeNq.png)

# Core-S2L2A-MMEarth (Pooled) 🟥🟩🟦🟧🟨🟪 🛰️

> This is a pooled-down (about 10x) version of the computed dataset, published in this form because of storage constraints on HuggingFace. For full-size access, please visit [**Creodias EODATA**](https://creodias.eu/eodata/all-sources/).

## Input data

* Sentinel-2 (Level 2A) multispectral dataset with global coverage
* All samples from [**MajorTOM Core-S2L2A**](https://huggingface.co/datasets/Major-TOM/Core-S2L2A)
* Embedding_shape = **(320, 133, 133)**
* Pooled shape = **(320, 13, 13)**

## Metadata content

| Field | Type | Description |
|:-----------------:|:--------:|-----------------------------------------------------------------------------|
| unique_id | string | hash generated from geometry, time, product_id, and average embedding (320,1,1) |
| grid_cell | string | Major TOM cell |
| grid_row_u | int | Major TOM cell row |
| grid_col_r | int | Major TOM cell col |
| product_id | string | ID of the original product |
| timestamp | string | Timestamp of the sample |
| centre_lat | float | Latitude of the centre of the grid_cell |
| centre_lon | float | Longitude of the centre of the grid_cell |
| geometry | geometry | Polygon footprint (WGS84) of the grid_cell |
| utm_footprint | string | Polygon footprint (image UTM) of the grid_cell |
| utm_crs | string | CRS of the original product |
| file_name | string | Name of reference MajorTOM product |
| file_index | int | Position of the embedding within the .dat file |

## Model

The image encoder of the [**MMEarth model**](https://github.com/vishalned/MMEarth-train) was used to extract embeddings.

Model [**weights**](https://sid.erda.dk/cgi-sid/ls.py?share_id=g23YOnaaTp&current_dir=pt-all_mod_atto_1M_64_uncertainty_56-8&flags=f)

Weights info: **pt-all_mod_atto_1M_64_uncertainty_56-8**

- **INFO**: `pt-($INPUT)_($MODEL)_($DATA)_($LOSS)_($MODEL_IMG_SIZE)_($PATCH_SIZE)`
- **INPUT:** all_mod # s2-12 bands as input and all modalities as output
- **MODEL:** atto
- **DATA:** 1M_64 # MMEarth64, 1.2M locations and image size 64
- **LOSS:** uncertainty
- **MODEL_IMG_SIZE:** 56 # when using the data with image size 64
- **PATCH_SIZE:** 8

## Example Use

Interface scripts are available at

```python
import numpy as np

input_file_path = 'processed_part_00045_pooled.dat'  # Path to the saved .dat file

pooled_shape = (320, 13, 13)
embedding_size = np.prod(pooled_shape)  # Number of values per embedding
dtype_size = np.dtype(np.float32).itemsize  # Bytes per value

# Calculate the byte offset for the embedding you want to read
embedding_index = 4
offset = embedding_index * embedding_size * dtype_size

# Load the specific embedding
with open(input_file_path, 'rb') as f:
    f.seek(offset)
    embedding_data = np.frombuffer(f.read(embedding_size * dtype_size), dtype=np.float32)

embedding = embedding_data.reshape(pooled_shape)  # Reshape to the pooled embedding shape
embedding
```

## Generate Your Own Major TOM Embeddings

The [**embedder**](https://github.com/ESA-PhiLab/Major-TOM/tree/main/src/embedder) subpackage of Major TOM provides tools for generating embeddings like these. An example is given in a dedicated notebook at https://github.com/ESA-PhiLab/Major-TOM/blob/main/05-Generate-Major-TOM-Embeddings.ipynb.

[![GitHub](https://img.shields.io/badge/GitHub-Generate%20Your%20Own%20Embeddings-blue?logo=github&style=flat-square)](https://github.com/ESA-PhiLab/Major-TOM/blob/main/05-Generate-Major-TOM-Embeddings.ipynb)

---

## Major TOM Global Embeddings Project 🏭

This dataset is the result of a collaboration between [**CloudFerro**](https://cloudferro.com/) 🔶, [**asterisk labs**](https://asterisk.coop/) and [**Φ-lab, European Space Agency (ESA)**](https://philab.esa.int/) 🛰️, set up to provide open and free vectorised expansions of Major TOM datasets and to define a standardised manner for releasing Major TOM embedding expansions.

The embeddings extracted from common AI models make it possible to browse and navigate large datasets like Major TOM with reduced storage and computational demand.

The datasets were computed on the [**GPU-accelerated instances**](https://cloudferro.com/ai/ai-computing-services/)⚡ provided by [**CloudFerro**](https://cloudferro.com/) 🔶 on the [**CREODIAS**](https://creodias.eu/) cloud service platform 💻☁️. Discover more at [**CloudFerro AI services**](https://cloudferro.com/ai/).

## Authors

[**Mikolaj Czerkawski**](https://mikonvergence.github.io) (Asterisk Labs), [**Marcin Kluczek**](https://www.linkedin.com/in/marcin-kluczek-03852a1a8/) (CloudFerro), [**Jędrzej S. Bojanowski**](https://www.linkedin.com/in/j%C4%99drzej-s-bojanowski-a5059872/) (CloudFerro)

## Open Access Manuscript

This dataset is an output of the embedding expansion project outlined in: [https://arxiv.org/abs/2412.05600/](https://arxiv.org/abs/2412.05600/).

[![arXiv](https://img.shields.io/badge/arXiv-10.48550/arXiv.2412.05600-B31B1B.svg)](https://doi.org/10.48550/arXiv.2412.05600)

<details>
<summary>Read Abstract</summary>

> With the ever-increasing volumes of the Earth observation data present in the archives of large programmes such as Copernicus, there is a growing need for efficient vector representations of the underlying raw data. The approach of extracting feature representations from pretrained deep neural networks is a powerful approach that can provide semantic abstractions of the input data. However, the way this is done for imagery archives containing geospatial data has not yet been defined. In this work, an extension is proposed to an existing community project, Major TOM, focused on the provision and standardization of open and free AI-ready datasets for Earth observation. Furthermore, four global and dense embedding datasets are released openly and for free along with the publication of this manuscript, resulting in the most comprehensive global open dataset of geospatial visual embeddings in terms of covered Earth's surface.

</details>

If this dataset was useful for your work, it can be cited as:

```latex
@misc{EmbeddedMajorTOM,
  title={Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space},
  author={Mikolaj Czerkawski and Marcin Kluczek and Jędrzej S. Bojanowski},
  year={2024},
  eprint={2412.05600},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.05600},
}
```

Powered by [Φ-lab, European Space Agency (ESA) 🛰️](https://philab.esa.int/) in collaboration with [CloudFerro 🔶](https://cloudferro.com/) & [asterisk labs](https://asterisk.coop/)
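
As a supplement to the Example Use section, the sketch below shows one possible way to read several embeddings from a `.dat` file by their `file_index` without loading the whole file. It assumes that the file is a headerless, C-ordered sequence of `float32` embeddings of shape (320, 13, 13), which is what the byte-offset arithmetic in the example implies but is not otherwise documented here. The helper name `open_embeddings` and the synthetic demo file are hypothetical and introduced only for illustration.

```python
import os
import tempfile

import numpy as np

POOLED_SHAPE = (320, 13, 13)  # pooled embedding shape stated in this card
DTYPE = np.float32            # dtype used in the Example Use snippet


def open_embeddings(path, shape=POOLED_SHAPE, dtype=DTYPE):
    """Memory-map a .dat file as an array of shape (n_embeddings, *shape).

    Assumption: the file holds raw embeddings back to back, with no header.
    """
    bytes_per_embedding = int(np.prod(shape)) * np.dtype(dtype).itemsize
    file_size = os.path.getsize(path)
    if file_size % bytes_per_embedding != 0:
        raise ValueError("file size is not a whole number of embeddings")
    n_embeddings = file_size // bytes_per_embedding
    return np.memmap(path, dtype=dtype, mode="r", shape=(n_embeddings, *shape))


# Demonstration on a small synthetic file (not real dataset content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_pooled.dat")
    rng = np.random.default_rng(0)
    demo = rng.random((6, *POOLED_SHAPE), dtype=np.float32)
    demo.tofile(demo_path)  # writes raw C-ordered bytes, no header

    embeddings = open_embeddings(demo_path)
    n_found = embeddings.shape[0]     # number of embeddings inferred from file size
    picked = np.array(embeddings[4])  # copy the embedding at file_index 4 into memory
    del embeddings                    # release the mapping before the temp dir is removed
```

With a real file, the `file_index` values from the metadata table would take the place of the hard-coded `4`; indexing the memory-mapped array only reads the bytes of the requested embeddings from disk.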

Provider:
maas
Created:
2025-08-26