TerraMesh

Name: TerraMesh
Creator: maas
Published: 2025-10-10 22:09:28
License: No description available

ModelScope community. Updated: 2025-10-10. Indexed: 2025-08-30.

Download link:

https://modelscope.cn/datasets/ibm-esa-geospatial/TerraMesh


Resource description:

# TerraMesh

A planetary‑scale, multimodal analysis‑ready dataset for Earth‑Observation foundation models.

**TerraMesh** merges data from **Sentinel‑1 SAR, Sentinel‑2 optical, Copernicus DEM, NDVI, and land‑cover** sources into more than **9 million co‑registered patches** ready for large‑scale representation learning.

You can find more information about the data sampling and preprocessing in our paper: [TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data](https://arxiv.org/abs/2504.11172).

![Examples from TerraMesh](assets%2Fexamples.png)

Samples from the TerraMesh dataset with seven spatiotemporally aligned modalities. Sentinel-2 L2A uses IRRG pseudo-coloring and Sentinel-1 RTC is visualized in dB scale as VH-VV-VV/VH. Copernicus DEM is scaled based on the image value range with an additional 10 meter buffer to highlight flat scenes.

---

## Dataset organisation

The archive ships two top‑level splits, `train/` and `val/`, each holding one folder per modality. `terramesh.py` includes code for data loading, see [Usage](#Usage).

```text
TerraMesh
├── train
│   ├── DEM
│   ├── LULC
│   ├── NDVI
│   ├── S1GRD
│   ├── S1RTC
│   ├── S2L1C
│   ├── S2L2A
│   └── S2RGB
├── val
│   ├── DEM
│   └── ...
└── terramesh.py
```

Each folder includes up to 889 shard files, containing up to 10240 samples each. Samples from MajorTom-Core are stored in shards with the pattern `majortom_{split}_{id}.tar`, while shards with SSL4EO-S12 samples start with `ssl4eos12_`. Samples are stored as Zarr Zip files which can be loaded with `zarr`. An example sample has a size of 283 kB and the dimensions `(band: 2, time: 1, y: 264, x: 264)`.

---

## Usage

### Download

You can download the dataset with the Hugging Face CLI tool. Please note that the full dataset requires 17 TB of storage.

```shell
hf download ibm-esa-geospatial/TerraMesh --repo-type dataset --local-dir data/TerraMesh
```

If you would like to download only a subset of the data, you can specify it with `--include`.
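As a rough illustration of which files such a pattern selects, the sketch below applies shell-style matching from Python's `fnmatch` module to a handful of hypothetical repository paths. The shard ids are invented and only follow the naming pattern described above; the `select` helper is not part of TerraMesh or the Hugging Face tooling, and the CLI's matching may differ in edge cases.

```python
from fnmatch import fnmatchcase

# Hypothetical repository paths; the shard ids are invented for illustration.
repo_files = [
    "terramesh.py",
    "train/S2L2A/majortom_train_000001.tar",
    "train/DEM/majortom_train_000001.tar",
    "val/S2L2A/majortom_val_000001.tar",
    "val/S1RTC/ssl4eos12_val_000001.tar",
]


def select(files, pattern):
    """Return the paths that match a shell-style pattern (case-sensitive)."""
    return [f for f in files if fnmatchcase(f, pattern)]


# "*" also matches "/" here, so "val/*" covers every modality folder in val.
val_only = select(repo_files, "val/*")
s2l2a_only = select(repo_files, "*/S2L2A/*")

print(val_only)
print(s2l2a_only)
```

Note that neither pattern matches `terramesh.py`, so the loader file may need to be fetched separately when only a subset is downloaded.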
```shell
# Only download val data
hf download ibm-esa-geospatial/TerraMesh --repo-type dataset --include "val/*" --local-dir data/TerraMesh

# Only download a single modality (e.g., S2L2A)
hf download ibm-esa-geospatial/TerraMesh --repo-type dataset --include "*/S2L2A/*" --local-dir data/TerraMesh
```

It is recommended to be logged in with the Hugging Face CLI tool (check with `hf auth whoami`) to avoid hitting download limits, and to use a reduced number of workers (`--max-workers 4`). If the download fails, rerun the command to resume it automatically.

### Data loader

We provide the data loading code in `terramesh.py`, which is downloaded together with the dataset. If you stream the data instead (e.g., during development), you can download the file via this [link](https://huggingface.co/datasets/ibm-esa-geospatial/TerraMesh/resolve/main/terramesh.py) or with:

```shell
wget https://huggingface.co/datasets/ibm-esa-geospatial/TerraMesh/resolve/main/terramesh.py
```

You can use the `build_terramesh_dataset` function to initialize a dataset, which uses the WebDataset package to load samples from the shard files. You can stream the data from Hugging Face using the URLs or download the full dataset and pass a local path (e.g., `data/TerraMesh/`).
```python
from terramesh import build_terramesh_dataset
from torch.utils.data import DataLoader

# If you only pass one modality, the modality is loaded with the "image" key
dataset = build_terramesh_dataset(
    path="https://huggingface.co/datasets/ibm-esa-geospatial/TerraMesh/resolve/main/",  # Streaming or local path
    modalities=["S2L2A"],
    split="val",
    shuffle=False,  # Set false for split="val"
    batch_size=8
)
# Batch keys: ["__key__", "__url__", "image"]

# If you pass multiple modalities, the modalities are returned using the modality names as keys
dataset = build_terramesh_dataset(
    path="https://huggingface.co/datasets/ibm-esa-geospatial/TerraMesh/resolve/main/",  # Streaming or local path
    modalities=["S2L2A", "S2L1C", "S2RGB", "S1GRD", "S1RTC", "DEM", "NDVI", "LULC"],
    shuffle=False,  # Set false for split="val"
    split="val",
    batch_size=8
)

# Set batch size to None because batching is handled by WebDataset.
dataloader = DataLoader(dataset, batch_size=None, num_workers=4, persistent_workers=True, prefetch_factor=1)

# Iterate over the dataloader
for batch in dataloader:
    print("Batch keys:", list(batch.keys()))
    # Batch keys: ["__key__", "__url__", "S2L2A", "S2L1C", "S2RGB", "S1RTC", "DEM", "NDVI", "LULC"]
    # Because S1RTC and S1GRD are not present for all samples, each batch only includes one S1 version.

    print("Data shape:", batch["S2L2A"].shape)
    # Data shape: torch.Size([8, 12, 264, 264])
    # Dimensions [batch, channel, h, w]. The code removes the time dim from the source data.
    break
```

### Data transform

We provide some additional code for wrapping `albumentations` transform functions. We recommend albumentations because parameters are shared between all image modalities (e.g., the same random crop). However, it requires some wrapper code to bring the data into the expected shape.
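To make the idea of shared parameters concrete, here is a minimal NumPy-only sketch, independent of `terramesh.py` and albumentations. It draws one crop window and applies it to every modality after moving the channels last. The `shared_random_crop` helper and the random sample are hypothetical, and the single DEM channel is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: channel-first arrays that share one 264 x 264 grid.
sample = {
    "S2L2A": rng.normal(size=(12, 264, 264)).astype(np.float32),
    "S1RTC": rng.normal(size=(2, 264, 264)).astype(np.float32),
    "DEM": rng.normal(size=(1, 264, 264)).astype(np.float32),  # assumed 1 channel
}


def shared_random_crop(sample, size, rng):
    """Crop every modality with the same randomly drawn window."""
    height, width = next(iter(sample.values())).shape[-2:]
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    cropped = {}
    for name, array in sample.items():
        channel_last = np.transpose(array, (1, 2, 0))  # (C, H, W) -> (H, W, C)
        window = channel_last[top:top + size, left:left + size]
        cropped[name] = np.transpose(window, (2, 0, 1))  # back to (C, H, W)
    return cropped, (top, left)


cropped, (top, left) = shared_random_crop(sample, 224, rng)
for name, array in cropped.items():
    print(name, array.shape)
```

Because the window is drawn once per sample rather than once per modality, the cropped arrays stay pixel-aligned, which is the property the albumentations wrapper in the following block relies on.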
```python
import albumentations as A
from albumentations.pytorch import ToTensorV2
from terramesh import build_terramesh_dataset, Transpose, MultimodalTransforms, MultimodalNormalize, statistics

# Define all image modalities
modalities = ["S2L2A", "S2L1C", "S2RGB", "S1GRD", "S1RTC", "DEM", "NDVI", "LULC"]

# Define multimodal transform function that converts the data into the expected shape from albumentations
val_transform = MultimodalTransforms(
    transforms=A.Compose([  # We use albumentations because of the shared transform between image modalities
        Transpose([1, 2, 0]),  # Convert data to channel last (expected shape from albumentations)
        MultimodalNormalize(mean=statistics["mean"], std=statistics["std"]),
        A.CenterCrop(224, 224),  # Use center crop in val split
        # A.RandomCrop(224, 224),  # Use random crop in train split
        # A.D4(),  # Optionally, use random flipping and rotation for the train split
        ToTensorV2(),  # Convert to tensor and back to channel first
    ],
        is_check_shapes=False,  # Not needed because of aligned data in TerraMesh
        additional_targets={m: "image" for m in modalities}
    ),
    non_image_modalities=["__key__", "__url__"],  # Additional non-image keys
)

dataset = build_terramesh_dataset(
    path="https://huggingface.co/datasets/ibm-esa-geospatial/TerraMesh/resolve/main/",
    modalities=modalities,
    split="val",
    transform=val_transform,
    batch_size=8,
)
```

If you only use a single modality, you don't need to specify `additional_targets`. You need to change the normalization as follows, where `<modality>` stands for the name of the modality you load (e.g., `S2L2A`):

```python
MultimodalNormalize(
    mean={"image": statistics["mean"]["<modality>"]},
    std={"image": statistics["std"]["<modality>"]}
),
```

### Returning metadata

You can pass `return_metadata=True` to `build_terramesh_dataset()` to load center longitude and latitude, timestamps, and the S2 cloud mask as additional metadata. The resulting batch keys include: `["__key__", "__url__", "S2L2A", "S1RTC", ..., "center_lon", "center_lat", "cloud_mask", "time_S2L2A", "time_S1RTC", ...]`.
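With metadata enabled, a batch mixes image-like arrays with strings, coordinates, and timestamps, and only the image-like entries should pass through spatial transforms such as cropping. The plain-Python sketch below separates the two groups for a hypothetical list of batch keys; `split_keys` is an illustrative helper, not part of `terramesh.py`.

```python
# Modalities requested in this hypothetical setup.
modalities = ["S2L2A", "S1RTC", "DEM"]

# Hypothetical batch keys, following the key list given above.
batch_keys = [
    "__key__", "__url__", "S2L2A", "S1RTC", "DEM",
    "center_lon", "center_lat", "cloud_mask",
    "time_S2L2A", "time_S1RTC",
]


def split_keys(keys, modalities):
    """Separate image-like keys from keys that must bypass image transforms."""
    image_like = set(modalities) | {"cloud_mask"}  # the cloud mask is a raster too
    image_keys = [k for k in keys if k in image_like]
    other_keys = [k for k in keys if k not in image_like]
    return image_keys, other_keys


image_keys, other_keys = split_keys(batch_keys, modalities)
print("Transformed as images:", image_keys)
print("Passed through unchanged:", other_keys)
```

The first group corresponds to what is registered under `additional_targets`, and the second to what is listed in `non_image_modalities`.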
Therefore, you need to update the `transform` if you use one:

```python
val_transform = MultimodalTransforms(
    transforms=A.Compose([...],
        additional_targets={m: "image" for m in modalities + ["cloud_mask"]}
    ),
    non_image_modalities=["__key__", "__url__", "center_lon", "center_lat"] + ["time_" + m for m in modalities]
)
```

For a single-modality dataset, "time" does not have a suffix and the following changes to the `transform` are required:

```python
val_transform = MultimodalTransforms(
    transforms=A.Compose([...],
        additional_targets={"cloud_mask": "image"}
    ),
    non_image_modalities=["__key__", "__url__", "center_lon", "center_lat", "time"]
)
```

Note that center points are not updated when random crop is used. The cloud mask provides the classes land (0), water (1), snow (2), thin cloud (3), thick cloud (4), cloud shadow (5), and no data (6). DEM does not return a time value, while LULC uses the S2 timestamp because of the augmentation using the S2 cloud and ice mask. Time values are returned as integer values but can be converted back to datetime with

```python
batch["time_S2L2A"].numpy().astype("datetime64[ns]")
```

If you have any issues with data loading, please create a discussion in the community tab and tag `@blumenstiel`.

---

## Citation

If you use TerraMesh, please cite:

```bibtex
@article{blumenstiel2025terramesh,
  title={Terramesh: A planetary mosaic of multimodal earth observation data},
  author={Blumenstiel, Benedikt and Fraccaro, Paolo and Marsocci, Valerio and Jakubik, Johannes and Maurogiovanni, Stefano and Czerkawski, Mikolaj and Sedona, Rocco and Cavallaro, Gabriele and Brunschwiler, Thomas and Bernabe-Moreno, Juan and others},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year={2025},
}
```

---

## License

TerraMesh is released under the **Creative Commons Attribution‑ShareAlike 4.0 (CC‑BY‑SA‑4.0)** license.
---

## Acknowledgements

TerraMesh is part of the **FAST‑EO** project funded by the European Space Agency Φ‑Lab (contract #4000143501/23/I‑DT).

The satellite data (S2L1C, S2L2A, S1GRD, S1RTC) is sourced from the [SSL4EO‑S12 v1.1](https://huggingface.co/datasets/embed2scale/SSL4EO-S12-v1.1) (CC-BY-4.0) and [MajorTOM‑Core](https://huggingface.co/Major-TOM) (CC-BY-SA-4.0) datasets.

The LULC data is provided by [ESRI, Impact Observatory, and Microsoft](https://planetarycomputer.microsoft.com/dataset/io-lulc-annual-v02) (CC-BY-4.0).

The cloud masks used for augmenting the LULC maps are provided as metadata and are produced using the [SEnSeIv2](https://github.com/aliFrancis/SEnSeIv2/tree/main) model.

The DEM data is produced using [Copernicus WorldDEM-30](https://dataspace.copernicus.eu/explore-data/data-collections/copernicus-contributing-missions/collections-description/COP-DEM) © DLR e.V. 2010-2014 and © Airbus Defence and Space GmbH 2014-2018 provided under COPERNICUS by the European Union and ESA; all rights reserved.


Provider:

maas

Created:

2025-08-18
