Major-TOM/index
Hugging Face · Updated 2026-04-10 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Major-TOM/index
Description:
# Major TOM — Index
The Major TOM Index is a global metadata catalog for the [Major TOM](https://github.com/ESA-PhiLab/Major-TOM) grid at 10 km resolution. It provides a single entry point to discover, filter, and select tiles across sensors, locations, and time without downloading any imagery.
The index covers over 5 million tiles spanning the entire Earth. Each tile corresponds to a 1056 × 1056 px patch (10.56 × 10.56 km) aligned to Sentinel-2 MGRS tiles at 10 m resolution. Every tile is enriched with terrain, climate, soil, socioeconomic, and administrative attributes derived from public Earth Engine datasets.
**What can you do with this index?**
- **Find tiles by location.** Filter by country, state, MGRS tile code, or bounding box using the GeoParquet geometry column.
- **Select tiles by environmental criteria.** Want arid, high-elevation tiles? Filter by `climate:precipitation < 200` and `terrain:elevation > 3000`.
- **Stratify sampling for training sets.** Use the enrichment columns to build geographically and environmentally balanced splits for foundation model pretraining.
- **Link to imagery.** The `land_s2` and `land_l8` files include sensor-specific image IDs (`s2:id_gee`, `l8:id_gee`) that point directly to the source products in Google Earth Engine.
- **Use the ELLIOT splits.** The `elliot.parquet` file provides pre-built monotemporal and temporal splits designed for multi-sensor, multi-temporal EO research.
All files are self-contained GeoParquet with ZSTD compression, sorted by `majortom:code_1000km` → `majortom:code_100km` → `id` for efficient spatial predicate pushdown.
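The stratified-sampling idea from the list above can be sketched with the standard library alone. This is a minimal illustration on a few hand-made rows, not real index data: the bin edges, the `stratified_sample` helper, and the per-stratum quota are assumptions chosen for the example. Only the column names `id`, `terrain:elevation`, and `climate:precipitation` come from the schema.

```python
import random
from collections import defaultdict

# Hypothetical bin edges; pick values that suit your own study.
ELEVATION_EDGES = (500, 1500, 3000)  # metres
PRECIP_EDGES = (200, 800, 2000)      # mm/year


def bin_index(value, edges):
    """Return the index of the bin that `value` falls into."""
    return sum(value >= edge for edge in edges)


def stratified_sample(rows, per_stratum, seed=0):
    """Draw up to `per_stratum` tiles from each (elevation, precipitation) bin."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        key = (
            bin_index(row["terrain:elevation"], ELEVATION_EDGES),
            bin_index(row["climate:precipitation"], PRECIP_EDGES),
        )
        strata[key].append(row)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Toy rows standing in for records read from the index (values are made up).
rows = [
    {"id": "tile_a", "terrain:elevation": 120.0, "climate:precipitation": 1500.0},
    {"id": "tile_b", "terrain:elevation": 150.0, "climate:precipitation": 1400.0},
    {"id": "tile_c", "terrain:elevation": 3900.0, "climate:precipitation": 90.0},
    {"id": "tile_d", "terrain:elevation": 4100.0, "climate:precipitation": 150.0},
    {"id": "tile_e", "terrain:elevation": 900.0, "climate:precipitation": 2600.0},
]
picked = stratified_sample(rows, per_stratum=1)
print([row["id"] for row in picked])
```

With real data the same grouping can be done on a DataFrame; the fixed seed here only keeps the draw reproducible.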
## Schema
Columns are organized into namespaces. Each namespace groups related attributes.
### Grid (`majortom:`)
Tile identity and spatial reference within the Major TOM grid system.
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique tile identifier (e.g. `MT10_770U_395R`). |
| `majortom:code_100km` | string | Parent 100 km grid cell. Used for spatial grouping. |
| `majortom:code_1000km` | string | Parent 1000 km grid cell. Used for coarse-level partitioning. |
| `majortom:crs` | string | Native UTM CRS of the tile (e.g. `EPSG:32647`). |
| `majortom:mgrs_tile` | string | MGRS tile code (e.g. `47WNS`). Links to Sentinel-2 tiling grid. |
| `majortom:mgrs_n` | uint8 | Number of overlapping MGRS tiles (1 after deduplication). |
| `majortom:mgrs_candidates` | list\<string\> | All candidate MGRS tiles before deduplication. |
| `majortom:footprint_pct` | float | Percentage of tile covered by the assigned MGRS tile. |
| `majortom:geotransform` | list\<int32\> | Snapped affine geotransform [originX, scaleX, shearX, originY, shearY, scaleY]. |
| `majortom:geotransform_raw` | list\<double\> | Original (unsnapped) affine geotransform. |
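The geotransform can be turned into tile bounds with two lines of arithmetic. The sketch below assumes the element order documented for `majortom:geotransform` and the 1056 × 1056 px tile size stated in the introduction. The helper names and the example values are hypothetical and not taken from the index.

```python
def pixel_to_map(geotransform, col, row):
    """Map a pixel position (col, row) to projected coordinates (x, y)."""
    origin_x, scale_x, shear_x, origin_y, shear_y, scale_y = geotransform
    x = origin_x + col * scale_x + row * shear_x
    y = origin_y + col * shear_y + row * scale_y
    return x, y


def tile_bounds(geotransform, width=1056, height=1056):
    """Return (min_x, min_y, max_x, max_y) of the tile in CRS units."""
    corners = [
        pixel_to_map(geotransform, col, row)
        for col in (0, width)
        for row in (0, height)
    ]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up geotransform: 10 m pixels, north-up, origin at the top-left corner.
example = [500000, 10, 0, 7000000, 0, -10]
print(tile_bounds(example))
```

The resulting bounds are expressed in the tile's native UTM CRS (`majortom:crs`), so they need reprojecting before they can be compared with longitude/latitude boxes.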
### STAC (`stac:`)
Spatial and temporal reference following STAC conventions. Present in `land_s2` and `land_l8` only, where it replaces the `majortom:` grid columns.
| Column | Type | Description |
|--------|------|-------------|
| `stac:crs` | string | Coordinate reference system. |
| `stac:geotransform` | list\<int64\> | Affine geotransform for the image patch. |
| `stac:tensor_shape` | list\<int32\> | Shape of the image tensor [bands, height, width]. |
| `stac:time_start` | int64 | Acquisition start time (Unix timestamp). |
| `stac:time_end` | int64 | Acquisition end time (Unix timestamp). |
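The table does not say whether the timestamps are in seconds or milliseconds. Earth Engine's own `system:time_start` property uses milliseconds, so it is worth checking a few values before converting. The helper below is a hedged sketch that guesses the unit from the magnitude; `to_utc` and the threshold are assumptions introduced here, not part of the dataset.

```python
from datetime import datetime, timezone

# Values at or above this are treated as milliseconds. 1e11 seconds is
# roughly the year 5138, so real second-based timestamps never reach it.
MILLISECOND_THRESHOLD = 100_000_000_000


def to_utc(timestamp):
    """Convert a Unix timestamp (seconds or milliseconds) to an aware datetime."""
    if abs(timestamp) >= MILLISECOND_THRESHOLD:
        timestamp = timestamp / 1000
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


print(to_utc(1_700_000_000))      # interpreted as seconds
print(to_utc(1_700_000_000_000))  # interpreted as milliseconds
```

The heuristic misreads millisecond timestamps earlier than March 1973, which is not a concern for Sentinel-2 or Landsat 8/9 acquisitions.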
### Sentinel-2 (`s2:`)
Sensor metadata for the assigned Sentinel-2 image. Present in `land_s2` only.
| Column | Type | Description |
|--------|------|-------------|
| `s2:id_gee` | string | Google Earth Engine image ID. Use this to fetch the actual imagery. |
| `s2:product_id` | string | ESA product identifier. |
| `s2:spacecraft` | string | Spacecraft name (Sentinel-2A or Sentinel-2B). |
| `s2:processing_baseline` | string | Processing baseline version. |
| `s2:orbit_number` | uint16 | Relative orbit number. |
| `s2:mean_solar_azimuth` | float | Mean solar azimuth angle, averaged across all bands and detectors (degrees). |
| `s2:mean_solar_zenith` | float | Mean solar zenith angle, averaged across all bands and detectors (degrees). |
| `s2:mean_view_azimuth` | float | Mean viewing azimuth angle from band B8 (degrees). |
| `s2:mean_view_zenith` | float | Mean viewing zenith angle from band B8 (degrees). |
| `s2:reflectance_conversion` | float | Reflectance conversion factor (U correction). |
> **Note on solar vs viewing angles.** The sun has a single position relative to the scene, so ESA provides one solar azimuth and one solar zenith averaged across all bands. Viewing angles are different: Sentinel-2 uses a pushbroom sensor where each spectral band has its own detector array in the focal plane, each observing from a slightly different angle. That is why GEE provides per-band viewing angles (`MEAN_INCIDENCE_*_ANGLE_B1` through `_B12`). We use band B8 (NIR, 10 m) as the reference because it is at native 10 m resolution and sits near the center of the focal plane, making it a representative proxy for the viewing geometry of the 10 m and 20 m bands.
### Landsat 8/9 (`l8:`)
Sensor metadata for the assigned Landsat image. Present in `land_l8` only.
| Column | Type | Description |
|--------|------|-------------|
| `l8:id_gee` | string | Google Earth Engine image ID. Use this to fetch the actual imagery. |
| `l8:product_id` | string | USGS product identifier. |
| `l8:spacecraft` | string | Spacecraft name (Landsat 8 or Landsat 9). |
| `l8:collection_number` | uint8 | USGS Collection number. |
| `l8:collection_category` | string | Collection category (T1, T2, RT). |
| `l8:processing_software` | string | Processing software version. |
| `l8:wrs_path` | uint16 | WRS-2 path number. |
| `l8:wrs_row` | uint16 | WRS-2 row number. |
| `l8:cloud_cover` | float | Scene cloud cover percentage. |
| `l8:sun_azimuth` | float | Sun azimuth angle (degrees). |
| `l8:sun_elevation` | float | Sun elevation angle (degrees). |
| `l8:earth_sun_distance` | float | Earth-Sun distance (astronomical units). |
| `l8:image_quality_oli` | uint8 | OLI image quality score. |
| `l8:roll_angle` | float | Spacecraft roll angle (degrees). |
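The two sensors describe the sun's position differently: `s2:mean_solar_zenith` is a zenith angle, while `l8:sun_elevation` is an elevation angle. The two are complementary (zenith = 90° − elevation), so a small conversion puts them on the same footing. The function names below are illustrative only.

```python
import math


def elevation_to_zenith(sun_elevation_deg):
    """Convert a sun elevation angle to a solar zenith angle (degrees)."""
    return 90.0 - sun_elevation_deg


def cos_solar_zenith(sun_elevation_deg):
    """Cosine of the solar zenith angle, computed from the sun elevation."""
    return math.cos(math.radians(elevation_to_zenith(sun_elevation_deg)))


print(elevation_to_zenith(60.0))  # zenith angle in degrees
print(cos_solar_zenith(30.0))     # cosine for a zenith angle of 60 degrees
```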
### Terrain (`terrain:`)
| Column | Type | Range | Description |
|--------|------|-------|-------------|
| `terrain:elevation` | float | ~-420 to 8,849 (m) | Mean elevation in meters from the [Copernicus GLO-30 DEM](https://gee-community-catalog.org/projects/glo30/), a 30 m resolution Digital Surface Model derived from TanDEM-X radar satellite data (2011 to 2015). Includes buildings, infrastructure, and vegetation. Uses the EGM2008 vertical datum. |
### Climate (`climate:`)
| Column | Type | Range | Description |
|--------|------|-------|-------------|
| `climate:precipitation` | float | 0+ (mm/year) | Mean annual precipitation estimated from [GPM](https://gpm.nasa.gov/) (Global Precipitation Measurement) satellite data, aggregated as a long-term annual mean. |
| `climate:temperature` | float | ~-40 to 50 (°C) | Mean annual land surface temperature estimated from [MODIS LST](https://developers.google.com/earth-engine/datasets/catalog/MODIS_061_MOD11A1) satellite data, aggregated as a long-term annual mean. |
### Soil (`soil:`)
Surface-layer soil properties from the [OpenLandMap](https://openlandmap.org/) dataset, derived from machine learning predictions on global soil survey data at 250 m resolution.
| Column | Type | Range | Description |
|--------|------|-------|-------------|
| `soil:clay` | float | 0 to 100 (%) | Clay content weight fraction at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_CLAY-WFRACTION_USDA-3A1A1A_M_v02). |
| `soil:sand` | float | 0 to 100 (%) | Sand content weight fraction at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_SAND-WFRACTION_USDA-3A1A1A_M_v02). |
| `soil:carbon` | float | 0+ (g/kg) | Soil organic carbon content at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_ORGANIC-CARBON_USDA-6A1C_M_v02). |
| `soil:bulk_density` | float | 0+ (kg/m³) | Fine-earth bulk density at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_BULKDENS-FINEEARTH_USDA-4A1H_M_v02). |
| `soil:ph` | float | ~3 to 10 | Soil pH in water at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_PH-H2O_USDA-4C1A2A_M_v02). |
### Socioeconomic (`socio:`)
| Column | Type | Range | Description |
|--------|------|-------|-------------|
| `socio:gdp` | float | 0+ (USD) | GDP per capita at purchasing power parity (PPP, constant 2021 USD) for the year 2022. From the [Kummu et al. (2025)](https://doi.org/10.1038/s41597-025-04487-x) gridded dataset, downscaled to admin-2 level (43,501 units) at 5 arc-min resolution. [GEE catalog](https://gee-community-catalog.org/projects/gridded_gdp_hdi/). |
| `socio:population` | float | 0+ (people) | Estimated number of people per grid cell from the Meta [High Resolution Settlement Layer](https://dataforgood.facebook.com/dfg/tools/high-resolution-population-density-maps) (HRSL). Uses satellite imagery and census data at ~30 m resolution. |
| `socio:human_modification` | float | 0.0 to 1.0 | Cumulative degree of human modification of terrestrial ecosystems from the [Global Human Modification v3](https://doi.org/10.1038/s41597-025-04892-2) (Theobald et al. 2025). Combines the spatial footprint and intensity of 13 stressors across five categories: settlement, agriculture, transportation, mining/energy, and electrical infrastructure. 0 = no modification, 1 = fully modified. 300 m resolution. [GEE catalog](https://gee-community-catalog.org/projects/ghm-v3/). |
| `socio:cisi` | float | 0.0 to 1.0 | [Critical Infrastructure Spatial Index](https://doi.org/10.1038/s41597-022-01218-4) (Nirandjan et al. 2022). Aggregates OpenStreetMap data on 39 types of critical infrastructure across seven systems: transportation, energy, telecommunication, waste, water, education, and health. 0 = no infrastructure, 1 = highest density. 0.10° resolution. [GEE catalog](https://gee-community-catalog.org/projects/cisi/). |
### Administrative (`admin:`)
Human-readable administrative boundary names resolved from rasterized boundary datasets.
| Column | Type | Description |
|--------|------|-------------|
| `admin:country` | string | Country name. Tiles over ocean/lakes are labeled `Ocean/Sea/Lakes`. |
| `admin:state` | string | State or province name. |
| `admin:district` | string | District or county name. |
### Other
| Column | Type | Description |
|--------|------|-------------|
| `geometry` | binary (WKB) | Tile geometry. All files include GeoParquet metadata for spatial queries. |
| `split` | string | ELLIOT split assignment: `monotemporal` or `temporal`. Present in `elliot.parquet` only. |
## Files
| File | Rows | Columns | Size | Description |
|------|-----:|--------:|-----:|-------------|
| `global.parquet` | 5,055,204 | 26 | 146 MB | Every 10 km tile on Earth. The complete grid with all enrichment columns. |
| `land.parquet` | 2,767,104 | 26 | 91 MB | Tiles covered by land-observing sensors (Sentinel-2 and Landsat). Same schema as global. |
| `land_s2.parquet` | 2,547,253 | 34 | 127 MB | Land tiles with a Sentinel-2 image assigned. Adds `stac:` and `s2:` sensor metadata. |
| `land_l8.parquet` | 2,255,537 | 38 | 97 MB | Land tiles with a Landsat 8/9 image assigned. Adds `stac:` and `l8:` sensor metadata. |
| `elliot.parquet` | 279,166 | 27 | 14 MB | ELLIOT subset with monotemporal and temporal split assignments. Same enrichment as global plus `split` column. |
### Namespace availability per file
| Namespace | global | land | land_s2 | land_l8 | elliot |
|-----------|:------:|:----:|:-------:|:-------:|:------:|
| `majortom:` | ✓ | ✓ | | | ✓ |
| `stac:` | | | ✓ | ✓ | |
| `s2:` | | | ✓ | | |
| `l8:` | | | | ✓ | |
| `terrain:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `climate:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `soil:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `socio:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `admin:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `split` | | | | | ✓ |
| `geometry` | ✓ | ✓ | ✓ | ✓ | ✓ |
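The availability table can be encoded as a lookup to choose a file for a given set of columns. The sketch below transcribes the table above by hand; the `NAMESPACES` mapping and the `files_with` helper are hypothetical conveniences, not something shipped with the dataset.

```python
# Enrichment namespaces present in every file, per the table above.
COMMON = {"terrain", "climate", "soil", "socio", "admin", "geometry"}

# File name (without .parquet) -> namespaces it contains.
NAMESPACES = {
    "global": COMMON | {"majortom"},
    "land": COMMON | {"majortom"},
    "land_s2": COMMON | {"stac", "s2"},
    "land_l8": COMMON | {"stac", "l8"},
    "elliot": COMMON | {"majortom", "split"},
}


def files_with(*required):
    """Return the files that contain every requested namespace."""
    needed = set(required)
    return [name for name, present in NAMESPACES.items() if needed <= present]


print(files_with("s2", "terrain"))
print(files_with("majortom", "soil"))
print(files_with("s2", "l8"))
```

An empty result, as for `s2` together with `l8`, means no single file carries both namespaces and a join across files would be needed.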
## Quick Start
### DuckDB
```sql
INSTALL spatial;
LOAD spatial;
-- Count tiles per country in South America
SELECT "admin:country", COUNT(*) as n_tiles
FROM 'https://data.source.coop/majortom/index/land_s2.parquet'
WHERE "admin:country" IN ('Peru', 'Brazil', 'Colombia', 'Chile', 'Argentina')
GROUP BY "admin:country"
ORDER BY n_tiles DESC;
-- Find high-elevation, arid Sentinel-2 tiles
SELECT id, "s2:id_gee", "terrain:elevation", "climate:precipitation"
FROM 'https://data.source.coop/majortom/index/land_s2.parquet'
WHERE "terrain:elevation" > 3000
AND "climate:precipitation" < 200
LIMIT 20;
```
### Pandas
```python
import pandas as pd
# Load land tiles with Sentinel-2 metadata
url = "https://data.source.coop/majortom/index/land_s2.parquet"
df = pd.read_parquet(url)
# Filter by country
peru = df[df["admin:country"] == "Peru"]
print(f"Peru: {len(peru):,} tiles")
# Get ELLIOT splits
elliot = pd.read_parquet(
    "https://data.source.coop/majortom/index/elliot.parquet"
)
print(elliot["split"].value_counts())
```
## ELLIOT Splits
The `elliot.parquet` file contains 279,166 tiles selected for the [ELLIOT project](https://elliot-ai.eu/) multi-temporal dataset extension. Tile locations were sampled using hierarchical spherical k-means (530 × 528 = 279,840 clusters) over [AlphaEarth Foundation](https://source.coop/asterisklabs/alphaearth) embeddings to ensure global environmental diversity.
The `split` column defines two subsets:
- **Monotemporal** (250,000 tiles). One cloud-free image per sensor per location. Designed for tasks where spatial coverage matters more than temporal depth: land cover classification, feature extraction, or pretraining foundation models on diverse global scenes.
- **Temporal** (29,166 tiles). Multiple observations per location across time. Designed for tasks that require temporal context: change detection, phenology tracking, seasonal compositing, or training models that learn from multi-temporal sequences. This subset is further divided into monthly cadence (12,500 tiles × 12 timesteps) and five-daily cadence (16,666 tiles × 6 timesteps).
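The split sizes quoted above can be cross-checked with a few lines of arithmetic. The tile counts and cadences are taken from the text; the observation totals assume exactly one observation per tile per timestep for a single sensor, which the description does not state explicitly.

```python
MONOTEMPORAL_TILES = 250_000
MONTHLY_TILES, MONTHLY_STEPS = 12_500, 12
FIVE_DAILY_TILES, FIVE_DAILY_STEPS = 16_666, 6

# Tile counts per split and overall.
temporal_tiles = MONTHLY_TILES + FIVE_DAILY_TILES
total_tiles = MONOTEMPORAL_TILES + temporal_tiles

# Observations in the temporal split (assumption: one per tile per timestep).
temporal_observations = (
    MONTHLY_TILES * MONTHLY_STEPS + FIVE_DAILY_TILES * FIVE_DAILY_STEPS
)

# Cluster count from the hierarchical k-means sampling step.
clusters = 530 * 528

print(f"temporal tiles:        {temporal_tiles:,}")
print(f"total tiles:           {total_tiles:,}")
print(f"temporal observations: {temporal_observations:,}")
print(f"clusters:              {clusters:,}")
```

The totals match the 29,166 and 279,166 figures in the text. The cluster count is 674 larger than the number of tiles in the file; the description does not explain the difference.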
## License
This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).
## Citation
```bibtex
@inproceedings{Francis2024MajorTOM,
author = {Francis, Alistair and Czerkawski, Mikolaj},
title = {Major TOM: Expandable Datasets for Earth Observation},
booktitle = {IGARSS 2024 - IEEE International Geoscience and Remote Sensing Symposium},
year = {2024},
pages = {2935--2940},
doi = {10.1109/IGARSS53475.2024.10640760}
}
```
## Acknowledgments
This work was supported by the [ELLIOT project](https://elliot-ai.eu/), funded by the European Union under grant agreement No. 101214398. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union.
<p align="center">
<a href="images/elliot.png"><img src="images/elliot.png" alt="ELLIOT" height="60"></a>
<a href="images/asterisk.png"><img src="images/asterisk.png" alt="Asterisk Labs" height="60"></a>
</p>
Provider:
Major-TOM



