
Major-TOM/index

Source: Hugging Face. Updated: 2026-04-10. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Major-TOM/index
Description:
# Major TOM: Index

The Major TOM Index is a global metadata catalog for the [Major TOM](https://github.com/ESA-PhiLab/Major-TOM) grid at 10 km resolution. It provides a single entry point to discover, filter, and select tiles across sensors, locations, and time without downloading any imagery.

The index covers over 5 million tiles spanning the entire Earth. Each tile corresponds to a 1056 × 1056 px patch (10.56 × 10.56 km) aligned to Sentinel-2 MGRS tiles at 10 m resolution. Every tile is enriched with terrain, climate, soil, socioeconomic, and administrative attributes derived from public Earth Engine datasets.

**What can you do with this index?**

- **Find tiles by location.** Filter by country, state, MGRS tile code, or bounding box using the GeoParquet geometry column.
- **Select tiles by environmental criteria.** Want arid, high-elevation tiles? Filter by `climate:precipitation < 200` and `terrain:elevation > 3000`.
- **Stratify sampling for training sets.** Use the enrichment columns to build geographically and environmentally balanced splits for foundation model pretraining.
- **Link to imagery.** The `land_s2` and `land_l8` files include sensor-specific image IDs (`s2:id_gee`, `l8:id_gee`) that point directly to the source products in Google Earth Engine.
- **Use the ELLIOT splits.** The `elliot.parquet` file provides pre-built monotemporal and temporal splits designed for multi-sensor, multi-temporal EO research.

All files are self-contained GeoParquet with ZSTD compression, sorted by `majortom:code_1000km` → `majortom:code_100km` → `id` for efficient spatial predicate pushdown.

## Schema

Columns are organized into namespaces. Each namespace groups related attributes.

### Grid (`majortom:`)

Tile identity and spatial reference within the Major TOM grid system.

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique tile identifier (e.g. `MT10_770U_395R`). |
| `majortom:code_100km` | string | Parent 100 km grid cell. Used for spatial grouping. |
| `majortom:code_1000km` | string | Parent 1000 km grid cell. Used for coarse-level partitioning. |
| `majortom:crs` | string | Native UTM CRS of the tile (e.g. `EPSG:32647`). |
| `majortom:mgrs_tile` | string | MGRS tile code (e.g. `47WNS`). Links to Sentinel-2 tiling grid. |
| `majortom:mgrs_n` | uint8 | Number of overlapping MGRS tiles (1 after deduplication). |
| `majortom:mgrs_candidates` | list\<string\> | All candidate MGRS tiles before deduplication. |
| `majortom:footprint_pct` | float | Percentage of tile covered by the assigned MGRS tile. |
| `majortom:geotransform` | list\<int32\> | Snapped affine geotransform [originX, scaleX, shearX, originY, shearY, scaleY]. |
| `majortom:geotransform_raw` | list\<double\> | Original (unsnapped) affine geotransform. |

### STAC (`stac:`)

Spatial and temporal reference following STAC conventions. Present in `land_s2` and `land_l8` only, where it replaces the `majortom:` grid columns.

| Column | Type | Description |
|--------|------|-------------|
| `stac:crs` | string | Coordinate reference system. |
| `stac:geotransform` | list\<int64\> | Affine geotransform for the image patch. |
| `stac:tensor_shape` | list\<int32\> | Shape of the image tensor [bands, height, width]. |
| `stac:time_start` | int64 | Acquisition start time (Unix timestamp). |
| `stac:time_end` | int64 | Acquisition end time (Unix timestamp). |

### Sentinel-2 (`s2:`)

Sensor metadata for the assigned Sentinel-2 image. Present in `land_s2` only.

| Column | Type | Description |
|--------|------|-------------|
| `s2:id_gee` | string | Google Earth Engine image ID. Use this to fetch the actual imagery. |
| `s2:product_id` | string | ESA product identifier. |
| `s2:spacecraft` | string | Spacecraft name (Sentinel-2A or Sentinel-2B). |
| `s2:processing_baseline` | string | Processing baseline version. |
| `s2:orbit_number` | uint16 | Relative orbit number. |
| `s2:mean_solar_azimuth` | float | Mean solar azimuth angle, averaged across all bands and detectors (degrees). |
| `s2:mean_solar_zenith` | float | Mean solar zenith angle, averaged across all bands and detectors (degrees). |
| `s2:mean_view_azimuth` | float | Mean viewing azimuth angle from band B8 (degrees). |
| `s2:mean_view_zenith` | float | Mean viewing zenith angle from band B8 (degrees). |
| `s2:reflectance_conversion` | float | Reflectance conversion factor (U correction). |

> **Note on solar vs viewing angles.** The sun has a single position relative to the scene, so ESA provides one solar azimuth and one solar zenith averaged across all bands. Viewing angles are different: Sentinel-2 uses a pushbroom sensor where each spectral band has its own detector array in the focal plane, each observing from a slightly different angle. That is why GEE provides per-band viewing angles (`MEAN_INCIDENCE_*_ANGLE_B1` through `_B12`). We use band B8 (NIR, 10 m) as the reference because it is at native 10 m resolution and sits near the center of the focal plane, making it a representative proxy for the viewing geometry of the 10 m and 20 m bands.

### Landsat 8/9 (`l8:`)

Sensor metadata for the assigned Landsat image. Present in `land_l8` only.

| Column | Type | Description |
|--------|------|-------------|
| `l8:id_gee` | string | Google Earth Engine image ID. Use this to fetch the actual imagery. |
| `l8:product_id` | string | USGS product identifier. |
| `l8:spacecraft` | string | Spacecraft name (Landsat 8 or Landsat 9). |
| `l8:collection_number` | uint8 | USGS Collection number. |
| `l8:collection_category` | string | Collection category (T1, T2, RT). |
| `l8:processing_software` | string | Processing software version. |
| `l8:wrs_path` | uint16 | WRS-2 path number. |
| `l8:wrs_row` | uint16 | WRS-2 row number. |
| `l8:cloud_cover` | float | Scene cloud cover percentage. |
| `l8:sun_azimuth` | float | Sun azimuth angle (degrees). |
| `l8:sun_elevation` | float | Sun elevation angle (degrees). |
| `l8:earth_sun_distance` | float | Earth-Sun distance (astronomical units). |
| `l8:image_quality_oli` | uint8 | OLI image quality score. |
| `l8:roll_angle` | float | Spacecraft roll angle (degrees). |

### Terrain (`terrain:`)

| Column | Type | Range | Description |
|--------|------|-------|-------------|
| `terrain:elevation` | float | ~-420 to 8,849 (m) | Mean elevation in meters from the [Copernicus GLO-30 DEM](https://gee-community-catalog.org/projects/glo30/), a 30 m resolution Digital Surface Model derived from TanDEM-X radar satellite data (2011 to 2015). Includes buildings, infrastructure, and vegetation. Uses the EGM2008 vertical datum. |

### Climate (`climate:`)

| Column | Type | Range | Description |
|--------|------|-------|-------------|
| `climate:precipitation` | float | 0+ (mm/year) | Mean annual precipitation estimated from [GPM](https://gpm.nasa.gov/) (Global Precipitation Measurement) satellite data, aggregated as a long-term annual mean. |
| `climate:temperature` | float | ~-40 to 50 (°C) | Mean annual land surface temperature estimated from [MODIS LST](https://developers.google.com/earth-engine/datasets/catalog/MODIS_061_MOD11A1) satellite data, aggregated as a long-term annual mean. |

### Soil (`soil:`)

Surface-layer soil properties from the [OpenLandMap](https://openlandmap.org/) dataset, derived from machine learning predictions on global soil survey data at 250 m resolution.

| Column | Type | Range | Description |
|--------|------|-------|-------------|
| `soil:clay` | float | 0 to 100 (%) | Clay content weight fraction at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_CLAY-WFRACTION_USDA-3A1A1A_M_v02). |
| `soil:sand` | float | 0 to 100 (%) | Sand content weight fraction at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_SAND-WFRACTION_USDA-3A1A1A_M_v02). |
| `soil:carbon` | float | 0+ (g/kg) | Soil organic carbon content at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_ORGANIC-CARBON_USDA-6A1C_M_v02). |
| `soil:bulk_density` | float | 0+ (kg/m³) | Fine-earth bulk density at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_BULKDENS-FINEEARTH_USDA-4A1H_M_v02). |
| `soil:ph` | float | ~3 to 10 | Soil pH in water at 0 cm depth. [Source](https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_PH-H2O_USDA-4C1A2A_M_v02). |

### Socioeconomic (`socio:`)

| Column | Type | Range | Description |
|--------|------|-------|-------------|
| `socio:gdp` | float | 0+ (USD) | GDP per capita at purchasing power parity (PPP, constant 2021 USD) for the year 2022. From the [Kummu et al. (2025)](https://doi.org/10.1038/s41597-025-04487-x) gridded dataset, downscaled to admin-2 level (43,501 units) at 5 arc-min resolution. [GEE catalog](https://gee-community-catalog.org/projects/gridded_gdp_hdi/). |
| `socio:population` | float | 0+ (people) | Estimated number of people per grid cell from the Meta [High Resolution Settlement Layer](https://dataforgood.facebook.com/dfg/tools/high-resolution-population-density-maps) (HRSL). Uses satellite imagery and census data at ~30 m resolution. |
| `socio:human_modification` | float | 0.0 to 1.0 | Cumulative degree of human modification of terrestrial ecosystems from the [Global Human Modification v3](https://doi.org/10.1038/s41597-025-04892-2) (Theobald et al. 2025). Combines the spatial footprint and intensity of 13 stressors across five categories: settlement, agriculture, transportation, mining/energy, and electrical infrastructure. 0 = no modification, 1 = fully modified. 300 m resolution. [GEE catalog](https://gee-community-catalog.org/projects/ghm-v3/). |
| `socio:cisi` | float | 0.0 to 1.0 | [Critical Infrastructure Spatial Index](https://doi.org/10.1038/s41597-022-01218-4) (Nirandjan et al. 2022). Aggregates OpenStreetMap data on 39 types of critical infrastructure across seven systems: transportation, energy, telecommunication, waste, water, education, and health. 0 = no infrastructure, 1 = highest density. 0.10° resolution. [GEE catalog](https://gee-community-catalog.org/projects/cisi/). |

### Administrative (`admin:`)

Human-readable administrative boundary names resolved from rasterized boundary datasets.

| Column | Type | Description |
|--------|------|-------------|
| `admin:country` | string | Country name. Tiles over ocean/lakes are labeled `Ocean/Sea/Lakes`. |
| `admin:state` | string | State or province name. |
| `admin:district` | string | District or county name. |

### Other

| Column | Type | Description |
|--------|------|-------------|
| `geometry` | binary (WKB) | Tile geometry. All files include GeoParquet metadata for spatial queries. |
| `split` | string | ELLIOT split assignment: `monotemporal` or `temporal`. Present in `elliot.parquet` only. |

## Files

| File | Rows | Columns | Size | Description |
|------|-----:|--------:|-----:|-------------|
| `global.parquet` | 5,055,204 | 26 | 146 MB | Every 10 km tile on Earth. The complete grid with all enrichment columns. |
| `land.parquet` | 2,767,104 | 26 | 91 MB | Tiles covered by land-observing sensors (Sentinel-2 and Landsat). Same schema as global. |
| `land_s2.parquet` | 2,547,253 | 34 | 127 MB | Land tiles with a Sentinel-2 image assigned. Adds `stac:` and `s2:` sensor metadata. |
| `land_l8.parquet` | 2,255,537 | 38 | 97 MB | Land tiles with a Landsat 8/9 image assigned. Adds `stac:` and `l8:` sensor metadata. |
| `elliot.parquet` | 279,166 | 27 | 14 MB | ELLIOT subset with monotemporal and temporal split assignments. Same enrichment as global plus `split` column. |

### Namespace availability per file

| Namespace | global | land | land_s2 | land_l8 | elliot |
|-----------|:------:|:----:|:-------:|:-------:|:------:|
| `majortom:` | ✓ | ✓ | | | ✓ |
| `stac:` | | | ✓ | ✓ | |
| `s2:` | | | ✓ | | |
| `l8:` | | | | ✓ | |
| `terrain:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `climate:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `soil:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `socio:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `admin:` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `split` | | | | | ✓ |
| `geometry` | ✓ | ✓ | ✓ | ✓ | ✓ |

## Quick Start

### DuckDB

```sql
INSTALL spatial;
LOAD spatial;

-- Count tiles per country in South America
SELECT "admin:country", COUNT(*) AS n_tiles
FROM 'https://data.source.coop/majortom/index/land_s2.parquet'
WHERE "admin:country" IN ('Peru', 'Brazil', 'Colombia', 'Chile', 'Argentina')
GROUP BY "admin:country"
ORDER BY n_tiles DESC;

-- Find high-elevation, arid Sentinel-2 tiles
SELECT id, "s2:id_gee", "terrain:elevation", "climate:precipitation"
FROM 'https://data.source.coop/majortom/index/land_s2.parquet'
WHERE "terrain:elevation" > 3000
  AND "climate:precipitation" < 200
LIMIT 20;
```

### Pandas

```python
import pandas as pd

# Load land tiles with Sentinel-2 metadata
url = "https://data.source.coop/majortom/index/land_s2.parquet"
df = pd.read_parquet(url)

# Filter by country
peru = df[df["admin:country"] == "Peru"]
print(f"Peru: {len(peru):,} tiles")

# Get ELLIOT splits
elliot = pd.read_parquet(
    "https://data.source.coop/majortom/index/elliot.parquet"
)
print(elliot["split"].value_counts())
```

## ELLIOT Splits

The `elliot.parquet` file contains 279,166 tiles selected for the [ELLIOT project](https://elliot-ai.eu/) multi-temporal dataset extension. Tile locations were sampled using hierarchical spherical k-means (530 × 528 = 279,840 clusters) over [AlphaEarth Foundation](https://source.coop/asterisklabs/alphaearth) embeddings to ensure global environmental diversity.

The `split` column defines two subsets:

- **Monotemporal** (250,000 tiles). One cloud-free image per sensor per location. Designed for tasks where spatial coverage matters more than temporal depth: land cover classification, feature extraction, or pretraining foundation models on diverse global scenes.
- **Temporal** (29,166 tiles). Multiple observations per location across time. Designed for tasks that require temporal context: change detection, phenology tracking, seasonal compositing, or training models that learn from multi-temporal sequences. This subset is further divided into monthly cadence (12,500 tiles × 12 timesteps) and five-daily cadence (16,666 tiles × 6 timesteps).

## License

This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).

## Citation

```bibtex
@inproceedings{Francis2024MajorTOM,
  author    = {Francis, Alistair and Czerkawski, Mikolaj},
  title     = {Major TOM: Expandable Datasets for Earth Observation},
  booktitle = {IGARSS 2024 - IEEE International Geoscience and Remote Sensing Symposium},
  year      = {2024},
  pages     = {2935--2940},
  doi       = {10.1109/IGARSS53475.2024.10640760}
}
```

## Acknowledgments

This work was supported by the [ELLIOT project](https://elliot-ai.eu/), funded by the European Union under grant agreement No. 101214398. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union.

<p align="center">
  <a href="images/elliot.png"><img src="images/elliot.png" alt="ELLIOT" height="60"></a>
  &nbsp;&nbsp;&nbsp;&nbsp;
  <a href="images/asterisk.png"><img src="images/asterisk.png" alt="Asterisk Labs" height="60"></a>
</p>
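
## Appendix: reading a geotransform (illustrative)

The Schema section gives the `majortom:geotransform` layout as `[originX, scaleX, shearX, originY, shearY, scaleY]`. As a minimal sketch of how that layout could be turned into a tile bounding box in the tile's native CRS, the snippet below applies the standard affine mapping to the four corners of a 1056 × 1056 px patch. The function names and the example values are hypothetical (they are not taken from the index), and the sketch assumes the origin refers to the outer corner of the top-left pixel.

```python
def pixel_to_map(gt, col, row):
    """Map a pixel-corner coordinate (col, row) to CRS coordinates (x, y).

    gt follows the layout given in the schema:
    [originX, scaleX, shearX, originY, shearY, scaleY].
    """
    origin_x, scale_x, shear_x, origin_y, shear_y, scale_y = gt
    x = origin_x + col * scale_x + row * shear_x
    y = origin_y + col * shear_y + row * scale_y
    return x, y


def tile_bounds(gt, width=1056, height=1056):
    """Return (min_x, min_y, max_x, max_y) of a patch in its native CRS."""
    corners = [pixel_to_map(gt, c, r) for c in (0, width) for r in (0, height)]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical geotransform: 10 m pixels, north-up, no shear.
example_gt = [500000, 10, 0, 7000000, 0, -10]
print(tile_bounds(example_gt))  # (500000, 6989440, 510560, 7000000)
```

With 10 m pixels the box spans 10,560 m on each side, which matches the 10.56 × 10.56 km patch size stated in the overview.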
Provider:
Major-TOM