
AGBD

ModelScope community | Updated 2025-12-05 | Indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/prs-eth/AGBD
Resource description:
# 🌲 AGBD: A Global-scale Biomass Dataset 🌳

Authors: Ghjulia Sialelli ([gsialelli@ethz.ch](mailto:gsialelli@ethz.ch)), Torben Peters, Jan Wegner, Konrad Schindler

Paper: https://huggingface.co/papers/2406.04928

## 🆕 Updates

* The dataset was last modified on **Feb. 26th, 2025**
* See the [changelog](changelog.md) for more information about what was updated!

## 🚀 Quickstart

To get started quickly with this dataset, use the following code snippet:

⚠️ HuggingFace does not support loading scripts anymore. We are in the process of migrating to a new supported format. Thank you for your patience. In the meantime, please reach out to gsialelli@ethz.ch for dataset access.

```python
# Install the datasets library if you haven't already
# (notebook syntax; in a shell, run `pip install datasets` instead)
!pip install datasets

# Import necessary modules
from datasets import load_dataset

# Load the dataset as a stream; split options: "train", "validation", "test"
dataset = load_dataset('prs-eth/AGBD', trust_remote_code=True, streaming=True)["train"]

# Iterate over the dataset
for sample in dataset:
    features, label = sample['input'], sample['label']
```

This code will load the dataset as an `IterableDataset`. You can find more information on how to work with `IterableDataset` objects in the [Hugging Face documentation](https://huggingface.co/docs/datasets/access#iterabledataset).

---

## 📊 Dataset Overview

Each sample in the dataset contains a **pair of pre-cropped images** along with their corresponding **biomass labels**. For additional resources, including links to the preprocessed uncropped data, please visit the [project page on GitHub](https://github.com/ghjuliasialelli/AGBD/).

### ⚙️ Load Dataset Options

The `load_dataset()` function provides the following configuration options:

- **`norm_strat`** (str) : `{'pct', 'mean_std', 'none'}` (default = `'pct'`)

  The strategy used to process the input features. Valid options are: `'pct'`, which applies min-max scaling with the 1st and 99th percentiles of the data; `'mean_std'`, which applies Z-score normalization; and `'none'`, which returns the unprocessed data.

- **`encode_strat`** (str) : `{'sin_cos', 'onehot', 'cat2vec', 'none'}` (default = `'sin_cos'`)

  The encoding strategy to apply to the land classification (LC) data. Valid options are: `'onehot'`, one-hot encoding; `'sin_cos'`, sine-cosine encoding; `'cat2vec'`, cat2vec transformation based on embeddings pre-computed on the train set; and `'none'`, which applies no encoding.

- **`input_features`** (dict)

  The features to be included in the data, the default values being:

  ```
  {'S2_bands': ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12'],
   'S2_dates': False, 'lat_lon': True, 'GEDI_dates': False, 'ALOS': True, 'CH': True, 'LC': True,
   'DEM': True, 'topo': False}
  ```

- **`additional_features`** (list) (default = `[]`)

  A list of additional features the dataset should include. *Refer to the [documentation below](#add-feat-anchor) for more details.* Possible values are:

  ```
  ['s2_num_days', 'gedi_num_days', 'lat', 'lon', 'agbd_se', 'elev_lowes', 'leaf_off_f', 'pft_class',
   'region_cla', 'rh98', 'sensitivity', 'solar_elev', 'urban_prop']
  ```

  This metadata can later be accessed as follows:

  ```
  from datasets import load_dataset

  dataset = load_dataset('AGBD.py', trust_remote_code=True, streaming=True)
  for sample in dataset['train']:
      lat = sample['lat']
      break
  ```

- **`patch_size`** (int) (default = `15`)

  The size of the returned patch (in pixels). The maximum value is **25 pixels**, which corresponds to **250 meters**.

---

### 🖼️ Features Details

Each sample consists of a varying number of channels, based on the `input_features` and `encode_strat` options passed to the `load_dataset()` function. The channels are organized as follows:

| Feature | Channels | Included by default? | Description |
| --- | --- | --- | --- |
| **Sentinel-2 bands** | `B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12` | Y | Sentinel-2 bands, in Surface Reflectance values. |
| **Sentinel-2 dates** | `s2_num_days, s2_doy_cos, s2_doy_sin` | N | Date of acquisition of the S2 image (in number of days since the beginning of the GEDI mission); sine-cosine encoding of the day of year (DOY). |
| **Geographical coordinates** | `lat_cos, lat_sin, lon_cos, lon_sin` | Y | Sine-cosine encoding of the latitude and longitude. |
| **GEDI dates** | `gedi_num_days, gedi_doy_cos, gedi_doy_sin` | N | Date of acquisition of the GEDI footprint (in number of days since the beginning of the GEDI mission); sine-cosine encoding of the day of year (DOY). |
| **ALOS PALSAR-2 bands** | `HH, HV` | Y | ALOS PALSAR-2 bands, gamma-naught values in dB. |
| **Canopy Height** | `ch, ch_std` | Y | Canopy height from Lang et al. and associated standard deviation. |
| **Land Cover Information** | `lc_encoding*, lc_prob` | Y | Encoding of the land class, and classification probability (as a fraction between 0 and 1). |
| **Topography** | `slope, aspect_cos, aspect_sin` | N | Slope (as a fraction between 0 and 1); sine-cosine encoded aspect of the slope. |
| **Digital Elevation Model (DEM)** | `dem` | Y | Elevation (in meters). |

This corresponds to the following value for `input_features`:

```
{'S2_bands': ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12'],
 'S2_dates': False, 'lat_lon': True, 'GEDI_dates': False, 'ALOS': True, 'CH': True, 'LC': True,
 'DEM': True, 'topo': False}
```

Regarding `lc_encoding*`, the number of channels follows this convention:

- `sin_cos` (default) : 2 channels
- `cat2vec` : 5 channels
- `onehot` : 14 channels
- `none` : 1 channel

Should you get stuck, you can debug the number of channels using the `compute_num_features()` function in [AGBD.py](AGBD.py).

In summary, the channels are structured as follows:

```plaintext
(Sentinel-2 bands) | (Sentinel-2 dates) | (Geographical coordinates) | (GEDI dates) | (ALOS PALSAR-2 bands) | (Canopy Height) | (Land Cover Information) | (Topography) | DEM
```

---

### ➕ Additional Features <a name="add-feat-anchor"></a>

You can include a list of additional features from the options below in your dataset configuration:

- **`"agbd_se"` - AGBD Standard Error**: The uncertainty estimate associated with the aboveground biomass density prediction for each GEDI footprint.
- **`"elev_lowes"` - Elevation**: The height above sea level at the location of the GEDI footprint.
- **`"leaf_off_f"` - Leaf-Off Flag**: Indicates whether the measurement was taken during the leaf-off season, which can impact canopy structure data.
- **`"pft_class"` - Plant Functional Type (PFT) Class**: Categorization of the vegetation type (e.g., deciduous broadleaf, evergreen needleleaf).
- **`"region_cla"` - Region Class**: The geographical area where the footprint is located (e.g., North America, South Asia).
- **`"rh98"` - RH98 (Relative Height at 98%)**: The height at which 98% of the returned laser energy is reflected, a key measure of canopy height.
- **`"sensitivity"` - Sensitivity**: The proportion of laser pulse energy reflected back to the sensor, providing insight into vegetation density and structure.
- **`"solar_elev"` - Solar Elevation**: The angle of the sun above the horizon at the time of measurement, which can affect data quality.
- **`"urban_prop"` - Urban Proportion**: The percentage of the footprint area that is urbanized, helping to filter or adjust biomass estimates in mixed landscapes.
- **`"gedi_num_days"` - Date of GEDI Footprints**: The specific date on which each GEDI footprint was captured, adding temporal context to the measurements.
- **`"s2_num_days"` - Date of Sentinel-2 Image**: The specific date on which each Sentinel-2 image was captured, ensuring temporal alignment with GEDI data.
- **`"lat"` - Latitude**: Latitude of the central pixel.
- **`"lon"` - Longitude**: Longitude of the central pixel.
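Because the channel count depends on both `input_features` and `encode_strat`, it can help to tally it by hand before building a model. The sketch below re-derives the count from the "Features Details" table and the `lc_encoding*` convention only. It is not the `compute_num_features()` function from AGBD.py, whose behavior may differ; `count_channels`, `GROUP_CHANNELS`, and `LC_ENCODING_CHANNELS` are hypothetical names introduced here for illustration.

```python
# Channels contributed by each on/off feature group, as listed in the table.
# Illustrative only: these constants are read off the README, not AGBD.py.
GROUP_CHANNELS = {
    "S2_dates": 3,    # s2_num_days, s2_doy_cos, s2_doy_sin
    "lat_lon": 4,     # lat_cos, lat_sin, lon_cos, lon_sin
    "GEDI_dates": 3,  # gedi_num_days, gedi_doy_cos, gedi_doy_sin
    "ALOS": 2,        # HH, HV
    "CH": 2,          # ch, ch_std
    "topo": 3,        # slope, aspect_cos, aspect_sin
    "DEM": 1,         # dem
}

# Width of the land-cover encoding for each `encode_strat` value.
LC_ENCODING_CHANNELS = {"sin_cos": 2, "cat2vec": 5, "onehot": 14, "none": 1}

DEFAULT_INPUT_FEATURES = {
    "S2_bands": ["B01", "B02", "B03", "B04", "B05", "B06",
                 "B07", "B08", "B8A", "B09", "B11", "B12"],
    "S2_dates": False, "lat_lon": True, "GEDI_dates": False, "ALOS": True,
    "CH": True, "LC": True, "DEM": True, "topo": False,
}


def count_channels(input_features, encode_strat="sin_cos"):
    """Tally the expected number of input channels for a configuration."""
    # One channel per selected Sentinel-2 band.
    total = len(input_features.get("S2_bands", []))
    # Fixed-width groups that are switched on.
    for group, width in GROUP_CHANNELS.items():
        if input_features.get(group, False):
            total += width
    # Land cover: encoding channels plus one channel for lc_prob.
    if input_features.get("LC", False):
        total += LC_ENCODING_CHANNELS[encode_strat] + 1
    return total


print(count_channels(DEFAULT_INPUT_FEATURES))            # 24
print(count_channels(DEFAULT_INPUT_FEATURES, "onehot"))  # 36
```

Under these assumptions the default configuration yields 24 channels (12 + 4 + 2 + 2 + 3 + 1). If the loader reports a different number, trust the loader and use `compute_num_features()` to find the discrepancy.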

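The two `norm_strat` strategies described under "Load Dataset Options" can be written out as plain arithmetic. The following is a minimal sketch, assuming the per-feature statistics are pre-computed and passed in a dictionary; the `normalize` function and the `p1`, `p99`, `mean`, and `std` keys are hypothetical, and the README does not say whether the loader clips values that fall outside the percentile range, so this sketch does not clip.

```python
def normalize(values, norm_strat="pct", stats=None):
    """Apply one of the README's normalization strategies to a list of floats.

    `stats` holds pre-computed statistics for the feature (hypothetical keys):
    'p1' and 'p99' for 'pct', 'mean' and 'std' for 'mean_std'.
    """
    if norm_strat == "none":
        # Return the data unchanged.
        return list(values)
    if norm_strat == "pct":
        # Min-max scaling, using the 1st and 99th percentiles as min and max.
        low, high = stats["p1"], stats["p99"]
        return [(v - low) / (high - low) for v in values]
    if norm_strat == "mean_std":
        # Z-score normalization.
        return [(v - stats["mean"]) / stats["std"] for v in values]
    raise ValueError(f"unknown norm_strat: {norm_strat!r}")


print(normalize([0.0, 50.0, 100.0], "pct", {"p1": 0.0, "p99": 100.0}))
# [0.0, 0.5, 1.0]
print(normalize([1.0, 3.0], "mean_std", {"mean": 2.0, "std": 1.0}))
# [-1.0, 1.0]
```

With `'pct'`, values below the 1st percentile map below 0 and values above the 99th map above 1, which is why a real pipeline may choose to clip afterwards.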
Provider:
maas
Created:
2025-05-19
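Background overview
AGBD is a global biomass dataset containing pre-cropped satellite images with biomass labels. It supports multiple feature configurations and metadata options, and is suited to remote sensing research and ecological analysis.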
The summary above was collected and generated by 遇见数据集.