alexroz/CarbonBench
Source: Hugging Face. Updated 2026-02-13; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/alexroz/CarbonBench
---
dataset_info:
- config_name: target_fluxes
  features:
  - name: TIMESTAMP
    dtype: date32
  - name: GPP_NT_VUT_USTAR50
    dtype: float64
  - name: RECO_NT_VUT_USTAR50
    dtype: float64
  - name: NEE_VUT_USTAR50
    dtype: float64
  - name: NEE_VUT_USTAR50_QC
    dtype: float64
  - name: site
    dtype: string
  - name: lat
    dtype: float64
  - name: lon
    dtype: float64
  - name: IGBP
    dtype: string
- config_name: MOD09GA
  features:
  - name: date
    dtype: date32
  - name: site
    dtype: string
  - name: sur_refl_b01
    dtype: float64
  - name: sur_refl_b02
    dtype: float64
  - name: sur_refl_b03
    dtype: float64
  - name: sur_refl_b04
    dtype: float64
  - name: sur_refl_b05
    dtype: float64
  - name: sur_refl_b06
    dtype: float64
  - name: sur_refl_b07
    dtype: float64
  - name: SensorZenith
    dtype: float64
  - name: SensorAzimuth
    dtype: float64
  - name: SolarZenith
    dtype: float64
  - name: SolarAzimuth
    dtype: float64
  - name: clouds
    dtype: float64
- config_name: ERA5
  features:
  - name: date
    dtype: date32
  - name: site
    dtype: string
configs:
- config_name: target_fluxes
  data_files: target_fluxes.parquet
- config_name: MOD09GA
  data_files: MOD09GA.parquet
- config_name: ERA5
  data_files: ERA5.parquet
license: mit
task_categories:
- tabular-regression
- time-series-forecasting
tags:
- carbon-fluxes
- eddy-covariance
- remote-sensing
- climate
- ecology
- zero-shot
- MODIS
- ERA5
- FLUXNET
size_categories:
- 1M<n<10M
---
# CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning
CarbonBench comprises over **1.3 million daily observations** from **573 eddy covariance flux tower sites** globally (2000–2024). It provides stratified evaluation protocols that explicitly test generalization across unseen vegetation types and climate regimes, a harmonized set of remote sensing and meteorological features, and reproducible baselines ranging from tree-based methods to domain-generalization architectures.
**Paper:** CarbonBench (KDD 2025)
**Code:** [github.com/alexxxroz/CarbonBench](https://github.com/alexxxroz/CarbonBench)
## Dataset Summary
| Property | Value |
|----------|-------|
| Daily observations | 1,405,813 |
| Flux tower sites | 573 |
| Date range | 2000–2024 |
| Source networks | FLUXNET2015, AmeriFlux, ICOS, JapanFlux |
| IGBP vegetation classes | 16 |
| Köppen climate classes | 5 (main) / 30 (detailed) |
## Data Files
| File | Description | Size |
|------|-------------|------|
| `target_fluxes.parquet` | Carbon flux targets + site metadata | ~43 MB |
| `MOD09GA.parquet` | MODIS MOD09GA surface reflectance features | ~474 MB |
| `ERA5.parquet` | ERA5-Land meteorological features | ~5 GB |
| `koppen_sites.json` | Site → Köppen climate classification mapping | ~11 KB |
| `feature_sets.json` | ERA5 feature set definitions (minimal/standard/full) | ~7 KB |
| `FLUXNET2015_Metadata.csv` | FLUXNET2015 site metadata | ~18 KB |
| `AmeriFlux_Metadata.tsv` | AmeriFlux site metadata | ~210 KB |
| `ICOS2025_Metadata.csv` | ICOS site metadata | ~3 KB |
## Prediction Targets
All targets are derived from eddy covariance measurements standardized under the ONEFlux methodology (units: gC m⁻² day⁻¹):
| Target | Column | Description |
|--------|--------|-------------|
| GPP | `GPP_NT_VUT_USTAR50` | Gross Primary Production |
| RECO | `RECO_NT_VUT_USTAR50` | Ecosystem Respiration |
| NEE | `NEE_VUT_USTAR50` | Net Ecosystem Exchange (NEE = −GPP + RECO) |
Each observation includes a continuous quality control flag: `NEE_VUT_USTAR50_QC` (0–1).
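
As a quick sanity check, the sign convention above can be verified on a loaded frame, and lower-quality days can be filtered out with the QC flag. The sketch below uses a tiny synthetic frame with the real column names; the flux values, the `QC_MIN = 0.8` cutoff, and the tolerance are illustrative assumptions, not values prescribed by CarbonBench.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for target_fluxes.parquet (all values are made up).
df = pd.DataFrame({
    "site": ["A", "A", "B"],
    "GPP_NT_VUT_USTAR50": [5.0, 2.0, 0.5],
    "RECO_NT_VUT_USTAR50": [3.0, 2.5, 1.0],
    "NEE_VUT_USTAR50": [-2.0, 0.5, 0.5],
    "NEE_VUT_USTAR50_QC": [1.0, 0.4, 0.9],
})

# NEE = -GPP + RECO, so a negative NEE means net carbon uptake.
expected_nee = df["RECO_NT_VUT_USTAR50"] - df["GPP_NT_VUT_USTAR50"]
identity_holds = np.isclose(df["NEE_VUT_USTAR50"], expected_nee, atol=1e-6)

# Keep only days whose QC flag meets a threshold (0.8 is an assumed cutoff).
QC_MIN = 0.8
high_quality = df[df["NEE_VUT_USTAR50_QC"] >= QC_MIN]
```

On real data the same two lines can be applied to the frame returned by `pd.read_parquet`; a stricter or looser cutoff trades sample size against measurement quality.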
## Features
### MODIS MOD09GA (12 features)
Seven surface reflectance bands (`sur_refl_b01`–`sur_refl_b07`), sensor/solar geometry (`SensorZenith`, `SensorAzimuth`, `SolarZenith`, `SolarAzimuth`), and cloud fraction (`clouds`).
### ERA5-Land (6 / 36 / 150 features)
Three configurable feature sets defined in `feature_sets.json`:
- **Minimal (6):** temperature, precipitation, radiation, evaporation, LAI (high & low vegetation)
- **Standard (36):** minimal + soil temperature/moisture (4 levels), wind, pressure, snow/albedo, radiation components, runoff
- **Full (150):** standard + lake variables, additional flux components, min/max daily variants
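
Because `ERA5.parquet` is about 5 GB, it can help to read only the columns of one feature set. The following is a minimal sketch under a stated assumption: that `feature_sets.json` maps each set name to a list of column names. The actual layout of the file may differ, and the helper names `era5_columns` and `load_era5_subset` as well as the example column names are hypothetical.

```python
import json
import pandas as pd

KEY_COLUMNS = ["date", "site"]  # key columns listed in the ERA5 config

def era5_columns(feature_sets: dict, name: str) -> list:
    """Return the key columns plus the columns of one feature set.

    Assumes feature_sets maps a set name to a list of column names;
    the real layout of feature_sets.json may differ.
    """
    if name not in feature_sets:
        raise KeyError(f"unknown feature set {name!r}; have {sorted(feature_sets)}")
    return KEY_COLUMNS + [c for c in feature_sets[name] if c not in KEY_COLUMNS]

def load_era5_subset(parquet_path: str, feature_sets_path: str, name: str) -> pd.DataFrame:
    """Read only the requested columns instead of the full ERA5 table."""
    with open(feature_sets_path) as fh:
        feature_sets = json.load(fh)
    return pd.read_parquet(parquet_path, columns=era5_columns(feature_sets, name))

# Hypothetical example of the assumed JSON layout (placeholder column names).
example_sets = {"minimal": ["temperature_2m", "total_precipitation_sum"]}
example_columns = era5_columns(example_sets, "minimal")
```

With the files downloaded to `data/`, a call such as `load_era5_subset("data/ERA5.parquet", "data/feature_sets.json", "minimal")` would then load just that subset, provided the JSON matches the assumed layout.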
### Site Metadata (5 features)
Latitude, longitude, IGBP vegetation type (16 classes), Köppen climate class (5 main / 30 detailed).
## Train-Test Splits
CarbonBench provides two complementary **site-holdout** splits for zero-shot evaluation (random state = 56):
- **IGBP-stratified:** partitioned by vegetation type. 80/20 for common classes (>10 sites), 50/50 for rare classes (≤10 sites).
- **Köppen-stratified:** partitioned by climate zone. Uniform 80/20 split across 5 main classes.
All splits are at the site level — train and test sites are mutually exclusive.
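
The idea of a stratified site-holdout split can be sketched in a few lines. This is an illustration of the rules described above, not the code that produced the official splits: the function name `site_holdout_split`, the rounding of per-class test counts, and the site IDs are assumptions, so the resulting partitions will not match those from `carbonbench.split_targets`.

```python
import random
from collections import defaultdict

def site_holdout_split(site_to_class, rare_max=10, common_test=0.2,
                       rare_test=0.5, seed=56):
    """Split sites (not rows) into disjoint train/test sets within each class."""
    by_class = defaultdict(list)
    for site, cls in site_to_class.items():
        by_class[cls].append(site)

    rng = random.Random(seed)
    train, test = set(), set()
    for cls in sorted(by_class):
        sites = sorted(by_class[cls])
        rng.shuffle(sites)
        # Rare classes (<= rare_max sites) are split 50/50, common ones 80/20.
        frac = rare_test if len(sites) <= rare_max else common_test
        n_test = round(len(sites) * frac)
        test.update(sites[:n_test])
        train.update(sites[n_test:])
    return train, test

# Hypothetical site IDs: 20 sites in a common class, 4 in a rare class.
sites = {f"ENF-{i:02d}": "ENF" for i in range(20)}
sites.update({f"WET-{i:02d}": "WET" for i in range(4)})
train_sites, test_sites = site_holdout_split(sites)
```

Passing `rare_max=0` gives a uniform 80/20 split per class, which corresponds to the Köppen-stratified variant. Rows are then assigned to train or test by their `site` value, so no site contributes observations to both sets.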
## Usage
### Download
```bash
pip install huggingface_hub
huggingface-cli download alexroz/CarbonBench --repo-type dataset --local-dir data
```
### With the `carbonbench` package
```python
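import carbonbench

# Target columns: GPP, RECO, and NEE
targets = ['GPP_NT_VUT_USTAR50', 'RECO_NT_VUT_USTAR50', 'NEE_VUT_USTAR50']
y = carbonbench.load_targets(targets, include_qc=True)  # keep the QC flag column
# Site-holdout split stratified by Köppen climate class
y_train, y_test = carbonbench.split_targets(y, split_type='Koppen')
modis = carbonbench.load_modis()       # MOD09GA features
era = carbonbench.load_era('minimal')  # 6-feature ERA5-Land set
# Join features to the targets and scale them; the scalers are returned as well
train, val, test, x_scaler, y_scaler = carbonbench.join_features(
    y_train, y_test, modis, era, scale=True
)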
```
### Direct loading with pandas
```python
import pandas as pd
targets = pd.read_parquet("data/target_fluxes.parquet")
modis = pd.read_parquet("data/MOD09GA.parquet")
era5 = pd.read_parquet("data/ERA5.parquet")
```
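
When loading the files directly, the three tables have to be joined by hand. Note that the targets use `TIMESTAMP` as the date column while the feature tables use `date`. The sketch below shows one way to align them on synthetic frames; the join types (`inner` for MODIS, `left` for ERA5), the values, and the `era5_feature` column name are illustrative assumptions and may differ from what `carbonbench.join_features` does.

```python
import pandas as pd

# Synthetic stand-ins for the three parquet files (values are made up).
targets = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "site": ["A", "A"],
    "NEE_VUT_USTAR50": [-1.2, -0.8],
})
modis = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "site": ["A", "A"],
    "sur_refl_b01": [0.05, 0.06],
})
era5 = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01"]),
    "site": ["A"],
    "era5_feature": [273.1],  # placeholder column name
})

# Align the date column name, then join on (site, date).
merged = (
    targets.rename(columns={"TIMESTAMP": "date"})
    .merge(modis, on=["site", "date"], how="inner")
    .merge(era5, on=["site", "date"], how="left")
)
```

Days without a matching ERA5 row keep their target and MODIS values and get `NaN` for the ERA5 columns, which makes missing coverage easy to spot before training.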
## Evaluation
All metrics are computed **per-site**, then reported as quantiles (25th, median, 75th percentile):
| Metric | Description |
|--------|-------------|
| R² | Coefficient of determination |
| RMSE | Root mean squared error (gC m⁻² day⁻¹) |
| nMAE | Mean absolute error normalized by site mean flux |
| RAE | Relative absolute error |
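
The per-site protocol can be sketched as follows: compute each metric within a site, then summarize the per-site values with quantiles. The formulas for nMAE (MAE divided by the absolute site mean) and RAE (sum of absolute errors divided by the sum of absolute deviations from the site mean) are common definitions assumed here; the exact definitions used by CarbonBench may differ. The column names `y_true` and `y_pred` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def site_metrics(group: pd.DataFrame) -> dict:
    """Compute the four metrics for the rows of a single site."""
    y = group["y_true"].to_numpy()
    p = group["y_pred"].to_numpy()
    abs_err = np.abs(y - p)
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean((y - p) ** 2))),
        "nMAE": abs_err.mean() / abs(y.mean()),                # assumed definition
        "RAE": abs_err.sum() / np.sum(np.abs(y - y.mean())),   # assumed definition
    }

# Synthetic predictions for two sites (values are made up).
preds = pd.DataFrame({
    "site": ["A"] * 3 + ["B"] * 3,
    "y_true": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "y_pred": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0],
})

# One row of metrics per site, then quantiles across sites.
per_site = pd.DataFrame.from_dict(
    {site: site_metrics(g) for site, g in preds.groupby("site")}, orient="index"
)
summary = per_site.quantile([0.25, 0.5, 0.75])
```

Reporting quantiles over sites rather than a pooled score keeps data-rich sites from dominating the result, which matters when test sites differ widely in record length.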
## Citation
<!-- ```bibtex
@inproceedings{rozanov2025carbonbench,
title={CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning},
author={Rozanov, Aleksei and Renganathan, Arvind and Zhang, Yimeng and Kumar, Vipin},
booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
year={2025}
}
``` -->
## License
MIT
Provider: alexroz



