
alexroz/CarbonBench

Source: Hugging Face. Updated 2026-02-13; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/alexroz/CarbonBench
---
dataset_info:
- config_name: target_fluxes
  features:
  - name: TIMESTAMP
    dtype: date32
  - name: GPP_NT_VUT_USTAR50
    dtype: float64
  - name: RECO_NT_VUT_USTAR50
    dtype: float64
  - name: NEE_VUT_USTAR50
    dtype: float64
  - name: NEE_VUT_USTAR50_QC
    dtype: float64
  - name: site
    dtype: string
  - name: lat
    dtype: float64
  - name: lon
    dtype: float64
  - name: IGBP
    dtype: string
- config_name: MOD09GA
  features:
  - name: date
    dtype: date32
  - name: site
    dtype: string
  - name: sur_refl_b01
    dtype: float64
  - name: sur_refl_b02
    dtype: float64
  - name: sur_refl_b03
    dtype: float64
  - name: sur_refl_b04
    dtype: float64
  - name: sur_refl_b05
    dtype: float64
  - name: sur_refl_b06
    dtype: float64
  - name: sur_refl_b07
    dtype: float64
  - name: SensorZenith
    dtype: float64
  - name: SensorAzimuth
    dtype: float64
  - name: SolarZenith
    dtype: float64
  - name: SolarAzimuth
    dtype: float64
  - name: clouds
    dtype: float64
- config_name: ERA5
  features:
  - name: date
    dtype: date32
  - name: site
    dtype: string
configs:
- config_name: target_fluxes
  data_files: target_fluxes.parquet
- config_name: MOD09GA
  data_files: MOD09GA.parquet
- config_name: ERA5
  data_files: ERA5.parquet
license: mit
task_categories:
- tabular-regression
- time-series-forecasting
tags:
- carbon-fluxes
- eddy-covariance
- remote-sensing
- climate
- ecology
- zero-shot
- MODIS
- ERA5
- FLUXNET
size_categories:
- 1M<n<10M
---

# CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning

CarbonBench comprises over **1.3 million daily observations** from **573 eddy covariance flux tower sites** globally (2000–2024). It provides stratified evaluation protocols that explicitly test generalization across unseen vegetation types and climate regimes, a harmonized set of remote sensing and meteorological features, and reproducible baselines ranging from tree-based methods to domain-generalization architectures.

**Paper:** [CarbonBench (KDD 2025)]()
**Code:** [github.com/alexxxroz/CarbonBench](https://github.com/alexxxroz/CarbonBench)

## Dataset Summary

| Property | Value |
|----------|-------|
| Daily observations | 1,405,813 |
| Flux tower sites | 573 |
| Date range | 2000–2024 |
| Source networks | FLUXNET2015, AmeriFlux, ICOS, JapanFlux |
| IGBP vegetation classes | 16 |
| Köppen climate classes | 5 (main) / 30 (detailed) |

## Data Files

| File | Description | Size |
|------|-------------|------|
| `target_fluxes.parquet` | Carbon flux targets + site metadata | ~43 MB |
| `MOD09GA.parquet` | MODIS MOD09GA surface reflectance features | ~474 MB |
| `ERA5.parquet` | ERA5-Land meteorological features | ~5 GB |
| `koppen_sites.json` | Site → Köppen climate classification mapping | ~11 KB |
| `feature_sets.json` | ERA5 feature set definitions (minimal/standard/full) | ~7 KB |
| `FLUXNET2015_Metadata.csv` | FLUXNET2015 site metadata | ~18 KB |
| `AmeriFlux_Metadata.tsv` | AmeriFlux site metadata | ~210 KB |
| `ICOS2025_Metadata.csv` | ICOS site metadata | ~3 KB |

## Prediction Targets

All targets are derived from eddy covariance measurements standardized under the ONEFlux methodology (units: gC m⁻² day⁻¹):

| Target | Column | Description |
|--------|--------|-------------|
| GPP | `GPP_NT_VUT_USTAR50` | Gross Primary Production |
| RECO | `RECO_NT_VUT_USTAR50` | Ecosystem Respiration |
| NEE | `NEE_VUT_USTAR50` | Net Ecosystem Exchange (NEE = −GPP + RECO) |

Each observation includes a continuous quality control flag: `NEE_VUT_USTAR50_QC` (0–1).

## Features

### MODIS MOD09GA (12 features)

Seven surface reflectance bands (`sur_refl_b01`–`sur_refl_b07`), sensor/solar geometry (`SensorZenith`, `SensorAzimuth`, `SolarZenith`, `SolarAzimuth`), and cloud fraction (`clouds`).

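The targets and MODIS features share a `site` column and a daily date (`TIMESTAMP` in the targets, `date` in MOD09GA), so the two tables can be aligned with an ordinary join. The following is a minimal sketch on invented toy rows, not the `carbonbench` package's own pipeline. The `QC_THRESHOLD` value is purely illustrative, and the filter assumes that higher `NEE_VUT_USTAR50_QC` values indicate better data quality.

```python
import pandas as pd

# Toy stand-ins for target_fluxes.parquet and MOD09GA.parquet.
# Column names follow the dataset card; all values are invented.
targets = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(["2020-06-01", "2020-06-02", "2020-06-01"]),
    "site": ["A", "A", "B"],
    "GPP_NT_VUT_USTAR50": [5.0, 4.0, 2.0],
    "RECO_NT_VUT_USTAR50": [3.0, 3.5, 1.5],
    "NEE_VUT_USTAR50": [-2.0, -0.5, -0.5],
    "NEE_VUT_USTAR50_QC": [1.0, 0.4, 0.9],
})
modis = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-01", "2020-06-02", "2020-06-01"]),
    "site": ["A", "A", "B"],
    "sur_refl_b01": [0.05, 0.06, 0.08],
    "sur_refl_b02": [0.40, 0.42, 0.30],
})

# Assumption: an illustrative cutoff, not a value prescribed by the dataset.
QC_THRESHOLD = 0.75

# Keep only days whose QC flag meets the threshold
# (assumes higher QC means better quality).
good = targets[targets["NEE_VUT_USTAR50_QC"] >= QC_THRESHOLD]

# Align each remaining target row with the MODIS row for the same site and day.
merged = good.merge(
    modis, left_on=["site", "TIMESTAMP"], right_on=["site", "date"], how="inner"
).drop(columns="date")

# Residual of the identity NEE = -GPP + RECO; near zero when the identity holds.
residual = merged["NEE_VUT_USTAR50"] - (
    merged["RECO_NT_VUT_USTAR50"] - merged["GPP_NT_VUT_USTAR50"]
)
```

With the real files, the toy frames would be replaced by the `pd.read_parquet` calls shown under Usage. An inner join drops target days that have no MODIS row, which may or may not be the desired behavior.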
### ERA5-Land (6 / 36 / 150 features)

Three configurable feature sets defined in `feature_sets.json`:

- **Minimal (6):** temperature, precipitation, radiation, evaporation, LAI (high & low vegetation)
- **Standard (36):** minimal + soil temperature/moisture (4 levels), wind, pressure, snow/albedo, radiation components, runoff
- **Full (150):** standard + lake variables, additional flux components, min/max daily variants

### Site Metadata (5 features)

Latitude, longitude, IGBP vegetation type (16 classes), Köppen climate class (5 main / 30 detailed).

## Train-Test Splits

CarbonBench provides two complementary **site-holdout** splits for zero-shot evaluation (random state = 56):

- **IGBP-stratified:** partitioned by vegetation type. 80/20 for common classes (>10 sites), 50/50 for rare classes (≤10 sites).
- **Köppen-stratified:** partitioned by climate zone. Uniform 80/20 split across 5 main classes.

All splits are at the site level: train and test sites are mutually exclusive.

## Usage

### Download

```bash
pip install huggingface_hub
huggingface-cli download alexroz/CarbonBench --repo-type dataset --local-dir data
```

### With the `carbonbench` package

```python
import carbonbench

# Load the three flux targets together with the QC flag
targets = ['GPP_NT_VUT_USTAR50', 'RECO_NT_VUT_USTAR50', 'NEE_VUT_USTAR50']
y = carbonbench.load_targets(targets, include_qc=True)

# Köppen-stratified site-holdout split
y_train, y_test = carbonbench.split_targets(y, split_type='Koppen')

# MODIS features and the minimal ERA5-Land feature set
modis = carbonbench.load_modis()
era = carbonbench.load_era('minimal')

# Join features to targets and scale
train, val, test, x_scaler, y_scaler = carbonbench.join_features(
    y_train, y_test, modis, era, scale=True
)
```

### Direct loading with pandas

```python
import pandas as pd

targets = pd.read_parquet("data/target_fluxes.parquet")
modis = pd.read_parquet("data/MOD09GA.parquet")
era5 = pd.read_parquet("data/ERA5.parquet")
```

## Evaluation

All metrics are computed **per site**, then reported as quantiles (25th percentile, median, 75th percentile):

| Metric | Description |
|--------|-------------|
| R² | Coefficient of determination |
| RMSE | Root mean squared error (gC m⁻² day⁻¹) |
| nMAE | Mean absolute error normalized by site mean flux |
| RAE | Relative absolute error |

## Citation

<!--
```bibtex
@inproceedings{rozanov2025carbonbench,
  title={CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning},
  author={Rozanov, Aleksei and Renganathan, Arvind and Zhang, Yimeng and Kumar, Vipin},
  booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  year={2025}
}
```
-->

## License

MIT
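
## Example: Per-Site Evaluation Sketch

The Evaluation section describes computing each metric per site and then reporting quantiles across sites. A minimal sketch of that procedure is shown here on invented predictions. The function names `per_site_metrics` and `summarize` and the columns `y_true` and `y_pred` are hypothetical, not part of the `carbonbench` package. The card gives only short descriptions of nMAE and RAE, so the exact formulas below (MAE divided by the absolute site-mean flux, and absolute error relative to a predict-the-site-mean baseline) are assumptions.

```python
import numpy as np
import pandas as pd


def per_site_metrics(df, y_true="y_true", y_pred="y_pred", site="site"):
    """Compute R2, RMSE, nMAE, and RAE separately for each site."""
    rows = []
    for name, g in df.groupby(site):
        t = g[y_true].to_numpy(dtype=float)
        p = g[y_pred].to_numpy(dtype=float)
        err = p - t
        ss_res = np.sum(err ** 2)
        ss_tot = np.sum((t - t.mean()) ** 2)
        rows.append({
            "site": name,
            "R2": 1.0 - ss_res / ss_tot,
            "RMSE": np.sqrt(np.mean(err ** 2)),
            # Assumption: MAE normalized by the absolute site-mean flux.
            "nMAE": np.mean(np.abs(err)) / abs(t.mean()),
            # Assumption: absolute error relative to predicting the site mean.
            "RAE": np.sum(np.abs(err)) / np.sum(np.abs(t - t.mean())),
        })
    return pd.DataFrame(rows).set_index("site")


def summarize(metrics):
    """Report the 25th percentile, median, and 75th percentile across sites."""
    return metrics.quantile([0.25, 0.5, 0.75])


# Invented predictions for two test sites.
preds = pd.DataFrame({
    "site": ["A"] * 3 + ["B"] * 3,
    "y_true": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "y_pred": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0],
})

site_metrics = per_site_metrics(preds)
summary = summarize(site_metrics)
print(summary)
```

Aggregating per site before taking quantiles keeps sites with long records from dominating the score. Sites with constant or near-zero mean flux would need guarding against division by zero, which this sketch omits.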

Provider: alexroz