isp-uv-es/ClimX

Name: isp-uv-es/ClimX
Creator: isp-uv-es
Published: 2026-04-03 08:03:44
License: No description available

Hugging Face · Updated 2026-04-03 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/isp-uv-es/ClimX


Resource overview:

---
language:
- en
pretty_name: "ClimX: extreme-aware climate model emulation"
tags:
- climate
- earth-system-model
- machine-learning
- emulation
- extremes
- netcdf
license: mit
task_categories:
- time-series-forecasting
- other
---

# ClimX: a challenge for extreme-aware climate model emulation

ClimX is a challenge about building **fast and accurate machine learning emulators** for the NorESM2-MM Earth System Model, with evaluation focused on **climate extremes** rather than mean climate alone.

## Dataset summary

This dataset contains the **full-resolution** ClimX data in **NetCDF-4** format (targets + forcings, depending on split) with a native grid of $192 \times 288$ (about $1^\circ$) resolution. It also contains the **lite-resolution** version, with a native grid of $12 \times 18$ (about $16^\circ$) resolution:

- **Lite-resolution**: <1GB, $16\times$ spatially coarsened, meant for rapid prototyping.
- **Full-resolution**: ~200GB, full-resolution data for large-scale training.

## What you will do (high level)

You train an emulator that predicts **daily** 2D fields for 7 surface variables:

- `tas`, `tasmax`, `tasmin`
- `pr`, `huss`, `psl`, `sfcWind`

However, the **benchmark targets are 15 extreme indices** derived from daily temperature and precipitation (ETCCDI-style indices). The daily fields are an **intermediate output** your emulator produces (useful for diagnostics and for computing the indices). Participants must therefore predict the daily target variables first. Direct prediction of the leaderboard indices is not allowed.

Conceptually:

$$
x_t = g(f_t, f_{t-1}, \dots, f_{t-\alpha}, x_{t-1}, x_{t-2}, \dots, x_{t-\beta})
$$

where $f_t$ are forcings (greenhouse gases + aerosols) and $x_t$ is the climate state.
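
The conceptual recurrence above can be sketched as an autoregressive rollout. This is only an illustration under stated assumptions: `rollout` and `toy_g` are hypothetical names that are not part of the ClimX code, the emulator `g` is replaced by a toy linear rule, and the array shapes are chosen for the example.

```python
import numpy as np


def rollout(g, forcings, x_init, alpha, beta):
    """Apply x_t = g(f_t..f_{t-alpha}, x_{t-1}..x_{t-beta}) step by step.

    forcings: array of shape (T, n_forcings).
    x_init:   array of shape (beta, ...) with the states preceding the
              first predicted step, oldest first.
    Returns the predicted states for t = alpha .. T-1, stacked on axis 0.
    """
    if beta < 1:
        raise ValueError("beta must be at least 1")
    forcings = np.asarray(forcings)
    states = [np.asarray(x) for x in x_init]
    if len(states) < beta:
        raise ValueError("x_init must provide at least beta states")

    outputs = []
    for t in range(alpha, forcings.shape[0]):
        # Forcing window ordered as f_t, f_{t-1}, ..., f_{t-alpha}.
        f_window = forcings[t - alpha : t + 1][::-1]
        # State window ordered as x_{t-1}, ..., x_{t-beta}.
        x_window = np.stack(states[-beta:][::-1])
        x_t = g(f_window, x_window)
        states.append(x_t)  # the prediction feeds the next step
        outputs.append(x_t)
    return np.stack(outputs)


def toy_g(f_window, x_window):
    """Hypothetical stand-in for a trained emulator (not a real model)."""
    return 0.9 * x_window[0] + 0.1 * f_window[0].sum()


# Example: 5 time steps, 2 forcing channels, one initial 2x3 state field.
fields = rollout(toy_g, np.ones((5, 2)), np.zeros((1, 2, 3)), alpha=1, beta=1)
print(fields.shape)
```

Because each prediction is fed back as input, errors can accumulate over a long rollout, which is one reason the daily fields are worth inspecting even though the leaderboard scores the indices.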
## Dataset structure

### Spatial and temporal shape

Full-resolution daily fields:

- **Historical**: `lat: 192, lon: 288, time: 60224`
- **Projections**: `lat: 192, lon: 288, time: 31389`

### Splits and scenarios (official challenge setup)

Training uses historical + several SSP scenarios; testing is on the held-out **SSP2-4.5** scenario:

- **Train**: historical (1850–2014) + `ssp126`, `ssp370`, `ssp585` (2015–2100)
- **Test (held-out)**: `ssp245` (2015–2100)

To avoid leakage, **targets for `ssp245` are withheld** in the official evaluation; only the **forcings** are provided for that scenario. The full outputs will be released after the competition.

## Evaluation metric

The primary leaderboard metric is the region-wise **normalized Nash–Sutcliffe efficiency (nNSE)**, averaged over 15 climate extreme indices.

For each index $v$, grid cell $(i,j)$, a validity mask $\mathcal{V}$ excludes cells with negligible temporal variability. Cell-level $R^2$ and nNSE are:

$$
R^2_{ij} = 1 - \frac{\mathrm{MSE}_{ij}}{\mathrm{Var}_t(gt_{ij})}, \qquad \mathrm{nNSE}_{ij} = \frac{R^2_{ij}}{2 - R^2_{ij}}
$$

For each AR6 land region $k$, the area-weighted regional score is:

$$
\mathrm{nNSE}_{kv} = \frac{\sum_{(i,j)\in k \cap \mathcal{V}} \cos\phi_i \, \mathrm{nNSE}_{ij}}{\sum_{(i,j)\in k \cap \mathcal{V}} \cos\phi_i}
$$

The final score averages uniformly over valid regions and indices:

$$
S = \frac{1}{|V|} \sum_{v \in V} \frac{1}{|K_v|} \sum_{k \in K_v} \mathrm{nNSE}_{kv}
$$

$S=1$ is perfect agreement, $S=0$ corresponds to a mean predictor, and $S \to -1$ is pathological.

## How to load the data

This dataset is distributed as **NetCDF-4** files. There are two common ways to load it.
### Option 1 (recommended): clone the ClimX code and use the helper loader

The ClimX repository already includes a helper module (`src/data/climx_hf.py`) that allows you to download the dataset from Hugging Face and open it as three lazily-loaded “virtual” xarray datasets:

```bash
git clone https://github.com/IPL-UV/ClimX.git
cd ClimX
pip install -U "huggingface-hub" xarray netcdf4 dask
```

```python
from src.data.climx_hf import download_climx_from_hf, open_climx_virtual_datasets

# Download NetCDF artifacts from HF into a local cache directory.
root = download_climx_from_hf("/path/to/hf_cache", variant="full")

# Open as three virtual datasets (lazy / dask-friendly).
ds = open_climx_virtual_datasets(root, variant="full")  # or "lite"

ds.hist           # historical (targets + forcings)
ds.train          # projections training scenarios (targets + forcings; excludes `ssp245` scenario)
ds.test_forcings  # `ssp245` scenario forcings only (no targets)
```

### Option 2: download NetCDFs and open with xarray directly

You can also download files from Hugging Face and open them with **xarray**. Example:

```python
from huggingface_hub import hf_hub_download
import xarray as xr

path = hf_hub_download(
    repo_id="isp-uv-es/ClimX",
    repo_type="dataset",
    filename="PATH/TO/A/FILE.nc",  # replace with an actual file in this dataset repo
)

ds = xr.open_dataset(path)
print(ds)
```

## Links

- [Kaggle main track](https://www.kaggle.com/competitions/climx)
- [Kaggle UQ track](https://www.kaggle.com/competitions/clim-x-uq-track)
- [Full dataset (this page)](https://huggingface.co/datasets/isp-uv-es/ClimX)
- [Public code repository (challenge materials)](https://github.com/IPL-UV/ClimX)
- [Website](https://ipl-uv.github.io/ClimX/)

## Sponsorship

ClimX is supported by ESA Phi-lab, which sponsors challenge prizes and travel support for winning teams.

## License and usage

The dataset is released under **MIT**.
In addition, if you are participating in the ClimX competition, please follow the competition rules (notably: restrictions on external climate training data and redistribution of competition data).
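
For local validation, the nNSE definition from the evaluation metric section can be sketched in NumPy. This is a minimal sketch, not the official scoring code: the function names are hypothetical, the variance threshold `var_eps` used for the validity mask is an assumed value (the source only says cells with negligible temporal variability are excluded), and region masks are passed in as plain boolean arrays rather than derived from AR6 region definitions.

```python
import numpy as np


def nnse_per_cell(pred, gt, var_eps=1e-12):
    """Cell-level nNSE for arrays of shape (time, lat, lon).

    Returns (nnse, valid), both of shape (lat, lon). `var_eps` is an
    assumed threshold for "negligible temporal variability".
    """
    mse = np.mean((pred - gt) ** 2, axis=0)
    var = np.var(gt, axis=0)
    valid = var > var_eps
    # Guard the division for invalid cells, then mark them as NaN.
    r2 = np.where(valid, 1.0 - mse / np.where(valid, var, 1.0), np.nan)
    nnse = r2 / (2.0 - r2)  # R^2 <= 1, so the denominator is >= 1
    return nnse, valid


def regional_score(nnse, valid, lat_deg, region_mask):
    """cos(latitude)-weighted mean of nNSE over valid cells in one region."""
    weights = np.broadcast_to(np.cos(np.deg2rad(lat_deg))[:, None], nnse.shape)
    sel = valid & region_mask
    if not sel.any():
        return float("nan")  # region has no valid cells for this index
    return float(np.sum(weights[sel] * nnse[sel]) / np.sum(weights[sel]))


def final_score(regional):
    """Average over valid regions per index, then uniformly over indices.

    regional: array of shape (n_indices, n_regions), NaN for invalid regions.
    """
    return float(np.mean(np.nanmean(regional, axis=1)))
```

With these definitions, a perfect prediction gives 1 in every valid cell, and predicting the temporal mean of the ground truth gives 0, matching the interpretation of $S$ given in the metric section.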

Provided by:

isp-uv-es

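Summary

Dataset introduction

Construction

The ClimX dataset is built from the NorESM2-MM Earth System Model and is intended as a benchmark for extreme-aware machine learning emulation. The historical period (1850–2014) and several Shared Socioeconomic Pathway scenarios (`ssp126`, `ssp370`, `ssp585`) form the training set, while the SSP2-4.5 scenario is held out as an independent test set to keep model evaluation rigorous. The data are stored in NetCDF-4 format in two versions: full resolution (about 200 GB) for large-scale training and lite resolution (under 1 GB) for rapid prototyping. The fields cover a global grid, with 60,224 daily time steps in the historical data and 31,389 in the projections.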

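Features

The defining feature of ClimX is its focus on evaluating climate extremes rather than mean climate alone. The dataset contains daily 2D fields for seven surface variables, including temperature, precipitation, and humidity, from which 15 ETCCDI-style extreme indices are derived as the benchmark targets. These indices are scored with a region-wise, area-weighted normalized Nash–Sutcliffe efficiency (nNSE). Through the helper loader, the data can be opened as lazily loaded virtual xarray datasets that work with Dask for parallel processing. Training and test scenarios are strictly separated to avoid data leakage.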
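
To illustrate how an ETCCDI-style index is derived from daily fields, the sketch below computes an annual maximum, the pattern behind standard ETCCDI indices such as TXx (annual maximum of daily maximum temperature) and Rx1day (annual maximum 1-day precipitation). Whether these two are among the 15 challenge indices is not stated here. The helper name `annual_max` is hypothetical, and the fixed 365-day year is an assumption made for the example; trailing days that do not fill a whole year are dropped.

```python
import numpy as np


def annual_max(daily, days_per_year=365):
    """Annual maxima of a daily field.

    daily: array of shape (time, lat, lon).
    Returns an array of shape (n_years, lat, lon). Assumes every year has
    `days_per_year` days (an assumption for this sketch) and ignores any
    incomplete final year.
    """
    n_years = daily.shape[0] // days_per_year
    trimmed = daily[: n_years * days_per_year]
    by_year = trimmed.reshape(n_years, days_per_year, *daily.shape[1:])
    return by_year.max(axis=1)


# Example with synthetic data: two full years plus one extra day.
rng = np.random.default_rng(0)
tasmax = 280.0 + 10.0 * rng.random((731, 12, 18))  # lite-grid-sized toy field
txx = annual_max(tasmax)
print(txx.shape)
```

Applied to `tasmax` this yields a TXx-like field and applied to `pr` an Rx1day-like field; an emulator's predicted daily fields would be passed through the same kind of reduction before scoring.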

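Usage

The recommended way to access ClimX is through the helper loader module in the official code repository. Clone the GitHub repository and install the dependencies, call `download_climx_from_hf` to download the data into a local cache, then call `open_climx_virtual_datasets` to lazily open three virtual datasets: the historical data, the training scenarios, and the test-scenario forcings. Lazy loading lets large volumes of data be processed without reading everything into memory. Alternatively, download the NetCDF files directly from the Hugging Face Hub and open them with xarray, which suits custom data pipelines. In either case, note that the test set provides forcings only; the target variables must be produced by the model's predictions, in line with the challenge's evaluation rules.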

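Background and challenges

Background overview

In climate change research, Earth System Models (ESMs) are the key tools for simulating future climate scenarios, but their high computational cost limits how widely they can be used. Machine learning emulators of climate models have emerged in response, aiming to deliver accurate predictions at much lower computational cost. The ClimX dataset was created recently by IPL-UV and other institutions to support fast and accurate emulators of the NorESM2-MM Earth System Model. Its core research question is how to improve the prediction of climate extremes rather than mean climate alone. By providing high-resolution daily data for the historical period and several Shared Socioeconomic Pathway (SSP) scenarios, the dataset brings together climate informatics and extreme-event analysis and gives a basis for developing efficient emulation algorithms.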

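Current challenges

ClimX targets the problem of predicting extremes in climate model emulation. Models must capture not only the mean climate state but also rare, high-impact temperature and precipitation extremes, and evaluation on 15 ETCCDI-style extreme indices makes the prediction task harder. Building and using the dataset raises several difficulties. First, the full-resolution data (about 200 GB) impose a heavy storage and compute load, which is why a lite version is provided for rapid prototyping. Second, leakage between training and test scenarios must be avoided, in particular by evaluating strictly on SSP2-4.5, whose targets are withheld. Third, the pipeline from multivariate daily fields to derived extreme indices requires emulators with strong spatiotemporal modeling capability.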

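Common scenarios

Classic use cases

In climate modeling, ClimX serves as a benchmark for building fast and accurate machine learning emulators of Earth System Models. Its core use case is training models to predict daily 2D fields for seven surface variables, including temperature, precipitation, humidity, and wind speed, and then evaluating 15 climate extreme indices computed from those fields. This requires models to reproduce not only the mean climate state but also the changing behavior of extreme events.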

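Derived related work

Work building on ClimX sits mainly at the intersection of machine learning and climate science. For example, researchers have used the dataset to develop spatiotemporal prediction models based on Transformers or convolutional neural networks to improve the simulation of climate extremes. Other work focuses on uncertainty quantification, using ensemble learning or probabilistic modeling to assess the reliability of emulation results. These efforts advance climate emulation and offer methods that may carry over to broader Earth system science problems.

Recent research on the dataset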