Observation definitions and their implications in machine learning-based predictions of excessive rainfall
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.kwh70rzdx
Description:
The implications of definitions of excessive rainfall observations for machine learning model forecast skill are assessed using the Colorado State University Machine Learning Probabilities (CSU-MLP) forecast system. The CSU-MLP uses historical observations along with reforecasts from a global ensemble to train random forests to probabilistically predict excessive rainfall events. Here, random forest models are trained using two distinct rainfall datasets: one composed of fixed-frequency (FF) average recurrence interval exceedances and flash flood reports, and the other a compilation of flooding and rainfall proxies (Unified Flood Verification System; UFVS). Both models generate 1-3 day forecasts and are evaluated against a climatological baseline to characterize their overall skill as a function of lead time, season, and region. Model comparisons suggest that regional frequencies of excessive rainfall observations contribute to when and where the ML models issue forecasts, and subsequently to their skill and reliability. Additionally, the spatio-temporal distribution of observations has implications for ML model training requirements, notably how long an observational record is needed to obtain skillful forecasts. Experiments reveal that shorter-trained UFVS-based models can be as skillful as longer-trained FF-based models. In essence, the UFVS dataset exhibits a more robust characterization of excessive rainfall and its impacts, and machine learning models trained on more representative datasets of meteorological hazards may not require as extensive training to generate skillful forecasts.
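As an illustration of what "skill relative to a climatological baseline" can mean for probabilistic forecasts, the sketch below computes a Brier skill score (BSS). The description does not state which metric the study uses, so the choice of BSS is an assumption, and the function names and the toy numbers are hypothetical.

```python
from typing import Optional, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    climo_prob: Optional[float] = None,
) -> float:
    """BSS = 1 - BS_model / BS_climatology.

    Assumption: when no climatological probability is supplied, the
    observed event frequency in `outcomes` is used as the baseline.
    """
    if climo_prob is None:
        climo_prob = sum(outcomes) / len(outcomes)
    bs_model = brier_score(probs, outcomes)
    bs_climo = brier_score([climo_prob] * len(outcomes), outcomes)
    if bs_climo == 0:
        raise ValueError("baseline Brier score is zero; BSS is undefined")
    return 1.0 - bs_model / bs_climo


# Toy example (made-up numbers): four grid points, one observed event.
toy_outcomes = [0, 0, 1, 0]
toy_probs = [0.1, 0.1, 0.8, 0.2]
toy_bss = brier_skill_score(toy_probs, toy_outcomes)
```

A BSS of 1 corresponds to a perfect forecast, 0 to a forecast no better than the baseline, and negative values to a forecast worse than the baseline.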
Methods
These data include publicly available UFVS observations from NOAA, publicly available Weather Prediction Center (WPC) excessive rainfall outlooks, and forecasts generated by the machine learning prediction system detailed in the corresponding manuscript. The UFVS observations were retrieved from an online repository whose files also included WPC excessive rainfall outlooks at varying lead times (e.g., day1, day2, etc.). The machine learning-based forecasts are generated on NCEP grid 4 by default and are therefore regridded to match the grid of the excessive rainfall outlook. These datasets are then combined into a 'master' netCDF file for each forecast day for easy compression and storage (e.g., day1_csu_mlp_20201005_20231003.nc). The master netCDF files additionally carry metadata for the latitude and longitude points of the grid and for the forecast day strings. Additional data are provided related to a climatology of flash flooding events and the training datasets discussed in the corresponding manuscript.
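When working with several master files, it can help to read the forecast day and date range from the file name. The sketch below parses names shaped like the single example given above; the general pattern is inferred from that one example, and reading the two eight-digit fields as the first and last dates covered is an assumption. `MasterFileInfo` and `parse_master_filename` are hypothetical names.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class MasterFileInfo(NamedTuple):
    lead_day: int   # forecast day taken from the "dayN" prefix
    start: date     # assumed: first date covered by the file
    end: date       # assumed: last date covered by the file


# Pattern inferred from the example "day1_csu_mlp_20201005_20231003.nc".
_MASTER_NAME = re.compile(r"^day(\d+)_csu_mlp_(\d{8})_(\d{8})\.nc$")


def parse_master_filename(name: str) -> MasterFileInfo:
    """Split a master netCDF file name into lead day and date range."""
    match = _MASTER_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    lead, start, end = match.groups()
    return MasterFileInfo(
        lead_day=int(lead),
        start=datetime.strptime(start, "%Y%m%d").date(),
        end=datetime.strptime(end, "%Y%m%d").date(),
    )


info = parse_master_filename("day1_csu_mlp_20201005_20231003.nc")
```

Names that do not fit the inferred pattern raise `ValueError` rather than being guessed at, which keeps any mismatch with the actual archive contents visible.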
Created:
2024-10-07



