Can ingredients-based forecasting be learned? Disentangling a random forest's severe weather predictions
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.0rxwdbs7w
Description:
Machine learning (ML)-based models have been rapidly integrated into forecast practices across the weather forecasting community in recent years. While ML tools introduce additional data to forecasting operations, there is a need for explainability to be available alongside the model output, such that the guidance can be transparent and trustworthy for the forecaster. This work makes use of the tree interpreter (TI) algorithm to disaggregate the contributions of meteorological features used in the Colorado State University Machine Learning Probabilities (CSU-MLP) system, a random forest-based ML tool that produces real-time probabilistic forecasts for severe weather using inputs from the Global Ensemble Forecast System v12. TI feature contributions are analyzed in time and space for CSU-MLP day-2 and day-3 individual hazard (tornado, wind, and hail) forecasts and day-4 aggregate severe forecasts over a 2-yr period. For individual forecast periods, this work demonstrates that feature contributions derived from TI can be interpreted in an ingredients-based sense, effectively making the CSU-MLP probabilities physically interpretable. When investigated in an aggregate sense, TI illustrates that the CSU-MLP system's predictions use meteorological inputs in ways that are consistent with the spatiotemporal patterns seen in meteorological fields that pertain to severe storm climatology. This work concludes with a discussion on how these insights could be beneficial for model development, real-time forecast operations, and retrospective event analysis.
Methods
Forecast data: These data include publicly available local storm reports (from NOAA), publicly available Storm Prediction Center (SPC) outlooks, and forecasts generated by the machine learning prediction system detailed in the manuscript. The local storm reports were retrieved from an online public-facing archive and gridded to NCEP grid 4. The SPC outlooks were originally in shapefile format, and ArcGIS was used to convert the shapefiles to netCDF. The gridded netCDF SPC outlooks were then regridded to NCEP grid 4 so that they could be verified against the local storm reports. Lastly, the machine learning-based forecasts are generated on the NCEP grid. These datasets are then combined into a single 'master' netCDF file for each forecast lead time examined in the study (day 2, day 3, and day 4) for easy compression and storage. The master netCDF files also carry metadata for the latitude and longitude points of the grid and for the forecast day strings. Forecasts span October 2020 through April 2023.
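As an illustration of the gridding step, the sketch below flags each grid cell that contains at least one storm report. It is a minimal example and not the processing code used for this dataset: the function name `grid_reports`, the 0.5-degree spacing, the regional grid bounds, and the report coordinates are all assumptions chosen for the example.

```python
def grid_reports(reports, lat0, lon0, spacing, nlat, nlon):
    """Flag every grid cell that holds at least one report.

    reports: iterable of (lat, lon) pairs in degrees.
    lat0, lon0: coordinates of the first (south-west) grid point.
    spacing: grid spacing in degrees (0.5 is an assumption for this example).
    Returns an nlat x nlon nested list of 0/1 flags.
    """
    grid = [[0] * nlon for _ in range(nlat)]
    for lat, lon in reports:
        # Index of the nearest grid point along each axis.
        i = int(round((lat - lat0) / spacing))
        j = int(round((lon - lon0) / spacing))
        # Reports outside the grid are ignored.
        if 0 <= i < nlat and 0 <= j < nlon:
            grid[i][j] = 1
    return grid


# Hypothetical report locations, not taken from the dataset.
reports = [(35.1, -97.4), (35.2, -97.6), (40.0, -105.0), (10.0, 10.0)]
obs = grid_reports(reports, lat0=25.0, lon0=-110.0, spacing=0.5, nlat=50, nlon=90)
n_flagged = sum(sum(row) for row in obs)
```

The first two reports fall nearest the same grid point and the last lies outside the grid, so only two cells are flagged. A binary observation grid of this kind can be compared point by point with gridded outlooks and forecasts.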
Feature contributions: Feature contributions were calculated from the machine learning forecasts described above using the treeinterpreter package for Python. For each forecast day at a given lead time and hazard type (tornado, wind, hail, severe), feature contributions are calculated for all environmental predictors in the dataset (~6,600). At each grid point, the feature contributions are summed over the spatial neighborhood described in the methods of the manuscript to reduce dimensionality. Thus, for a given forecast, the contributions have dimensions of environmental variable, forecast hour, latitude, and longitude. TI contributions corresponding to two years of machine learning forecasts (2021-2022) are combined into a single netCDF file for each forecast hazard and lead time (i.e., 7 files in total: day 2 tornado, wind, and hail; day 3 tornado, wind, and hail; and day 4 "any severe").
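The idea behind tree interpreter contributions can be sketched in plain Python: a tree's prediction equals the mean value at the root (the bias) plus the change in node value at every split along the decision path, with each change credited to the feature used in that split. The example below is an illustrative re-implementation on two hand-built trees, not the treeinterpreter package itself and not the CSU-MLP code. The feature names `cape` and `shear`, the thresholds, and the node values are all hypothetical.

```python
def decompose(tree, x):
    """Return (bias, contributions) with prediction == bias + sum(contributions)."""
    node = tree
    bias = node["value"]  # mean training target at the root
    contributions = {}
    while "feature" in node:  # leaves carry no "feature" key
        name = node["feature"]
        child = node["left"] if x[name] <= node["threshold"] else node["right"]
        # Credit the change in node value to the feature used for this split.
        contributions[name] = contributions.get(name, 0.0) + child["value"] - node["value"]
        node = child
    return bias, contributions


def forest_decompose(trees, x):
    """Average the per-tree bias and contributions over all trees."""
    n = len(trees)
    bias, total = 0.0, {}
    for tree in trees:
        b, contrib = decompose(tree, x)
        bias += b / n
        for name, value in contrib.items():
            total[name] = total.get(name, 0.0) + value / n
    return bias, total


# Two hypothetical trees; every number here is invented for illustration.
tree_a = {"value": 0.10, "feature": "cape", "threshold": 1000,
          "left": {"value": 0.02},
          "right": {"value": 0.30, "feature": "shear", "threshold": 20,
                    "left": {"value": 0.15}, "right": {"value": 0.60}}}
tree_b = {"value": 0.12, "feature": "shear", "threshold": 15,
          "left": {"value": 0.04},
          "right": {"value": 0.25, "feature": "cape", "threshold": 1500,
                    "left": {"value": 0.10}, "right": {"value": 0.50}}}

x = {"cape": 2000, "shear": 25}
bias, contrib = forest_decompose([tree_a, tree_b], x)
prediction = bias + sum(contrib.values())
```

For a fitted scikit-learn forest, the treeinterpreter package exposes the same decomposition through `treeinterpreter.predict(model, X)`, which returns the prediction, bias, and per-feature contributions. Because the contributions are additive, they can be summed over groups of predictors, such as a spatial neighborhood, without breaking the identity between the prediction and the sum of its parts.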
More details on the methods surrounding each of these datasets can be found in the methods section of the manuscript associated with this work.
Created:
2024-05-06



