Codes for Analyzing the Effect of Data Splitting and Covariate Shift on Machine Learning Based Streamflow Prediction in Ungauged Basins

Name: Codes for Analyzing the Effect of Data Splitting and Covariate Shift on Machine Learning Based Streamflow Prediction in Ungauged Basins
Creator: Purdue University Research Repository
Published: 2025-12-18 22:19:00
License: No description available

DataCite Commons | Updated 2025-12-18 | Indexed 2025-04-16

Download link:

https://purr.purdue.edu/publications/4206/1

Resource description:

ML models have been successfully applied to prediction in ungauged basins (PUB) and have been found to consistently outperform physics-based PUB models. A machine learning (ML) algorithm can learn streamflow generation processes from input features, such as meteorological variables and watershed characteristics. In this study, Random Forest (RF) and Artificial Neural Network (ANN) models predict daily streamflow in ungauged basins. These ML models can be used to predict streamflow time series in other large-scale watersheds.

Article abstract: Machine learning (ML) models are attractive alternatives to traditional hydrologic modeling for streamflow prediction in ungauged basins (PUB). However, ungauged basins with highly variable watershed characteristics add uncertainty to PUB frameworks based on ML models. This uncertainty may stem from improper splitting of the data into training and testing sets and from the resulting covariate shift. Covariate shift refers to the inconsistency between the dataset used to train and test an ML model and the real-world (global) dataset to which the trained model is applied. In real-world applications, covariate shift remains a thorny issue for ML but has seldom been investigated in a hydrologic setting. This study evaluates the uncertainty in ML-based PUB under the influence of data splitting and covariate shift using the Monte Carlo method. The Monte Carlo method accumulates simulation results from different data split arrangements into predictive distributions for pseudo-ungauged reaches. Results indicate that ML performance is not robust under covariate shift. ML performance is influenced by the number of heterogeneous variables and by the specific watershed characteristics that display heterogeneity, such as drainage area, dam influence, and urbanized percentage. Additionally, a comparison of predictive distributions between the best- and worst-performing groups and the overall set of reaches shows that the ML algorithm is significantly influenced by variability in basin characteristics, such as dam density and drainage area, and in meteorological variables, such as snowfall and precipitation.
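The Monte Carlo data-splitting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the code distributed in this repository record. Everything in it is an assumption made for illustration: the synthetic reaches, the two features (drainage area and precipitation), the function names, and the k-nearest-neighbor regressor, which stands in for the RF and ANN models so the example needs only the standard library. Each run draws a new random train/test split, treats the test reaches as pseudo-ungauged, stores their predictions, and records a simple covariate-shift indicator (the standardized difference in mean drainage area between the two sets).

```python
import random
import statistics
from collections import defaultdict

SCALE = (1000.0, 5.0)  # assumed feature ranges, used only to normalize distances


def make_synthetic_reaches(n, rng):
    """Build hypothetical reaches: (drainage area, precipitation) -> mean flow."""
    reaches = []
    for i in range(n):
        area = rng.uniform(10.0, 1000.0)   # made-up drainage area
        precip = rng.uniform(1.0, 5.0)     # made-up precipitation
        flow = 0.01 * area * precip + rng.gauss(0.0, 0.5)
        reaches.append({"id": i, "x": (area, precip), "y": flow})
    return reaches


def knn_predict(train, x, k=3):
    """Stand-in for RF/ANN: mean flow of the k nearest training reaches."""
    def dist(r):
        return sum(((a - b) / s) ** 2 for a, b, s in zip(r["x"], x, SCALE))
    nearest = sorted(train, key=dist)[:k]
    return statistics.fmean(r["y"] for r in nearest)


def shift_score(train, test, j):
    """Standardized mean difference of feature j between train and test sets."""
    tr = [r["x"][j] for r in train]
    te = [r["x"][j] for r in test]
    pooled = statistics.pstdev(tr + te)
    return abs(statistics.fmean(tr) - statistics.fmean(te)) / pooled


def monte_carlo_splits(reaches, n_runs=200, test_frac=0.25, seed=0):
    """Repeat random splits and accumulate predictions per pseudo-ungauged reach."""
    rng = random.Random(seed)
    predictions = defaultdict(list)  # reach id -> predictions across runs
    shifts = []                      # one shift indicator per run
    n_test = max(1, int(len(reaches) * test_frac))
    for _ in range(n_runs):
        shuffled = reaches[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        shifts.append(shift_score(train, test, 0))
        for r in test:
            predictions[r["id"]].append(knn_predict(train, r["x"]))
    return predictions, shifts


reaches = make_synthetic_reaches(40, random.Random(42))
predictions, shifts = monte_carlo_splits(reaches)

# Summarize the predictive distribution of the first few reaches.
for rid in sorted(predictions)[:3]:
    p = predictions[rid]
    print(rid, len(p), round(statistics.fmean(p), 2), round(statistics.pstdev(p), 2))
print("largest shift indicator:", round(max(shifts), 2))
```

The spread of each reach's accumulated predictions shows how sensitive its estimate is to the split, and comparing that spread against the per-run shift indicator is one simple way to examine whether poorly matched train/test sets coincide with poorer predictions.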

Provider:

Purdue University Research Repository

Created:

2023-01-18