Data for Analyzing the Effect of Data Splitting and Covariate Shift on Machine Leaning Based Streamflow Prediction in Ungauged Basins

Name: Data for Analyzing the Effect of Data Splitting and Covariate Shift on Machine Leaning Based Streamflow Prediction in Ungauged Basins
Creator: Purdue University Research Repository
Published: 2025-12-18 22:18:59
License: No description available

DataCite Commons. Updated: 2025-12-18. Indexed: 2025-04-16.

Download link:

https://purr.purdue.edu/publications/4205/1

Description:

This study aims to build a Prediction in Ungauged Basins (PUB) model to predict daily streamflow series in ungauged basins in the Ohio Region (HUC2: 05) from 2010 to 2020. Daily streamflow data from USGS gaging reaches are the output variable of the PUB models. The input variables of the PUB models in this study are watershed characteristics and meteorological variables. The watershed characteristics include the drainage area, soil, and land-use condition of the watersheds. The meteorological variables include daily observations of rainfall, snowfall, snow depth, and temperature. This dataset includes the input and output for the PUB models used in the study "Analyzing the Effect of Data Splitting and Covariate Shift on Machine Learning Based Streamflow Prediction in Ungauged Basins".

Article abstract: Machine learning (ML) models are attractive alternatives to traditional hydrologic modeling for streamflow predictions in ungauged basins (PUB). However, ungauged basins with highly variable watershed characteristics add uncertainty to PUB frameworks based on ML models. This uncertainty may be due to an improper process for splitting the data into training and testing sets and the resulting covariate shift. Covariate shift refers to the inconsistency between the dataset used to train and test an ML model and the real-world (global) dataset on which the trained model is implemented. In real-world applications, covariate shift remains a thorny issue for ML but has seldom been investigated in a hydrologic setting. This study aims to evaluate the uncertainty in ML-based PUB under the influence of data splitting and covariate shift using the Monte Carlo method. The Monte Carlo method accumulates simulation results from different data split arrangements into the predictive distributions of pseudo-ungauged reaches. Results indicate that ML performance is not robust under covariate shift. ML performance is influenced by the number of heterogeneous variables and by the specific watershed characteristics displaying heterogeneity, such as drainage area, dam influence, and urbanized percentage. Additionally, comparing the predictive distributions of ML performance for the best/worst groups against those for all reaches shows that the ML algorithm is significantly influenced by variability in basin characteristics, such as dam density and drainage area, and in meteorological variables, such as snowfall and precipitation.
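The Monte Carlo idea in the abstract (repeating the train/test split many times and collecting the held-out predictions for each pseudo-ungauged reach) can be illustrated with a minimal Python sketch. This is not the authors' code. The reach data are synthetic, the "model" is a stand-in least-squares slope through the origin rather than an ML algorithm, and every name (`monte_carlo_split`, `reaches`, `test_frac`) is hypothetical.

```python
import random
import statistics


def monte_carlo_split(reaches, n_iter=200, test_frac=0.3, seed=0):
    """Accumulate held-out predictions per reach over random reach-level splits.

    reaches: dict mapping reach id -> (drainage_area, mean_flow).
    Returns a dict mapping reach id -> list of predictions made while held out.
    """
    rng = random.Random(seed)
    ids = sorted(reaches)
    n_test = max(1, int(len(ids) * test_frac))
    preds = {rid: [] for rid in ids}
    for _ in range(n_iter):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        test_ids, train_ids = shuffled[:n_test], shuffled[n_test:]
        # Stand-in "model" (an assumption): flow = k * area, fitted by least squares.
        num = sum(reaches[r][0] * reaches[r][1] for r in train_ids)
        den = sum(reaches[r][0] ** 2 for r in train_ids)
        k = num / den
        # Held-out reaches play the role of pseudo-ungauged reaches.
        for r in test_ids:
            preds[r].append(k * reaches[r][0])
    return preds


# Synthetic reaches (hypothetical values, not taken from the dataset).
data_rng = random.Random(42)
reaches = {}
for i in range(20):
    area = data_rng.uniform(50, 5000)
    flow = 0.01 * area * data_rng.uniform(0.7, 1.3)
    reaches[f"reach_{i:02d}"] = (area, flow)

preds = monte_carlo_split(reaches)
for rid in list(preds)[:3]:
    p = preds[rid]
    # Count, mean, and spread of the predictive distribution for this reach.
    print(rid, len(p), round(statistics.mean(p), 2), round(statistics.pstdev(p), 2))
```

Splitting at the reach level, rather than by individual daily records, keeps every held-out reach entirely unseen during fitting, which is what makes it behave like an ungauged basin. The spread of each reach's accumulated predictions then reflects how sensitive the result is to the split arrangement.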

Provider:

Purdue University Research Repository

Created:

2023-01-18
