Data from: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure
Source: DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-05-10
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.737gk
Description:
Ecological data often show temporal, spatial, hierarchical (random
effects), or phylogenetic structure. Modern statistical approaches are
increasingly accounting for such dependencies. However, when performing
cross-validation, these structures are regularly ignored, resulting in
serious underestimation of predictive error. One cause for the poor
performance of uncorrected (random) cross-validation, noted often by
modellers, is that dependence structures in the data persist as
dependence structures in model residuals, violating the assumption of
independence. Even more concerning, because often overlooked, is that
structured data also provide ample opportunity for overfitting with
non-causal predictors. This problem can persist even if remedies such as
autoregressive models, generalized least squares, or mixed models are
used. Block cross-validation, where data are split strategically rather
than randomly, can address these issues. However, the blocking strategy
must be carefully considered. Blocking in space, time, random effects or
phylogenetic distance, while accounting for dependencies in the data, may
also unwittingly induce extrapolations by restricting the ranges or
combinations of predictor variables available for model training, thus
overestimating interpolation errors. On the other hand, deliberate
blocking in predictor space may also improve error estimates when
extrapolation is the modelling goal. Here, we review the ecological
literature on non-random and blocked cross-validation approaches. We also
provide a series of simulations and case studies, in which we show that,
for all instances tested, block cross-validation is nearly universally
more appropriate than random cross-validation if the goal is predicting to
new data or predictor space, or for selecting causal predictors. We
recommend that block cross-validation be used wherever dependence
structures exist in a dataset, even if no correlation structure is visible
in the fitted model residuals, or if the fitted models account for such
correlations.
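
As an illustration of the difference between random and block splitting described above, the following is a minimal sketch in Python, not the authors' code. It assumes spatial data with (x, y) coordinates and uses square grid cells as blocks; the function names block_folds and random_folds, the block_size parameter, and the grid-cell blocking rule are all hypothetical choices made for this example. The same idea would apply to blocks defined by time windows, random-effect groups, or phylogenetic clades.

```python
import random
from collections import defaultdict


def block_folds(coords, block_size, k, seed=0):
    """Yield (train_idx, test_idx) pairs in which whole spatial blocks are held out.

    coords     -- list of (x, y) locations, one per observation
    block_size -- side length of the square grid cells used as blocks (an assumption)
    k          -- number of folds
    """
    # Group observation indices by the grid cell they fall into.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[(int(x // block_size), int(y // block_size))].append(i)

    if len(blocks) < k:
        raise ValueError("need at least as many blocks as folds")

    # Shuffle the blocks (not the observations) and deal them out to folds.
    block_ids = sorted(blocks)
    random.Random(seed).shuffle(block_ids)
    fold_of_block = {b: j % k for j, b in enumerate(block_ids)}

    for fold in range(k):
        test = [i for b in block_ids if fold_of_block[b] == fold for i in blocks[b]]
        train = [i for b in block_ids if fold_of_block[b] != fold for i in blocks[b]]
        yield sorted(train), sorted(test)


def random_folds(n, k, seed=0):
    """Ordinary random k-fold split, shown only for comparison."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield sorted(train), sorted(test)


if __name__ == "__main__":
    # A 10 x 10 grid of sites split into four 5 x 5 blocks.
    coords = [(x, y) for x in range(10) for y in range(10)]
    for train, test in block_folds(coords, block_size=5, k=4):
        print(len(train), "training sites,", len(test), "held-out sites")
```

With random_folds, held-out sites are interleaved with their training neighbours, which is the situation the description warns about. With block_folds, every held-out site lies in a grid cell that contributes nothing to training. As the description cautions, the block size needs care: blocks that are too coarse may remove whole ranges of predictor values from the training data and force extrapolation.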
Provider:
Dryad
Created:
2016-12-08



