
Data from: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure

Source: DataONE. Updated: 2016-12-08. Indexed: 2024-06-26.
Download link:
https://search.dataone.org/view/null
Description:
Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches increasingly account for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause of the poor performance of uncorrected (random) cross-validation, often noted by modellers, is that dependence structures in the data persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because it is often overlooked, is that structured data also provide ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects, or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolations by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or selecting causal predictors. We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
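As a rough illustration of what "split strategically rather than randomly" can look like in practice, the sketch below builds cross-validation folds from whole blocks, so that no block contributes to both the training and the test set of a fold. It is not the authors' code and is not taken from the dataset. The function name `block_kfold`, the greedy fold-balancing rule, and the example site labels are all assumptions made for illustration, and it uses only the Python standard library.

```python
from collections import defaultdict


def block_kfold(block_ids, n_folds):
    """Yield (train_idx, test_idx) pairs in which each block falls wholly in one fold.

    block_ids: one label per observation (e.g. site, year, or clade).
    Hypothetical helper for illustration only.
    """
    # Collect the row indices that belong to each block.
    members = defaultdict(list)
    for i, block in enumerate(block_ids):
        members[block].append(i)
    if n_folds > len(members):
        raise ValueError("n_folds cannot exceed the number of distinct blocks")

    # Greedy balancing (an assumption, not a prescribed method):
    # place the largest blocks first, each into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for block in sorted(members, key=lambda b: (-len(members[b]), str(b))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[block])

    # Each fold serves once as the held-out test set.
    for fold in folds:
        test = sorted(fold)
        held_out = set(test)
        train = [i for i in range(len(block_ids)) if i not in held_out]
        yield train, test


# Hypothetical example: 8 observations recorded at 4 sites.
sites = ["A", "A", "B", "B", "B", "C", "D", "D"]
for train, test in block_kfold(sites, n_folds=2):
    print("train:", train, "test:", test)
```

Blocking on group labels like this addresses hierarchical structure. Spatial, temporal, or phylogenetic blocking would need block labels derived from coordinates, time stamps, or phylogenetic distance, which this sketch does not attempt. Libraries such as scikit-learn offer comparable grouped splitters (for example `GroupKFold`) for production use.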

Created:
2016-12-08