
Data from: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure

DataONE | Updated 2016-12-08 | Indexed 2024-06-26
Download link:
https://search.dataone.org/view/null
Description:
Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches increasingly account for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause of the poor performance of uncorrected (random) cross-validation, often noted by modellers, is that dependence structures in the data persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because it is often overlooked, is that structured data also provide ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects, or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolation by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies in which we show that, for all instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or selecting causal predictors. We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
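
To make the contrast between random and block splitting concrete, here is a minimal sketch of a block k-fold splitter. It is an illustration only, not the authors' code and not part of the dataset: the function name `block_kfold`, the `sites` labels, and the choice of a grouping factor as the blocking unit are all assumptions made for this example. Blocking in space, time, or phylogenetic distance would need blocks defined from coordinates, time windows, or a distance matrix instead.

```python
import random
from collections import defaultdict


def block_kfold(block_ids, k, seed=0):
    """Yield (train_idx, test_idx) pairs in which whole blocks are held out together.

    block_ids: one block label per observation (hypothetical input, e.g. site names).
    k: number of folds; must not exceed the number of distinct blocks.
    """
    # Collect the row indices belonging to each block.
    rows_by_block = defaultdict(list)
    for i, b in enumerate(block_ids):
        rows_by_block[b].append(i)

    blocks = sorted(rows_by_block)
    if k > len(blocks):
        raise ValueError("k cannot exceed the number of distinct blocks")

    # Shuffle the blocks (not the individual rows), then deal them into k folds.
    random.Random(seed).shuffle(blocks)
    folds = [blocks[i::k] for i in range(k)]

    for held_out in folds:
        held = set(held_out)
        test = sorted(i for b in held_out for i in rows_by_block[b])
        train = [i for i, b in enumerate(block_ids) if b not in held]
        yield train, test


# Hypothetical example: 12 observations from 4 sites, with sites as blocks.
sites = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"]
for train, test in block_kfold(sites, k=2):
    # No site contributes rows to both sides of a split.
    assert {sites[i] for i in train}.isdisjoint({sites[i] for i in test})
```

A random k-fold split would shuffle the rows themselves, so observations from the same site could land in both the training and the test set. That is the situation the description above associates with underestimated predictive error.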

Created:
2016-12-08