Data from: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure

DataONE. Updated 2016-12-08; indexed 2024-06-26

Download link:

https://search.dataone.org/view/null


Description:

Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches increasingly account for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause of the poor performance of uncorrected (random) cross-validation, often noted by modellers, is that dependence structures in the data persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because it is often overlooked, is that structured data also provide ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects, or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolation by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies, in which we show that, across the instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or selecting causal predictors. We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
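
The contrast between random and block splitting described above can be illustrated with a minimal sketch. It is not taken from the dataset or the accompanying paper; the function names (`random_folds`, `block_folds`, `mean_distance_to_training`) and the toy setup of 100 equally spaced observations along a single coordinate (for example, time steps or positions on a transect) are assumptions made purely for illustration. The sketch only shows how the two schemes assign folds and how far each held-out observation sits from its nearest training observation; it fits no model.

```python
import random


def random_folds(n, k, seed=0):
    """Assign n observations to k balanced folds in random order."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [0] * n
    for pos, i in enumerate(idx):
        folds[i] = pos % k
    return folds


def block_folds(coords, k):
    """Assign observations to k folds as contiguous blocks along one coordinate.

    Hypothetical helper: sorts by the coordinate (e.g. time or position)
    and cuts the sorted sequence into k blocks of roughly equal size.
    """
    n = len(coords)
    order = sorted(range(n), key=lambda i: coords[i])
    folds = [0] * n
    for rank, i in enumerate(order):
        folds[i] = rank * k // n
    return folds


def mean_distance_to_training(coords, folds):
    """Average distance from each observation to the nearest observation
    in a different fold, i.e. the nearest point it would be predicted from."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        total += min(
            abs(coords[i] - coords[j]) for j in range(n) if folds[j] != folds[i]
        )
    return total / n


# Toy data (assumption): 100 observations at equally spaced positions.
coords = list(range(100))
k = 5

gap_random = mean_distance_to_training(coords, random_folds(len(coords), k))
gap_block = mean_distance_to_training(coords, block_folds(coords, k))

print("mean gap to training data, random folds:", gap_random)
print("mean gap to training data, block folds: ", gap_block)
```

Under random assignment, most held-out observations have an immediate neighbour in the training set, so autocorrelated values are effectively seen during training. Under block assignment the held-out observations are, on average, much farther from any training observation, which is the property that makes the resulting error estimate less optimistic and, as the description cautions, can also push the evaluation toward extrapolation.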


Created:

2016-12-08
