Supplementary Material to Letter to the Editor concerning: https://doi.org/10.1016/j.scitotenv.2024.175642

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14902118
Resource description:
This is the supplementary material to the Letter to the Editor regarding Feeney et al. 2024, “Benchmarking soil organic carbon (SOC) concentration provides more robust soil health assessment than the SOC/clay ratio at European scale”, by Luca G. Bernardini, Elisa Bruni, Emma Izquierdo-Verdiguier, Eric Smit & Christoph Rosinger. The repository includes the two datasets described in the publication and sample code to replicate our results.

Datasets: The two datasets used for this analysis contain the same variables but differ in the number of entries, regional extent and number of unique combinations. The predictor variables for the SOC benchmarks were soil texture class (based on the Austrian classification, ÖNORM L1050), Environmental Zone (based on Alterra report 2281), and broad land use class (grassland or cropland). SOC content was log-transformed.

First, the Austrian Agricultural Soil Survey (AASS_rpart, n = 6312) is a systematic survey of cropland and grassland soils in Austria that was initiated in the 1970s; for more details, we refer to Baumgarten et al. (2021).

Second, the Land Use and Coverage Area frame Survey (LUCAS_rpart, n = 14246; Orgiazzi et al., 2018) was aggregated over the three sampling intervals (2009, 2015 and 2018) and filtered to contain only observations on cropland and grassland.

Sample codes: Three sample R scripts are provided to replicate the results of this work, following its two main arguments.

First, we show the effect of tuneLength (a parameter of the train function) on the number of terminal nodes: higher tuneLength values result in larger models. In a nutshell, a higher tuneLength leads to a finer tuning grid, which allows for larger models; caret selects these as optimal because of their lower RMSE (RMSE-based selection). While this approach is often sensible in machine learning, it can lead to a biased result when decision trees are used to generate representative groups.
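The RMSE-based behaviour described above can be illustrated with a small, language-agnostic sketch. It is written in plain Python and is not the authors' R code. All numbers are invented for illustration, and the helper names (`tuning_grid`, `select_lowest_rmse`) are hypothetical. The sketch assumes that a larger tune length extends the grid toward smaller complexity-parameter (cp) values, and that cross-validated RMSE keeps decreasing slightly as the tree grows.

```python
# Hypothetical cross-validation results, ordered from simplest to most complex tree.
# Each row is (cp, terminal_nodes, cv_rmse). The values are invented for
# illustration and are NOT taken from the AASS or LUCAS datasets.
CANDIDATES = [
    (0.0500, 2, 0.520),
    (0.0200, 4, 0.480),
    (0.0100, 6, 0.470),
    (0.0050, 9, 0.465),
    (0.0020, 14, 0.462),
    (0.0010, 22, 0.460),
    (0.0005, 35, 0.459),
]


def tuning_grid(candidates, tune_length):
    """Return the first `tune_length` candidates (assumed grid construction)."""
    return candidates[:tune_length]


def select_lowest_rmse(grid):
    """RMSE-based selection: pick the candidate with the smallest CV RMSE."""
    return min(grid, key=lambda row: row[2])


if __name__ == "__main__":
    for tune_length in (3, 5, 7):
        cp, leaves, rmse = select_lowest_rmse(tuning_grid(CANDIDATES, tune_length))
        print(f"tune_length={tune_length}: cp={cp}, terminal nodes={leaves}, RMSE={rmse}")
```

Under these assumptions, each increase in the tune length admits a larger tree with a marginally lower RMSE, so a purely RMSE-based criterion always moves to the largest tree available in the grid.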
To overcome this issue, we optimize the model with a robust RULE-based approach, which selects the simplest model within one standard error of the best-performing model and thus leads to a parsimonious model selection.

Second, following Feeney et al. (2024), we trained the models over 5000 iterations with different splits of the dataset. To select the most representative model, the former publication used the model with the lowest RMSE (best model) across the 5000 iterations. We show that, if the focus is on robustness, one should instead select the most frequently occurring model across the 5000 iterations. This leads to a more representative model at a minimal cost in terms of RMSE.

Sample code caret rpart model optimization RMSE-based.R: This code shows the effect of tuneLength on the number of terminal nodes.

Sample code rpart model optimization RULE-based.R: This code shows an alternative approach to model optimization, based on parsimonious model selection.

Sample code Final model selection most frequent vs best rpart.R: This code shows the effect of selecting the best model (lowest RMSE) and of a rule-based model selection on the frequency of appearance of the terminal nodes within the 5000 iterations.
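The two selection ideas described here, the one-standard-error rule and choosing the most frequent model across repeated splits, can be sketched as follows. This is a minimal Python illustration, not the authors' R scripts. The data are invented, the function names are hypothetical, and each fitted tree is assumed to be reducible to a comparable signature (here a plain label such as "A").

```python
from collections import Counter

# Hypothetical tuning results: (terminal_nodes, cv_rmse, rmse_standard_error).
# Invented values for illustration only.
GRID = [
    (2, 0.520, 0.010),
    (4, 0.480, 0.010),
    (6, 0.470, 0.010),
    (9, 0.465, 0.010),
    (14, 0.462, 0.010),
    (22, 0.460, 0.010),
    (35, 0.459, 0.010),
]

# Hypothetical outcome of repeated train/test splits: (model_signature, rmse).
# A signature stands for one distinct tree structure (assumption of this sketch).
ITERATIONS = [
    ("A", 0.471), ("B", 0.468), ("A", 0.473), ("A", 0.470),
    ("C", 0.459), ("B", 0.469), ("A", 0.472),
]


def lowest_rmse(grid):
    """RMSE-based selection: the candidate with the smallest CV RMSE."""
    return min(grid, key=lambda row: row[1])


def select_one_se(grid):
    """Rule-based selection: simplest candidate within one SE of the best RMSE."""
    best = lowest_rmse(grid)
    threshold = best[1] + best[2]
    eligible = [row for row in grid if row[1] <= threshold]
    return min(eligible, key=lambda row: row[0])  # fewest terminal nodes


def best_model(iterations):
    """Single iteration with the lowest RMSE."""
    return min(iterations, key=lambda row: row[1])


def most_frequent_model(iterations):
    """Most frequently occurring signature and its count across iterations."""
    counts = Counter(signature for signature, _ in iterations)
    return counts.most_common(1)[0]


if __name__ == "__main__":
    print("lowest RMSE:", lowest_rmse(GRID))
    print("one-SE rule:", select_one_se(GRID))
    print("best iteration:", best_model(ITERATIONS))
    print("most frequent model:", most_frequent_model(ITERATIONS))
```

In this toy example the one-standard-error rule returns a much smaller tree than the lowest-RMSE criterion, and the most frequent signature differs from the single best iteration, which appears only once. How tree structures are compared across iterations in practice is a design choice this sketch leaves open.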
Created:
2025-02-21