Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO
Source: NIAID Data Ecosystem (indexed 2026-03-08)
Download link:
https://figshare.com/articles/dataset/Recursive_Random_Forests_Enable_Better_Predictive_Performance_and_Model_Interpretation_than_Variable_Selection_by_LASSO/2173198
Resource description:
Variable selection is of crucial significance in QSAR modeling
since it increases a model's predictive ability and reduces noise.
The selection of the right variables is far more complicated than
the development of predictive models. In this study, eight continuous
and categorical data sets were employed to explore the applicability
of two distinct variable selection methods: random forests (RF) and
least absolute shrinkage and selection operator (LASSO). Variable
selection was performed: (1) by using recursive random forests to
rule out a quarter of the least important descriptors at each iteration
and (2) by using LASSO modeling with 10-fold inner cross-validation
to tune its penalty λ for each data set. Along with regular
statistical parameters of model performance, we proposed the highest
pairwise correlation rate, average pairwise Pearson’s correlation
coefficient, and Tanimoto coefficient to evaluate the optimal variables selected by RF
and LASSO in an extensive way. Results showed that variable selection
could allow a tremendous reduction of noisy descriptors (at most 96%
with the RF method in this study) and apparently enhance the model’s
predictive performance as well. Furthermore, random forests showed
the property of gathering important predictors without restricting their
pairwise correlation, which is contrary to LASSO. The mutual exclusion
of highly correlated variables in LASSO modeling tends to skip important
variables that are highly related to response endpoints and thus undermine
the model’s predictive performance. The optimal variables selected
by RF share low similarity with those by LASSO (e.g., the Tanimoto
coefficients were smaller than 0.20 in seven out of eight data sets).
We found that the differences between RF and LASSO predictive performances
mainly resulted from the variables selected by different strategies
rather than the learning algorithms. Our study showed that the right
selection of variables is more important than the learning algorithm
for modeling. We hope that a standard procedure could be developed
based on these proposed statistical metrics to select the truly important
variables for model interpretation, as well as for further use to
facilitate drug discovery and environmental toxicity assessment.
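
The recursive elimination schedule described above (ruling out a quarter of the least important descriptors at each iteration) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name `recursive_elimination`, the `importance_fn` callback, and the rounding rule (floor, with at least one descriptor dropped per round) are all assumptions. In practice `importance_fn` would refit a random forest on the remaining descriptors and return its importance scores; here it is a hypothetical placeholder so the loop can run with the standard library alone. The abstract does not say how the optimal subset is chosen, so the sketch only returns every intermediate subset.

```python
def recursive_elimination(features, importance_fn, drop_fraction=0.25, min_features=1):
    """Return the list of descriptor subsets visited by recursive elimination.

    importance_fn(names) must return a dict mapping each name to a score
    (higher means more important). It stands in for refitting a random
    forest on the remaining descriptors; that step is NOT implemented here.
    """
    remaining = list(features)
    history = [list(remaining)]
    while len(remaining) > min_features:
        scores = importance_fn(remaining)
        # Rank the remaining descriptors from most to least important.
        ranked = sorted(remaining, key=lambda name: scores[name], reverse=True)
        # Assumed rounding: floor of a quarter, but always drop at least one.
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_keep = max(min_features, len(ranked) - n_drop)
        remaining = ranked[:n_keep]
        history.append(list(remaining))
    return history


# Hypothetical descriptors with fixed, made-up importance scores.
names = [f"d{i}" for i in range(16)]
fixed_scores = {name: 16 - i for i, name in enumerate(names)}


def toy_importance(current):
    # Placeholder for a random forest importance ranking.
    return {name: fixed_scores[name] for name in current}


subsets = recursive_elimination(names, toy_importance)
print([len(s) for s in subsets])
```

Each subset in the returned history could then be scored by cross-validation to pick the best-performing one; how that choice was made in the study is not stated in this description.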
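
Two of the comparison metrics named above can be illustrated with a short standard-library sketch. It assumes that the Tanimoto coefficient is computed on the two sets of selected variable names (intersection size over union size), and that the average pairwise Pearson coefficient is the mean over all pairs of selected descriptor columns. Whether absolute values are averaged is not stated here, so `use_abs` is an assumed option. The "highest pairwise correlation rate" is not defined in this description and is not attempted. All function names and the descriptor names in the example are hypothetical.

```python
from itertools import combinations
from math import sqrt


def tanimoto(selected_a, selected_b):
    """Tanimoto (Jaccard) similarity between two sets of selected variables."""
    a, b = set(selected_a), set(selected_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


def average_pairwise_pearson(columns, use_abs=True):
    """Mean Pearson r over all pairs of columns (dict: name -> values)."""
    pairs = list(combinations(columns, 2))
    if not pairs:
        raise ValueError("need at least two columns")
    values = [pearson(columns[p], columns[q]) for p, q in pairs]
    if use_abs:  # assumption: correlation strength matters, not its sign
        values = [abs(v) for v in values]
    return sum(values) / len(values)


# Made-up selections: one shared descriptor out of five distinct ones.
rf_selected = {"logP", "MW", "TPSA", "HBD"}
lasso_selected = {"logP", "nRot"}
print(tanimoto(rf_selected, lasso_selected))
```

A low Tanimoto value, such as the values below 0.20 reported for seven of the eight data sets, means the two methods picked largely different variables.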
Created:
2016-02-13



