The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity
Source: NIAID Data Ecosystem (indexed 2026-03-08)
Download link:
https://figshare.com/articles/dataset/The_Relative_Importance_of_Domain_Applicability_Metrics_for_Estimating_Prediction_Errors_in_QSAR_Varies_with_Training_Set_Diversity/2156140
Description:
In QSAR, a statistical model is generated from a training set of
molecules (represented by chemical descriptors) and their biological
activities (an “activity model”). The aim of the field
of domain applicability (DA) is to estimate the uncertainty of prediction
of a specific molecule on a specific activity model. A number of DA
metrics have been proposed in the literature for this purpose. A quantitative
model of the prediction uncertainty (an “error model”)
can be built using one or more of these metrics. A previous publication
from our laboratory (Sheridan, R. P. J.
Chem. Inf. Model. 2013, 53, 2837−2850) suggested that
QSAR methods such as random forest could be used to build error models
by fitting unsigned prediction errors against DA metrics. The QSAR
paradigm contains two useful techniques: descriptor importance can
determine which DA metrics are most useful, and cross-validation can
be used to tell which subset of DA metrics is sufficient to estimate
the unsigned errors. Previously we studied 10 large, diverse data
sets and seven DA metrics. For those data sets for which it is possible
to build a significant error model from those seven metrics, only
two metrics were sufficient to account for almost all of the information
in the error model. These were TREE_SD (the variation of prediction
among random forest trees) and PREDICTED (the predicted activity itself).
In this paper we show that when data sets are less diverse, as for
example in QSAR models of molecules in a single chemical series, these
two DA metrics become less important in explaining prediction error,
and the DA metric SIMILARITYNEAREST1 (the similarity of the molecule
being predicted to the closest training set compound) becomes more
important. Our recommendation is that when the mean pairwise similarity
(measured with the Carhart AP descriptor and the Dice similarity index)
within a QSAR training set is less than 0.5, one can use only TREE_SD and
PREDICTED to form the error model, but otherwise one should use TREE_SD,
PREDICTED, and SIMILARITYNEAREST1.
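
The recommendation above amounts to a simple decision rule, which can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each molecule is already represented as a bag of descriptor counts (a stand-in for Carhart atom pairs) and uses the count-based form of the Dice index. The function names and the toy descriptors are hypothetical.

```python
from collections import Counter
from itertools import combinations


def dice_similarity(a: Counter, b: Counter) -> float:
    """Count-based Dice index: 2 * shared counts / (total counts in a and b)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[key], b[key]) for key in a.keys() & b.keys())
    return 2.0 * shared / total


def mean_pairwise_similarity(descriptors) -> float:
    """Average Dice similarity over all unordered pairs in the training set."""
    pairs = list(combinations(descriptors, 2))
    if not pairs:
        raise ValueError("need at least two molecules")
    return sum(dice_similarity(a, b) for a, b in pairs) / len(pairs)


def recommended_da_metrics(descriptors, threshold: float = 0.5):
    """Apply the 0.5 rule stated in the abstract (threshold taken from the text)."""
    metrics = ["TREE_SD", "PREDICTED"]
    if mean_pairwise_similarity(descriptors) >= threshold:
        # Less diverse training set: nearest-neighbor similarity is also needed.
        metrics.append("SIMILARITYNEAREST1")
    return metrics


# Toy descriptors (hypothetical): keys stand in for atom-pair types.
diverse_set = [Counter(a=2, b=1), Counter(c=3), Counter(a=1, d=2)]
single_series = [Counter(a=2, b=1), Counter(a=2, b=1, c=1), Counter(a=2, b=2)]

print(recommended_da_metrics(diverse_set))
print(recommended_da_metrics(single_series))
```

For real molecules, the toy Counter objects would be replaced by atom-pair counts from a cheminformatics toolkit; the decision rule itself is unchanged.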
Created:
2016-02-13



