Using Random Forest To Model the Domain Applicability of Another Random Forest Model
Source: NIAID Data Ecosystem (indexed 2026-03-08)
Download link:
https://figshare.com/articles/dataset/Using_Random_Forest_To_Model_the_Domain_Applicability_of_Another_Random_Forest_Model/2350501
Description:
In QSAR, a statistical model is generated from a training set of
molecules (represented by chemical descriptors) and their biological
activities. We will call this traditional type of QSAR model an “activity
model”. The activity model can be used to predict the activities
of molecules not in the training set. A relatively new subfield of
QSAR is domain applicability. The aim is to estimate the reliability
of the prediction made for a specific molecule by a specific activity model.
A number of different metrics have been proposed in the literature
for this purpose. It is desirable to build a quantitative model of
reliability against one or more of these metrics. We can call this
an “error model”. A previous publication from our laboratory
(Sheridan, J. Chem. Inf. Model. 2012, 52, 814–823) suggested that the simultaneous
use of three metrics would be more discriminating than any one metric.
An error model could be built in the form of a three-dimensional set
of bins. When the number of metrics exceeds three, however, the bin
paradigm is not practical. An obvious solution for constructing an
error model using multiple metrics is to use a QSAR method, in our
case random forest. In this paper we demonstrate the usefulness of
this paradigm, specifically for determining whether a useful error
model can be built and which metrics are most useful for a given problem.
For the ten data sets and for the seven metrics we examine here, it
appears that it is possible to construct a useful error model using
only two metrics (TREE_SD and PREDICTED). These do not require calculating
similarities/distances between the molecules being predicted and the
molecules used to build the activity model, which can be rate-limiting.
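
The two-metric idea described above can be sketched in Python with scikit-learn. This is a minimal illustration and not the authors' procedure. It assumes that TREE_SD is the standard deviation of the per-tree predictions of the activity model and that PREDICTED is the forest's predicted activity; the abstract names these metrics without defining them. The data are synthetic (`make_regression`) and stand in for real descriptors and activities. The three-way split and the helper name `reliability_metrics` are also illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def reliability_metrics(forest, X):
    """Return an (n_samples, 2) array of [TREE_SD, PREDICTED] for each row of X.

    Assumed definitions: TREE_SD is the spread of the individual tree
    predictions, PREDICTED is their mean (the forest prediction).
    """
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    tree_sd = per_tree.std(axis=0)
    predicted = per_tree.mean(axis=0)
    return np.column_stack([tree_sd, predicted])


# Synthetic stand-in for descriptors (X) and activities (y).
X, y = make_regression(n_samples=600, n_features=20, noise=10.0, random_state=0)

# Three disjoint parts: activity-model training, error-model training, new molecules.
X_act, X_rest, y_act, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_err, X_new, y_err, y_new = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1. Activity model: descriptors -> activity.
activity_model = RandomForestRegressor(n_estimators=100, random_state=0)
activity_model.fit(X_act, y_act)

# 2. Metrics and observed absolute errors on molecules the activity model has not seen.
metrics_err = reliability_metrics(activity_model, X_err)
abs_error_err = np.abs(y_err - metrics_err[:, 1])

# 3. Error model: [TREE_SD, PREDICTED] -> absolute prediction error.
error_model = RandomForestRegressor(n_estimators=100, random_state=0)
error_model.fit(metrics_err, abs_error_err)

# 4. Estimate the reliability of predictions for new molecules.
metrics_new = reliability_metrics(activity_model, X_new)
expected_error = error_model.predict(metrics_new)
observed_error = np.abs(y_new - metrics_new[:, 1])

print("mean expected |error|:", expected_error.mean())
print("mean observed |error|:", observed_error.mean())
```

Neither metric requires a similarity or distance calculation against the activity model's training set, which is the practical advantage the abstract points to. After fitting, `error_model.feature_importances_` gives a rough indication of which metric contributes more to the error model.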
Created:
2013-11-25



