Using Random Forest To Model the Domain Applicability of Another Random Forest Model

NIAID Data Ecosystem, indexed 2026-03-08
Download link:
https://figshare.com/articles/dataset/Using_Random_Forest_To_Model_the_Domain_Applicability_of_Another_Random_Forest_Model/2350480
Description:
In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities. We will call this traditional type of QSAR model an “activity model”. The activity model can be used to predict the activities of molecules not in the training set. A relatively new subfield for QSAR is domain applicability. The aim is to estimate the reliability of prediction of a specific molecule on a specific activity model. A number of different metrics have been proposed in the literature for this purpose. It is desirable to build a quantitative model of reliability against one or more of these metrics. We can call this an “error model”. A previous publication from our laboratory (Sheridan J. Chem. Inf. Model., 2012, 52, 814–823.) suggested the simultaneous use of three metrics would be more discriminating than any one metric. An error model could be built in the form of a three-dimensional set of bins. When the number of metrics exceeds three, however, the bin paradigm is not practical. An obvious solution for constructing an error model using multiple metrics is to use a QSAR method, in our case random forest. In this paper we demonstrate the usefulness of this paradigm, specifically for determining whether a useful error model can be built and which metrics are most useful for a given problem. For the ten data sets and for the seven metrics we examine here, it appears that it is possible to construct a useful error model using only two metrics (TREE_SD and PREDICTED). These do not require calculating similarities/distances between the molecules being predicted and the molecules used to build the activity model, which can be rate-limiting.
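
To make the bin paradigm described above concrete, here is a minimal sketch of a two-metric error model in plain Python. It rests on assumptions the abstract does not spell out: PREDICTED is taken to be the mean prediction across the trees of the activity model, TREE_SD the standard deviation of those per-tree predictions, and the error in each bin is summarized as an RMSE. All function names, bin edges, and data values are hypothetical and for illustration only. This is the binned lookup, not the random forest error model the paper proposes; that approach would replace the lookup with a regressor trained on the same metrics.

```python
import math
import statistics
from collections import defaultdict


def tree_metrics(per_tree_predictions):
    """Return (PREDICTED, TREE_SD) for one molecule.

    Assumed definitions: mean and population standard deviation of the
    predictions made by the individual trees of the activity model.
    """
    predicted = statistics.fmean(per_tree_predictions)
    tree_sd = statistics.pstdev(per_tree_predictions)
    return predicted, tree_sd


def bin_index(value, edges):
    """Index of the bin holding value, given ascending interior edges."""
    return sum(value >= edge for edge in edges)


def fit_binned_error_model(records, predicted_edges, sd_edges):
    """Map each (PREDICTED bin, TREE_SD bin) to the RMSE observed in it.

    records: iterable of (per_tree_predictions, observed_activity) pairs,
    ideally from molecules the activity model was not trained on.
    """
    squared_errors = defaultdict(list)
    for per_tree, observed in records:
        predicted, tree_sd = tree_metrics(per_tree)
        key = (bin_index(predicted, predicted_edges),
               bin_index(tree_sd, sd_edges))
        squared_errors[key].append((predicted - observed) ** 2)
    return {key: math.sqrt(statistics.fmean(errs))
            for key, errs in squared_errors.items()}


def expected_error(model, per_tree, predicted_edges, sd_edges):
    """Look up the expected error for a new molecule; None if its bin is empty."""
    predicted, tree_sd = tree_metrics(per_tree)
    key = (bin_index(predicted, predicted_edges),
           bin_index(tree_sd, sd_edges))
    return model.get(key)


# Hypothetical toy data: (per-tree predictions, observed activity).
records = [
    ([5.0, 5.2, 4.8], 5.1),  # trees agree, small error
    ([5.1, 4.9, 5.0], 4.8),  # trees agree, small error
    ([4.0, 6.0, 8.0], 7.5),  # trees disagree, large error
    ([7.0, 5.0, 9.0], 5.0),  # trees disagree, large error
]
predicted_edges = [5.5]  # two PREDICTED bins
sd_edges = [0.5]         # two TREE_SD bins

model = fit_binned_error_model(records, predicted_edges, sd_edges)
new_molecule_error = expected_error(model, [6.5, 7.5, 9.0],
                                    predicted_edges, sd_edges)
```

With two metrics the lookup table stays small, but each added metric multiplies the number of bins, and most of them end up empty (`expected_error` returns `None` for those). That sparsity is the practical limit the abstract points to when the number of metrics exceeds three.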

Created:
2016-02-18