Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://figshare.com/articles/dataset/Enlarging_Applicability_Domain_of_Quantitative_Structure_Activity_Relationship_Models_through_Uncertainty-Based_Active_Learning/19099781
Resource description:
The first step in developing a quantitative structure–activity
relationship (QSAR) model is to identify a set of chemicals with known
activities/properties, which can be either collected from the published
studies or measured experimentally. A key challenge in this process
is how to determine which chemicals are used to train a QSAR model,
and, of those chemicals, which should be prioritized in experimental
trials to ensure that the obtained models have large applicability
domains (ADs). In this study, we employ uncertainty-based active learning
(AC) to address this challenge. We use a Gaussian process (GP) to
develop QSAR models for three public datasets (Koc, solubility, and k•OH), each containing chemicals
represented by molecular descriptors. The GP reports the uncertainty of each
prediction as a standard deviation. The training chemicals of each dataset are selected in
two different ways: (1) random splitting (RS) and (2) uncertainty-based
AC. Uncertainty-based AC iteratively identifies chemicals with the
highest uncertainty and selects them for model training. We demonstrate
that the chemicals selected by AC are more diverse than those selected
by RS and that AC-based QSAR models have better generalizability than
those derived from RS. We then use these two types of models to predict
the properties of chemicals in the REACH dataset (>300,000 chemicals)
and assess their ADs using five different AD determination methods.
We demonstrate that, under all five AD methods, the ADs of AC-based QSAR models
are significantly larger than those of RS-based models (up to 24 times
larger). This study provides a novel method to enlarge the AD of QSAR
models, which can guide model development and improve the property
prediction reliability for more REACH dataset chemicals while minimizing
the development cost and time.
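
The selection loop described above (fit a GP, find the chemical with the highest predictive standard deviation, add it to the training set, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the RBF kernel, the fixed `length_scale` and `noise` values, the one-chemical-per-iteration batch size, and the synthetic descriptors are all assumptions made for illustration. With fixed hyperparameters as assumed here, the GP standard deviation depends only on the descriptors, so the selection is effectively a coverage criterion; a model that refits its hyperparameters at each iteration would also use the measured values.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(X_train, y_train, X_query, length_scale=1.0, noise=1e-3):
    """Return the GP posterior mean and standard deviation at X_query."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_query, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - (v**2).sum(axis=0)  # prior variance k(x, x) is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))

def select_by_uncertainty(X, y, n_initial, n_total, rng):
    """Grow a training set by repeatedly adding the most uncertain chemical."""
    selected = [int(i) for i in rng.choice(len(X), size=n_initial, replace=False)]
    while len(selected) < n_total:
        pool = np.setdiff1d(np.arange(len(X)), selected)
        _, std = gp_predict(X[selected], y[selected], X[pool])
        selected.append(int(pool[np.argmax(std)]))
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))         # synthetic stand-in for molecular descriptors
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]   # synthetic stand-in for a measured property

# Uncertainty-based selection versus a random split (RS) of the same size.
al_idx = select_by_uncertainty(X, y, n_initial=5, n_total=30, rng=rng)
rs_idx = [int(i) for i in rng.choice(len(X), size=30, replace=False)]
```

Comparing models trained on `X[al_idx]` and `X[rs_idx]` against held-out chemicals would mirror the AC versus RS comparison in the abstract, although the synthetic data here says nothing about the reported results.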
Created:
2022-01-31



