In Silico Target Predictions: Defining a Benchmarking Data Set and Comparison of Performance of the Multiclass Naïve Bayes and Parzen-Rosenblatt Window
Indexed from NIAID Data Ecosystem on 2026-03-07
Download link:
https://figshare.com/articles/dataset/In_Silico_Target_Predictions_Defining_a_Benchmarking_Data_Set_and_Comparison_of_Performance_of_the_Multiclass_Nai_ve_Bayes_and_Parzen_Rosenblatt_Window/2383792
Description:
In this study, two probabilistic machine-learning algorithms were compared
for in silico target prediction of bioactive molecules, namely the
well-established Laplacian-modified Naïve Bayes classifier
(NB) and the more recently introduced (to cheminformatics) Parzen-Rosenblatt
Window. Both classifiers were trained in conjunction with circular
fingerprints on a large data set of bioactive compounds extracted
from ChEMBL, covering 894 human protein targets with more than 155,000
ligand-protein pairs. This data set is also provided as a benchmark
data set for future target prediction methods due to its size as well
as the number of bioactivity classes it contains. In addition to evaluating
the methods, different performance measures were explored. Choosing such
measures is not as straightforward as in binary classification settings, due to
the number of classes, the possibility of multiple class memberships,
and the need to translate model scores into “yes/no”
predictions for assessing model performance. Both algorithms achieved
a recall of correct targets that exceeds 80% in the top 1% of predictions.
Performance depends significantly on the underlying diversity and
size of a given class of bioactive compounds, with small classes and
low structural similarity affecting both algorithms to different degrees.
When tested on an external test set covering more than 500 targets, extracted
from WOMBAT by excluding all compounds with Tanimoto similarity
above 0.8 to compounds from the ChEMBL data set, the current methodologies
achieved a recall of 63.3% and 66.6% among the top 1% for Naïve
Bayes and Parzen-Rosenblatt Window, respectively. While those numbers
seem to indicate lower performance, they are also more realistic for
settings where protein targets need to be established for novel chemical
substances.
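
The two evaluation ideas mentioned in the description, the Tanimoto similarity cutoff used to build the external test set and the recall of correct targets among the top 1% of predictions, can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes fingerprints are represented as sets of on-bit indices, that "top 1%" refers to the top 1% of the ranked target list for a single compound, and that recall is the fraction of a compound's known targets found in that slice. All function names are hypothetical.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def filter_dissimilar(candidates, reference, threshold=0.8):
    """Keep candidates whose similarity to every reference fingerprint
    does not exceed the threshold (0.8 in the description above)."""
    return [
        fp for fp in candidates
        if all(tanimoto(fp, ref) <= threshold for ref in reference)
    ]


def recall_in_top_fraction(scores, true_targets, fraction=0.01):
    """Fraction of a compound's known targets ranked within the top
    `fraction` of all targets (assumed reading of "top 1%").

    scores: dict mapping target id -> model score for one compound
    true_targets: non-empty set of known target ids for that compound
    """
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[:k])
    return len(top & true_targets) / len(true_targets)


if __name__ == "__main__":
    # Toy data only; these values are illustrative, not taken from the data set.
    train = [{1, 2, 3, 4, 5, 6}]
    external = [{1, 2, 3, 4, 5}, {7, 8}]
    print(filter_dissimilar(external, train))

    toy_scores = {f"T{i}": float(i) for i in range(200)}
    print(recall_in_top_fraction(toy_scores, {"T199", "T5"}))
```

With 894 targets, a 1% cutoff under this reading corresponds to the 9 highest-ranked targets per compound. Averaging the per-compound value over a test set is one possible way to summarise it; the description does not say how the reported figures were aggregated.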
Created:
2013-08-26



