STarFish: A Stacked Ensemble Target Fishing Approach and its Application to Natural Products
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://figshare.com/articles/dataset/STarFish_A_Stacked_Ensemble_Target_Fishing_Approach_and_its_Application_to_Natural_Products/10040981
Description:
Target fishing is the process of
identifying the protein target
of a bioactive small molecule. To do so experimentally requires a
significant investment of time and resources, which can be expedited
with a reliable computational target fishing model. The development
of computational target fishing models using machine learning has
become very popular over the last several years because of the increased
availability of large amounts of public bioactivity data. Unfortunately,
the applicability and performance of such models for natural products
have not yet been comprehensively assessed. This is, in part, due to
the relative lack of bioactivity data available for natural products
compared to synthetic compounds. Moreover, the databases commonly
used to train such models do not annotate which compounds are natural
products, which makes the collection of a benchmarking set difficult.
To address this knowledge gap, a data set composed of natural product
structures and their associated protein targets was generated by cross-referencing
20 publicly available natural product databases with the bioactivity
database ChEMBL. This data set contains 5589 compound–target
pairs for 1943 unique compounds and 1023 unique targets. A synthetic
data set comprising 107,190 compound–target pairs for
88,728 unique compounds and 1907 unique targets was used to
train k-nearest neighbors, random forest, and multilayer
perceptron models. The predictive performance of each model was assessed
by stratified 10-fold cross-validation and benchmarking on the newly
collected natural product data set. Strong performance was observed
for each model during cross-validation with area under the receiver
operating characteristic (AUROC) scores ranging from 0.94 to 0.99
and Boltzmann-enhanced discrimination of receiver operating characteristic
(BEDROC) scores from 0.89 to 0.94. When tested on the natural product
data set, performance dramatically decreased with AUROC scores ranging
from 0.70 to 0.85 and BEDROC scores from 0.43 to 0.59. However, the
implementation of a model stacking approach, which uses logistic regression
as a meta-classifier to combine model predictions, dramatically improved
the ability to correctly predict the protein targets of natural products
and increased the AUROC score to 0.94 and BEDROC score to 0.73. This
stacked model was deployed as a web application, called STarFish,
and has been made available for use to aid in target identification
for natural products.
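
The BEDROC scores quoted above reward models that place true targets near the top of the ranked list, which AUROC alone does not capture. The following is a minimal sketch of how BEDROC can be computed for a single ranked list, using the standard normalized form of the robust initial enhancement (RIE). The function name `bedroc`, the default `alpha=20.0`, and the toy scores are assumptions for illustration; the description does not state which alpha or implementation the authors used.

```python
import math


def bedroc(scores, labels, alpha=20.0):
    """BEDROC for one ranked list; labels are 1 (true target) or 0.

    alpha controls how strongly early ranks are weighted; 20.0 is an
    assumed value here, not one taken from the description above.
    """
    n_total = len(scores)
    n_active = sum(labels)
    if n_active == 0 or n_active == n_total:
        raise ValueError("need at least one active and one inactive item")

    # Rank by descending score; rank 1 is the top prediction.
    order = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    sum_exp = sum(
        math.exp(-alpha * rank / n_total)
        for rank, i in enumerate(order, start=1)
        if labels[i] == 1
    )

    ratio = n_active / n_total
    # RIE: observed exponential sum divided by its expectation
    # under a uniformly random ranking.
    random_sum = ratio * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = sum_exp / random_sum

    # Rescale RIE to [0, 1] using its best-case and worst-case values.
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


# Hypothetical scores for ten candidate targets, two of which are true targets.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
best = bedroc(scores, [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])   # true targets ranked first
worst = bedroc(scores, [0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # true targets ranked last
mixed = bedroc(scores, [0, 1, 0, 0, 1, 0, 0, 0, 0, 0])
print(best, mixed, worst)
```

With this normalization, a ranking that places every true target first scores 1 and one that places them all last scores 0, up to floating-point rounding. Tied scores keep their input order in this sketch, which is a simplification.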
Created:
2019-10-07



