Development and rigorous validation of antimalarial predictive models using machine learning approaches
Source: DataCite Commons | Updated: 2020-08-26 | Indexed: 2024-07-27
Download link:
https://tandf.figshare.com/articles/Development_and_rigorous_validation_of_antimalarial_predictive_models_using_machine_learning_approaches/8975951
Description:
A large collection of known, experimentally verified compounds from the ChEMBL database was used to build classification models for predicting antimalarial activity against Plasmodium falciparum. Four machine learning methods, namely support vector machine (SVM), random forest (RF), k-nearest neighbour (kNN) and XGBoost, were used to develop models on this diverse antimalarial dataset. A well-established feature selection framework was used to select the best subset from a larger pool of descriptors. Model performance was rigorously evaluated through applicability domain analysis, Y-scrambling and AUC-ROC curves. The predictive power of the models was additionally assessed using probability calibration and predictiveness curves. SVM and XGBoost performed best, yielding an accuracy of ~85% on the independent test set. In terms of probability prediction, SVM and XGBoost were well calibrated. Total gain (TG) from the predictiveness curve also favoured SVM (TG = 0.67) and XGBoost (TG = 0.75). These models also assigned high probability scores to the high-affinity compounds from a PubChem antimalarial bioassay used as external validation. Our findings suggest that the selected models are robust and potentially useful for facilitating the discovery of antimalarial agents.
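Y-scrambling, one of the validation checks named in the description, can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the classifier is a toy k-nearest-neighbour vote written with the standard library only, and every function name (`knn_predict`, `loo_accuracy`, `y_scramble`) is hypothetical. The idea is that a model trained on shuffled activity labels should score near chance; if it does not, the original score may reflect chance correlation rather than real structure-activity signal.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = [train_y[i] for i in order[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0


def loo_accuracy(X, y, k=3):
    """Leave-one-out accuracy of the toy kNN classifier."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k) == y[i]
    return hits / len(X)


def y_scramble(X, y, n_rounds=50, k=3, seed=0):
    """Compare the score on true labels with scores on randomly permuted labels."""
    rng = random.Random(seed)
    true_score = loo_accuracy(X, y, k)
    scrambled = []
    for _ in range(n_rounds):
        y_perm = y[:]          # copy so the true labels stay untouched
        rng.shuffle(y_perm)    # break any link between descriptors and labels
        scrambled.append(loo_accuracy(X, y_perm, k))
    # Empirical p-value: share of scrambled runs that match or beat the true score.
    p_value = (1 + sum(s >= true_score for s in scrambled)) / (n_rounds + 1)
    return true_score, scrambled, p_value


# Synthetic stand-in for descriptor vectors: two well-separated 2-D clusters
# (an assumption for illustration, not ChEMBL data).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)]
X += [[data_rng.gauss(6, 1), data_rng.gauss(6, 1)] for _ in range(30)]
y = [0] * 30 + [1] * 30

true_score, scrambled, p_value = y_scramble(X, y)
print("accuracy on true labels:", true_score)
print("mean accuracy on scrambled labels:", sum(scrambled) / len(scrambled))
print("empirical p-value:", p_value)
```

A large gap between the true-label score and the scrambled-label scores, together with a small empirical p-value, is the pattern a robust model is expected to show. With scikit-learn available, `sklearn.model_selection.permutation_test_score` performs a comparable label-permutation test for any estimator.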
Provider:
Taylor & Francis
Created:
2019-09-05



