Development and rigorous validation of antimalarial predictive models using machine learning approaches

Name: Development and rigorous validation of antimalarial predictive models using machine learning approaches
Creator: Taylor & Francis
Published: 2020-08-26 22:26:51
License: Not specified

Source: DataCite Commons (updated 2020-08-26, indexed 2024-07-27)

Download link:

https://tandf.figshare.com/articles/Development_and_rigorous_validation_of_antimalarial_predictive_models_using_machine_learning_approaches/8975951

Description:

A large collection of known, experimentally verified compounds from the ChEMBL database was used to build classification models for predicting antimalarial activity against *Plasmodium falciparum*. Four machine learning methods, namely support vector machine (SVM), random forest (RF), k-nearest neighbour (kNN) and XGBoost, were used to develop models on this diverse antimalarial dataset. A well-established feature selection framework was used to select the best subset from a larger pool of descriptors. Model performance was rigorously evaluated through applicability domain analysis, Y-scrambling and AUC-ROC curves. The predictive power of the models was additionally assessed using probability calibration and predictiveness curves. SVM and XGBoost showed the best performance, yielding an accuracy of ~85% on the independent test set. In terms of probability prediction, SVM and XGBoost were well calibrated. Total gain (TG) from the predictiveness curve also favoured SVM (TG = 0.67) and XGBoost (TG = 0.75). These models also predicted the high-affinity compounds from a PubChem antimalarial bioassay (used as external validation) with high probability scores. Our findings suggest that the selected models are robust and potentially useful for facilitating the discovery of antimalarial agents.
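
The Y-scrambling check named in the description can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic two-dimensional points standing in for molecular descriptors, the classifier is a small hand-written kNN, and every function name below (`knn_predict`, `accuracy`, `make_toy_data`, `y_scrambling`) is hypothetical. The idea shown is only the general one: refit the model on shuffled activity labels and confirm that its accuracy falls to roughly chance level, which suggests the original model did not succeed by coincidence.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Predict a 0/1 label for x by majority vote among its k nearest training points."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    votes = sum(train_y[i] for i in nearest)  # number of neighbours labelled 1
    return 1 if 2 * votes > k else 0


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test points whose kNN prediction matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def make_toy_data(rng, n_per_class=100):
    """Assumption: two Gaussian clusters stand in for inactive (0) and active (1) compounds."""
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 6.0)):
        for _ in range(n_per_class):
            X.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
            y.append(label)
    return X, y


def y_scrambling(n_rounds=20, seed=0):
    """Return (accuracy with true labels, mean accuracy over label-shuffled refits)."""
    rng = random.Random(seed)
    X, y = make_toy_data(rng)

    # Random 80/20 split into training and held-out test points.
    idx = list(range(len(X)))
    rng.shuffle(idx)
    split = int(0.8 * len(idx))
    train, test = idx[:split], idx[split:]
    train_X, train_y = [X[i] for i in train], [y[i] for i in train]
    test_X, test_y = [X[i] for i in test], [y[i] for i in test]

    true_acc = accuracy(train_X, train_y, test_X, test_y)

    # Scramble only the training labels; the test labels stay untouched.
    scrambled = []
    for _ in range(n_rounds):
        shuffled_y = train_y[:]
        rng.shuffle(shuffled_y)
        scrambled.append(accuracy(train_X, shuffled_y, test_X, test_y))

    return true_acc, sum(scrambled) / len(scrambled)


true_acc, mean_scrambled_acc = y_scrambling()
print(f"true labels: {true_acc:.2f}, scrambled labels (mean): {mean_scrambled_acc:.2f}")
```

On real data the same comparison would be run with the actual descriptor matrix and the chosen classifier; a large gap between the two numbers is the outcome a Y-scrambling test looks for.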

Provider:

Taylor & Francis

Created:

2019-09-05
