Improving (Q)SAR predictions by examining bias in the selection of compounds for experimental testing
Source: DataCite Commons | Updated: 2020-08-26 | Indexed: 2024-07-27
Download link:
https://tandf.figshare.com/articles/Improving_Q_SAR_predictions_by_examining_bias_in_the_selection_of_compounds_for_experimental_testing/9895979/1
Description:
Existing data on chemical structures and biological activities are limited and unevenly distributed across molecular targets and chemical compounds. This raises the question of whether these data represent an unbiased sample of the general population of chemical-biological interactions. To answer it, we analyzed ChEMBL data for 87,583 molecules tested against 919 protein targets using supervised and unsupervised approaches. Hierarchical clustering of the Murcko frameworks generated with the Chemistry Development Toolkit showed that the available data form a large, diffuse cloud without apparent structure. In contrast, PASS-based classifiers could predict whether a compound had been tested against a particular molecular target, regardless of whether it was active. One may therefore conclude that the selection of chemical compounds for testing against specific targets is biased, probably because of prior knowledge. We assessed whether this bias can be used to improve (Q)SAR predictions: PASS prediction of the interaction with a particular target is significantly more accurate for compounds predicted as tested against that target than for compounds predicted as untested (average ROC AUC of about 0.87 and 0.75, respectively). Thus, accounting for the existing bias in the training set may increase the performance of virtual screening.
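The comparison of ROC AUC between compounds predicted as tested and those predicted as untested can be illustrated with a minimal sketch. This is not the authors' PASS workflow: the records are made-up toy values, the function names `roc_auc` and `stratified_auc` are hypothetical, and the 0.5 cut-off on the "tested" probability is an assumption chosen only for illustration. The description reports an average over targets, whereas this sketch scores a single hypothetical target.

```python
def roc_auc(labels, scores):
    """ROC AUC computed as the fraction of (active, inactive) pairs
    ranked correctly; tied scores count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(records, threshold=0.5):
    """Split compounds by the predicted probability of having been tested
    (threshold is an illustrative assumption), then score each subset."""
    groups = {"predicted_tested": [], "predicted_untested": []}
    for tested_prob, activity_score, is_active in records:
        key = "predicted_tested" if tested_prob >= threshold else "predicted_untested"
        groups[key].append((is_active, activity_score))
    return {
        name: roc_auc([y for y, _ in rows], [s for _, s in rows])
        for name, rows in groups.items()
    }


# Toy, invented records for one hypothetical target:
# (predicted P(tested), predicted activity score, true activity label)
records = [
    (0.90, 0.8, 1), (0.80, 0.7, 1), (0.70, 0.2, 0), (0.95, 0.4, 0),
    (0.20, 0.6, 1), (0.10, 0.5, 0), (0.30, 0.7, 0), (0.40, 0.3, 1),
]

result = stratified_auc(records)
print(result)
```

In a real analysis the two probabilities would come from separately trained classifiers (one for "tested vs. untested", one for "active vs. inactive"), and the per-target AUC values would be averaged over all targets.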
Provider:
Taylor & Francis
Created:
2019-09-24



