IFPTML Multi-Output Model for Anti-Retroviral Compounds Including the Drug Structure and Target Protein Sequence Information

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://figshare.com/articles/dataset/IFPTML_Multi-Output_Model_for_Anti-Retroviral_Compounds_Including_the_Drug_Structure_and_Target_Protein_Sequence_Information/28882284


Resource description:

Retroviruses such as HIV cause significant diseases in humans and other organisms, making the discovery of antiretroviral (ARV) drugs a critical priority. While databases like ChEMBL contain valuable information, their complexity poses challenges. The data set includes >140,000 assays across eight viruses, encompassing >350 biological activity parameters, >50 target proteins, >80 cell lines, >60 assay organisms, and >770 viral strains. Artificial Intelligence/Machine Learning (AI/ML) models offer a promising approach to accelerate ARV discovery. Recently, we developed AI/ML models for ChEMBL ARV data using the Information Fusion Perturbation Theory and Machine Learning (IFPTML) strategy. However, neither existing AI/ML models nor our prior IFPTML implementation simultaneously incorporates viral protein sequences, strains, cell lines, assay organisms, or virus/human mutations. This limitation renders them ineffective for predicting activity against amino acid sequence variations (e.g., mutations, variants, or emerging strains), a critical shortcoming given the well-documented prevalence of drug-resistance mutations affecting marketed ARVs. In this work, we present an enhanced IFPTML model integrating protein sequence descriptors. We computed and incorporated sequence descriptors for all drug target proteins in ChEMBL, derived from proteomes of retroviruses (HIV, FeLV, MMV, SIV, etc.). The model demonstrated robust performance, with sensitivity (Sn), specificity (Sp), and accuracy (Ac) values ranging between 72.0 and 88.0% in both training and validation phases. We analyze its predictions for protein mutations documented in ChEMBL and other literature sources. To our knowledge, this represents the first unified multicondition, multioutput model for ARV discovery that systematically accounts for protein sequence information.
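
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how sensitivity (Sn), specificity (Sp), and accuracy (Ac) are conventionally computed for a binary active/inactive classifier. The function name `classification_metrics` and the toy labels are hypothetical and chosen only for illustration; they are not taken from the data set or from the authors' code, and the authors' exact evaluation procedure may differ.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return Sn, Sp, and Ac as percentages for binary labels (1 = active, 0 = inactive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        # Sensitivity: fraction of true actives predicted as active.
        "Sn": 100.0 * tp / (tp + fn) if (tp + fn) else nan,
        # Specificity: fraction of true inactives predicted as inactive.
        "Sp": 100.0 * tn / (tn + fp) if (tn + fp) else nan,
        # Accuracy: fraction of all cases predicted correctly.
        "Ac": 100.0 * (tp + tn) / total if total else nan,
    }


# Toy labels (hypothetical, not from the data set): 4 actives, 6 inactives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting Sn and Sp alongside Ac matters for assay data like this, where active and inactive outcomes are often imbalanced and accuracy alone can hide poor performance on the minority class.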

Created:

2025-04-28
