IFPTML Multi-Output Model for Anti-Retroviral Compounds Including the Drug Structure and Target Protein Sequence Information
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://figshare.com/articles/dataset/IFPTML_Multi-Output_Model_for_Anti-Retroviral_Compounds_Including_the_Drug_Structure_and_Target_Protein_Sequence_Information/28882290
Resource description:
Retroviruses such as HIV cause significant diseases in humans and
other organisms, making the discovery of antiretroviral (ARV) drugs
a critical priority. While databases like ChEMBL contain valuable
information, their complexity poses challenges. The data set includes
>140,000 assays across eight viruses, encompassing >350 biological
activity parameters, >50 target proteins, >80 cell lines, >60 assay
organisms, and >770 viral strains. Artificial Intelligence/Machine
Learning (AI/ML) models offer a promising approach to accelerate ARV
discovery. Recently, we developed AI/ML models for ChEMBL ARV data
using the Information Fusion Perturbation Theory and Machine Learning
(IFPTML) strategy. However, neither existing AI/ML models nor our
prior IFPTML implementation simultaneously incorporates viral protein
sequences, strains, cell lines, assay organisms, or virus/human mutations.
This limitation renders them ineffective for predicting activity against
amino acid sequence variations (e.g., mutations, variants, or emerging
strains), a critical shortcoming given the well-documented prevalence
of drug-resistance mutations in marketed ARVs. In this work, we present
an enhanced IFPTML model integrating protein sequence descriptors.
We computed and incorporated sequence descriptors for all drug target
proteins in ChEMBL, derived from proteomes of retroviruses (HIV, FeLV,
MMV, SIV, etc.). The model demonstrated robust performance, with sensitivity
(Sn), specificity (Sp), and accuracy (Ac) values ranging between 72.0%
and 88.0% in both training and validation phases. We analyze its predictions
for protein mutations documented in ChEMBL and other literature sources.
To our knowledge, this represents the first unified multicondition,
multioutput model for ARV discovery that systematically accounts for
protein sequence information.
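
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how sensitivity (Sn), specificity (Sp), and accuracy (Ac) are conventionally computed from the confusion-matrix counts of a binary classifier. The function name and the example counts are hypothetical and chosen only for illustration; they are not taken from the data set or from the authors' code.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (Sn, Sp, Ac) as percentages from confusion-matrix counts.

    tp/fn: active compounds predicted active/inactive.
    tn/fp: inactive compounds predicted inactive/active.
    """
    sn = 100.0 * tp / (tp + fn)                   # sensitivity: recall on actives
    sp = 100.0 * tn / (tn + fp)                   # specificity: recall on inactives
    ac = 100.0 * (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    return sn, sp, ac


# Hypothetical counts, for illustration only.
sn, sp, ac = classification_metrics(tp=80, tn=150, fp=50, fn=20)
print(f"Sn={sn:.1f}%  Sp={sp:.1f}%  Ac={ac:.1f}%")
```

Reporting Sn and Sp alongside Ac matters when active and inactive cases are imbalanced, since accuracy alone can hide poor performance on the minority class.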
Created:
2025-04-28



