A Curated, Standardised Database of Non-Nucleoside Reverse Transcriptase Inhibitors (NNRTIs) for QSAR and Machine Learning in HIV Drug Discovery.
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/A_Curated_Standardised_Database_of_Non-Nucleoside_Reverse_Transcriptase_Inhibitors_NNRTIs_for_QSAR_and_Machine_Learning_in_HIV_Drug_Discovery_/30901799
Resource description:
A Systematic Compilation of Data
(Newcastle University, School of Pharmacy, UK)
Author: Shreya Goswami
________________________________________
Abstract
Non-nucleoside reverse transcriptase inhibitors (NNRTIs) are vital components of combination therapy for HIV-1. Nevertheless, the development of robust Quantitative Structure-Activity Relationship (QSAR) and machine learning (ML) models in this domain is often hindered by the lack of centralised, standardised biological activity datasets. This study addresses that gap by systematically compiling and curating a database of NNRTIs in a standardised, reproducible form. Data were curated from primary literature and from chemical databases such as PubChem and BindingDB, and comprise chemical structures, SMILES codes, and quantitative biological activity values (IC50 and EC50). Structures were standardised using ChemDraw, and the final dataset contains 249 distinct NNRTI compounds (234 general inhibitors and 15 patented inhibitors). The collection is intended as a foundational resource to support more reliable virtual screening, strengthen Structure-Activity Relationship (SAR) studies, and aid the development of new computational models for anti-HIV drug discovery.
1. Introduction
There is still a strong need for new anti-HIV drugs, especially ones that are active against resistant strains, and meeting this need depends heavily on effective rational design methods. Non-nucleoside reverse transcriptase inhibitors (NNRTIs) act by binding to the viral reverse transcriptase (RT), thereby preventing viral replication.
The efficacy of computer-aided drug design (CADD) methodologies, including QSAR and predictive machine learning algorithms, is fundamentally dependent on the quality and structure of the input data. Current public and literature-based NNRTI activity data are frequently fragmented, use inconsistent structural formats, and report activity in non-standardised units, which complicates their integration into robust computational pipelines.
The goal of this project was to create a high-quality NNRTI database that researchers can easily use. The resource is intended to make computational research more open and reproducible, thereby accelerating the discovery and optimisation of anti-HIV drug candidates.
________________________________________
2. Materials and Methods
A) Data Sourcing and Curation Protocol
The dataset was compiled by systematically reviewing primary literature and established chemical and biological databases, such as PubChem and the Binding Database (BindingDB).
• Inclusion Criteria: Compounds were included if they had demonstrated NNRTI activity.
• Data Fields: The compiled fields comprise a unique compound identifier, the source reference paper, the molecular structure, the machine-readable SMILES code, and quantitative biological activity (specifically IC50 and EC50), along with a description of the assay type.
B) Data Standardisation
To ensure uniformity and compatibility with computational software, all data underwent rigorous standardisation:
1. Chemical Structure Generation: Molecular structures were drawn and standardised using ChemDraw software.
2. SMILES Encoding: The Simplified Molecular-Input Line-Entry System (SMILES) codes were generated to provide a standardised, universal format required for molecular descriptor calculation in QSAR/ML workflows.
3. Organisation: The resulting data was systematically organised in a spreadsheet format using Microsoft Excel to create a searchable and manageable resource.
________________________________________
3. Results and Discussion
A) Content and Features of the Database
The curation process yielded a database of 249 distinct NNRTI compounds. The dataset is divided into two subsets:
• General NNRTI Library: 234 compounds that can be used to train and test predictive models on a larger scale. For some inhibitors, RNase H and integrase strand transfer IC50 values are also noted in the comments.
• Patented Agents Subset: 15 compounds drawn from patent information, which can be used to study intellectual property and commercially valuable scaffolds.
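As an illustration of how the compiled fields and the two subsets might be handled programmatically, the following Python sketch models one record and counts compounds per subset. The column names (`compound_id`, `subset`, and so on), the CSV export, and the placeholder rows are assumptions made for this example; they are not taken from the published spreadsheet, and the SMILES shown are toy molecules rather than NNRTIs.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NNRTIRecord:
    """One row of the database. All field names are hypothetical."""
    compound_id: str
    reference: str
    smiles: str
    ic50: Optional[float]  # value as reported; units are not handled here
    ec50: Optional[float]
    assay_type: str
    subset: str            # assumed labels: "general" or "patented"


def _to_float(text: Optional[str]) -> Optional[float]:
    """Return a float, or None when the cell is empty or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def load_records(handle) -> List[NNRTIRecord]:
    """Read records from a CSV export of the spreadsheet (assumed layout)."""
    reader = csv.DictReader(handle)
    return [
        NNRTIRecord(
            compound_id=row["compound_id"].strip(),
            reference=row["reference"].strip(),
            smiles=row["smiles"].strip(),
            ic50=_to_float(row["ic50"]),
            ec50=_to_float(row["ec50"]),
            assay_type=row["assay_type"].strip(),
            subset=row["subset"].strip(),
        )
        for row in reader
    ]


# Placeholder rows for demonstration only; not real entries from the dataset.
SAMPLE_CSV = """\
compound_id,reference,smiles,ic50,ec50,assay_type,subset
EX-001,placeholder ref A,CCO,1.5,0.2,enzymatic,general
EX-002,placeholder ref B,c1ccccc1,12,,cell-based,general
EX-003,placeholder patent,CC(=O)O,0.8,0.05,enzymatic,patented
"""

records = load_records(io.StringIO(SAMPLE_CSV))
subset_counts = Counter(record.subset for record in records)
print(subset_counts)
```

Applied to the full dataset with an equivalent subset column, the same count would be expected to give 234 general and 15 patented entries.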
B) Applications in Computational Drug Discovery
This compiled dataset represents a significant step forward in facilitating CADD for HIV therapeutics.
> Enhanced Reproducibility: By providing standardised chemical identifiers (SMILES) and curated activity data, the resource directly supports computational reproducibility in the molecular docking and virtual screening community.
> QSAR and ML Modelling: The diversity of the compounds and the completeness of the activity data make the dataset valuable for SAR/QSAR modelling and for training advanced machine learning models.
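For regression-style QSAR work, reported potencies are commonly converted to a logarithmic scale, pIC50 = -log10(IC50 in mol/L), and restricted to a single assay type so that values are comparable. A minimal Python sketch follows; the unit labels, the `assay_type` key, and the example rows are hypothetical and do not reflect the dataset's actual column names or contents.

```python
import math

# Conversion factors from common concentration units to mol/L.
_UNIT_TO_MOLAR = {
    "M": 1.0,
    "mM": 1e-3,
    "uM": 1e-6,
    "µM": 1e-6,
    "nM": 1e-9,
    "pM": 1e-12,
}


def to_pic50(value: float, unit: str) -> float:
    """Convert an IC50 in the given unit to pIC50 = -log10(IC50 in mol/L)."""
    if value <= 0:
        raise ValueError("IC50 must be positive")
    try:
        factor = _UNIT_TO_MOLAR[unit]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None
    return -math.log10(value * factor)


def filter_by_assay(rows, assay_type):
    """Keep only rows measured with the given assay type."""
    return [row for row in rows if row["assay_type"] == assay_type]


# Hypothetical rows for demonstration only; not real entries from the dataset.
rows = [
    {"compound_id": "EX-001", "ic50": 1.0, "unit": "uM", "assay_type": "enzymatic"},
    {"compound_id": "EX-002", "ic50": 10.0, "unit": "nM", "assay_type": "cell-based"},
    {"compound_id": "EX-003", "ic50": 250.0, "unit": "nM", "assay_type": "enzymatic"},
]

enzymatic = filter_by_assay(rows, "enzymatic")
for row in enzymatic:
    row["pic50"] = to_pic50(row["ic50"], row["unit"])
    print(row["compound_id"], round(row["pic50"], 2))
```

The same conversion applies to EC50 values (giving pEC50). Whether IC50 and EC50 measurements should be pooled in one model is a separate modelling decision that this sketch does not address.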
C) Constraints: The main constraint is the intrinsic variability of assay conditions across data sources. Although assay details were documented where possible, activity values should be carefully normalised or filtered before they are used in quantitative regression analyses.
________________________________________
4. Conclusions
The meticulously organised HIV-RT NNRTI database serves as an essential, curated resource for those engaged in drug discovery efforts. This resource will aid in the development of reliable computational pipelines, enabling the rapid identification and optimisation of future anti-HIV therapeutics.
Created: 2025-12-17



