A Curated, Standardised Database of Non-Nucleoside Reverse Transcriptase Inhibitors (NNRTIs) for QSAR and Machine Learning in HIV Drug Discovery.
Source: DataCite Commons. Updated 2025-12-17; indexed 2026-04-25.
Download link:
https://figshare.com/articles/dataset/A_Curated_Standardised_Database_of_Non-Nucleoside_Reverse_Transcriptase_Inhibitors_NNRTIs_for_QSAR_and_Machine_Learning_in_HIV_Drug_Discovery_/30901799
Resource description:
A Systematic Compilation of Data (Newcastle University, School of Pharmacy, UK)
Author: Shreya Goswami

Abstract
Non-nucleoside reverse transcriptase inhibitors (NNRTIs) are vital components of combination therapy for HIV-1. Nevertheless, the development of robust Quantitative Structure-Activity Relationship (QSAR) and machine learning (ML) models in this domain is often hindered by the lack of centralised, standardised biological activity datasets. This study addresses that gap by systematically compiling and curating a comprehensive database of NNRTIs in a reproducible, standardised form. Data were curated from primary literature and from chemical databases such as PubChem and BindingDB, and include chemical structures, SMILES codes, and quantitative biological activity values (IC50 and EC50). Structures were standardised using ChemDraw, and the final dataset contains 249 distinct NNRTI compounds (234 general inhibitors and 15 patented inhibitors). The collection is a high-quality, foundational resource intended to enhance the reliability of virtual screening, strengthen Structure-Activity Relationship (SAR) studies, and aid in developing new computational models for anti-HIV drug discovery.

1. Introduction
There is still a strong need for new anti-HIV drugs, especially ones that work against resistant strains. Meeting this need depends heavily on effective rational design methods. NNRTIs work by binding to the viral reverse transcriptase (RT), thereby preventing viral replication. The efficacy of computer-aided drug design (CADD) methodologies, including QSAR and predictive machine learning algorithms, depends fundamentally on the quality and structure of the input data.
Current public and literature-based NNRTI activity data are often fragmented, with inconsistent structural formats and non-standardised activity units, which complicates their integration into robust computational pipelines. The goal of this project was to create a high-quality NNRTI database that researchers can easily use, making computational research more open and reproducible and thereby accelerating the search for, and improvement of, anti-HIV drug candidates.

2. Materials and Methods
A) Data Sourcing and Curation Protocol
The dataset was created by systematically reviewing primary literature and well-known chemical and biological databases, such as PubChem and the Binding Database (BindingDB).
• Inclusion Criteria: Compounds were included if they had demonstrated NNRTI activity.
• Data Fields: The compiled fields include a unique compound identifier, the source reference paper, the molecular structure, the machine-readable SMILES code, and quantitative biological activity (IC50 and EC50), along with a description of the assay type.
B) Data Standardisation
To ensure uniformity and compatibility with computational software, all data were standardised as follows:
1. Chemical Structure Generation: Molecular structures were drawn and standardised using ChemDraw software.
2. SMILES Encoding: Simplified Molecular-Input Line-Entry System (SMILES) codes were generated to provide the standardised, machine-readable format required for molecular descriptor calculation in QSAR/ML workflows.
3. Organisation: The resulting data were organised in a Microsoft Excel spreadsheet to create a searchable and manageable resource.

3. Results and Discussion
A) Content and Features of the Database
The curation process yielded a database of 249 distinct NNRTI compounds.
The dataset is divided into two parts:
• General NNRTI Library: 234 compounds that can be used to train and test predictive models on a large scale. For specific inhibitors, RNase H and integrase strand transfer IC50 values are also noted in the comments.
• Patented Agents Subset: 15 compounds taken from patent information, which can be used to study intellectual property and commercially valuable scaffolds.
B) Applications in Computational Drug Discovery
This compiled dataset represents a step forward in facilitating CADD for HIV therapeutics.
• Enhanced Reproducibility: By providing standardised chemical identifiers (SMILES) and curated activity data, the resource directly supports computational reproducibility in the molecular docking and virtual screening community.
• QSAR and ML Modelling: The diversity of the compounds and the completeness of the activity data make the dataset valuable for SAR/QSAR modelling and for training machine learning models.
C) Constraints
The main constraint is the intrinsic variability of assay conditions across data sources. Although assay details were documented where possible, careful data normalisation or filtering is essential in quantitative regression analyses.

4. Conclusions
The HIV-RT NNRTI database serves as a curated resource for those engaged in drug discovery efforts. It is intended to aid in the development of reliable computational pipelines, enabling the rapid identification and optimisation of future anti-HIV therapeutics.
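
The normalisation and filtering recommended under Constraints could be approached along the lines of the minimal Python sketch below. It converts activity values to a common molar scale, takes the negative base-10 logarithm (pIC50 or pEC50), and keeps only records that share one endpoint and one assay type. The record keys (`id`, `endpoint`, `value`, `unit`, `assay_type`), the unit table, and both function names are hypothetical and are not taken from the dataset's actual spreadsheet columns; the sample rows are placeholders, not dataset values.

```python
import math

# Hypothetical unit table: factors converting a concentration to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_p_activity(value, unit):
    """Return -log10 of an IC50/EC50 expressed in mol/L (pIC50 or pEC50)."""
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"unrecognised unit: {unit!r}")
    if value <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


def select_comparable(records, endpoint, assay_type):
    """Keep records sharing one endpoint and assay type; add 'p_activity'."""
    kept = []
    for rec in records:
        if rec["endpoint"] != endpoint or rec["assay_type"] != assay_type:
            continue  # do not mix IC50 with EC50, or different assay formats
        p_value = to_p_activity(rec["value"], rec["unit"])
        kept.append({**rec, "p_activity": p_value})
    return kept


# Placeholder rows for illustration only; these are NOT values from the dataset.
rows = [
    {"id": "example-1", "endpoint": "IC50", "value": 10.0,
     "unit": "nM", "assay_type": "enzymatic"},
    {"id": "example-2", "endpoint": "EC50", "value": 2.0,
     "unit": "uM", "assay_type": "cell-based"},
]

for rec in select_comparable(rows, "IC50", "enzymatic"):
    print(rec["id"], round(rec["p_activity"], 2))
```

In this sketch a 10 nM IC50 maps to a pIC50 of about 8, and the EC50 row is dropped rather than pooled with the IC50 row. Whether to filter by assay type or to model it as a covariate is a design choice left to the user.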
Provider:
figshare
Created:
2025-12-17



