Well-curated QSAR datasets for diverse protein targets

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://figshare.com/articles/dataset/Well-curated_QSAR_datasets_for_diverse_protein_targets/20539893

Description:

High-throughput screening (HTS) uses automated equipment to rapidly screen thousands to millions of molecules for a biological activity of interest in the early drug discovery process. However, this brute-force approach has low hit rates, typically around 0.05%-0.5%. PubChem, a database supported by the National Institutes of Health (NIH), contains biological activities for millions of drug-like molecules, often from HTS experiments. The raw primary screening data in PubChem, however, have a high false positive rate, and a series of secondary experimental screens on putative actives is used to remove these false positives. While all relevant screens are linked, the datasets of molecules are often not curated to list all inactive molecules from the primary HTS and only confirmed actives after secondary screening. We therefore identified nine high-quality HTS experiments in PubChem covering all important target protein classes for drug discovery, and carefully curated these datasets into lists of inactive and confirmed active molecules. Each dataset is specified by its PubChem Accession Identifier. Preprocessing of the original data includes converting the input SMILES strings to 3D Structure-Data Files (SDFs), generating 3D conformations, and filtering. Conversion from SMILES to SDF is done with Open Babel, version 2.4.1. Conformations are generated with Corina, version 4.3. Molecules are further filtered for validity and duplicates with the BioChemical Library (BCL).
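The labeling logic described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `curate_labels` and `hit_rate` and the toy compound IDs are hypothetical, and the sketch assumes that molecules flagged in the primary screen but not confirmed by secondary screens are left out of the curated set rather than relabeled.

```python
from typing import Dict, Iterable, Set


def curate_labels(
    primary_tested: Iterable[int],
    primary_actives: Iterable[int],
    confirmed_actives: Iterable[int],
) -> Dict[int, int]:
    """Map compound ID -> 1 (confirmed active) or 0 (primary inactive).

    Assumption: compounds that were active in the primary screen but not
    confirmed afterwards are putative false positives and are omitted.
    """
    tested: Set[int] = set(primary_tested)
    # Only compounds that were actually tested can be primary hits.
    primary_hits: Set[int] = set(primary_actives) & tested
    # A confirmed active must also have been a primary hit.
    confirmed: Set[int] = set(confirmed_actives) & primary_hits
    inactives = tested - primary_hits

    labels = {cid: 0 for cid in inactives}
    labels.update({cid: 1 for cid in confirmed})
    return labels


def hit_rate(labels: Dict[int, int]) -> float:
    """Fraction of curated compounds labeled as confirmed actives."""
    if not labels:
        return 0.0
    return sum(labels.values()) / len(labels)


# Toy compound IDs, made up for illustration only.
toy_labels = curate_labels(
    primary_tested=range(1, 11),
    primary_actives=[2, 5, 9],
    confirmed_actives=[5],
)
```

In this toy example, compounds 2 and 9 are dropped as unconfirmed primary hits, compound 5 is kept as the only confirmed active, and the remaining seven compounds are kept as inactives. Real HTS sets have far lower hit rates than this toy one, in line with the 0.05%-0.5% range quoted above.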

Created:

2022-08-22
