QSAR datasets - Meta-QSAR

Name: QSAR datasets - Meta-QSAR
Creator: Mendeley
Published: 2025-05-01 06:26:05
License: Not specified

DataCite Commons; updated 2025-05-01; indexed 2025-05-17

Download link:

https://data.mendeley.com/datasets/spwgrcnjdg

Resource description:

We extracted 2,219 protein targets from ChEMBL, each with between 30 and about 6,000 drug-like chemical compounds. Each target yields one dataset with as many examples as compounds. The datasets were originally used in (Olier et al. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery. Machine Learning, 2018, 107 (1), 285-311). Chemical compounds were intrinsically described using a standard fingerprint representation (the most commonly used representation in QSAR learning), in which the presence or absence of a particular molecular substructure in a molecule (e.g. methyl group, benzene ring) is indicated by a Boolean variable. Specifically, we used RDKit to calculate the 1,024-bit FCFP4 fingerprint representation, which is one of the extended-connectivity fingerprints (Rogers and Hahn, 2010) for molecular characterisation. Each dataset consists of 1,024 binary input variables, one for each fingerprint bit, and one floating-point output variable representing the activity of the chemical compound against the target. We used IC50 values, that is, inhibitory drug concentrations at 50%. The IC50 value states the concentration of the drug compound that is required to block or inhibit 50% of the proteins. This response data has been normalised by taking the negative log of the drug concentration that inhibited 50% of a target (pXC50).
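The fingerprint step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the description names RDKit and a 1,024-bit FCFP4 representation but does not give the exact call or parameters. The sketch assumes RDKit's feature-based Morgan fingerprint with radius 2, which is commonly treated as RDKit's FCFP4-like analogue. The function name `fcfp4_like_bits` and the SMILES input format are assumptions made for this example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def fcfp4_like_bits(smiles: str, n_bits: int = 1024) -> list[int]:
    """Return an FCFP4-like fingerprint as a list of 0/1 values.

    Hypothetical helper: radius 2 with feature invariants is an assumed
    stand-in for the FCFP4 settings used to build the datasets.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # Radius 2 corresponds to a diameter of 4 bonds (the "4" in FCFP4);
    # useFeatures=True switches from atom-type to pharmacophoric-feature
    # invariants (the "F" in FCFP).
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, 2, nBits=n_bits, useFeatures=True
    )
    return [int(fp.GetBit(i)) for i in range(n_bits)]


# One input row for a dataset: 1,024 binary variables for phenol.
phenol_bits = fcfp4_like_bits("c1ccccc1O")
print(len(phenol_bits), sum(phenol_bits))
```

Each compound then contributes one row of 1,024 binary inputs, to which the activity value is attached as the single floating-point output. Recent RDKit releases also offer a generator-based fingerprint API and may emit a deprecation warning for the call used here.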

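The normalisation of the response variable can be illustrated with a short sketch. The description only says "negative log", so the base-10 logarithm and the molar basis used here are assumptions following the usual pIC50 convention. The nanomolar input unit is also an assumption, and both function names are hypothetical.

```python
import math


def ic50_nm_to_pxc50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pXC50 = -log10(IC50 in mol/L).

    Assumes nanomolar input and a base-10 log on a molar scale.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return -math.log10(ic50_nm * 1e-9)


def pxc50_to_ic50_nm(pxc50: float) -> float:
    """Inverse transform: recover the IC50 in nanomolar from a pXC50 value."""
    return 10.0 ** (-pxc50) * 1e9


# Under these assumptions, more potent compounds (lower IC50) get higher pXC50.
for ic50 in (10.0, 1000.0, 100000.0):
    print(ic50, round(ic50_nm_to_pxc50(ic50), 3))
```

On this scale a tenfold drop in IC50 raises pXC50 by one unit, which compresses a concentration range spanning several orders of magnitude into a regression target of modest range.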

Provider:

Mendeley

Created:

2020-10-30

Collection summary

Dataset introduction