QSAR datasets - Meta-QSAR
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/spwgrcnjdg
Description:
We extracted 2,219 protein targets from ChEMBL, each with between 30 and about 6,000 associated drug-like chemical compounds. Each target yields one dataset with as many examples as compounds. The datasets were originally used in Olier et al., "Meta-QSAR: a large-scale application of meta-learning to drug design and discovery", Machine Learning, 2018, 107(1), 285-311. Chemical compounds were intrinsically described using a standard fingerprint representation (the most commonly used in QSAR learning), in which the presence or absence of a particular molecular substructure (e.g. a methyl group or a benzene ring) is indicated by a Boolean variable. Specifically, we used RDKit to calculate the 1,024-bit FCFP4 fingerprint, one of the extended-connectivity fingerprints (Rogers and Hahn, 2010) for molecular characterisation. Each dataset consists of 1,024 binary input variables, one for each fingerprint bit, and one floating-point output variable representing the activity of the chemical compound against the target. Activity is measured by IC50, the concentration of the drug compound required to block or inhibit 50% of the proteins. The response values were normalised by taking the negative logarithm of the IC50 concentration (pXC50).
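
The featurisation and normalisation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the IC50 input is assumed to be in nanomolar, the logarithm is assumed to be base 10 on the molar concentration, and RDKit's feature-based Morgan fingerprint with radius 2 is used as the usual FCFP4-like equivalent.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to -log10(IC50 in molar).

    Assumptions (not stated in the dataset description): the input
    is in nM and the logarithm is base 10.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 M, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


def fcfp4_like_bits(smiles: str, n_bits: int = 1024) -> list[int]:
    """Return a feature-based Morgan fingerprint (radius 2) as 0/1 ints.

    Requires RDKit, imported lazily so the helper above works without it.
    """
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # useFeatures=True switches from ECFP-style to FCFP-style atom invariants
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, 2, nBits=n_bits, useFeatures=True
    )
    return [int(bit) for bit in fp.ToBitString()]
```

Under these assumptions an IC50 of 1,000 nM (1 µM) maps to a value of 6 and an IC50 of 1 nM maps to 9, so more potent compounds receive larger response values. One row of a dataset would then pair the 1,024 bits from `fcfp4_like_bits` with a single such response value.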
Created:
2020-10-30



