
QSAR datasets - Meta-QSAR

Source: DataCite Commons. Updated: 2025-05-01. Indexed: 2025-05-17.
Download link:
https://data.mendeley.com/datasets/spwgrcnjdg
Description:
We extracted 2,219 protein targets from ChEMBL, each with between 30 and about 6,000 drug-like chemical compounds; each target yields one dataset with as many examples as compounds. The datasets were originally used in Olier et al., Meta-QSAR: a large-scale application of meta-learning to drug design and discovery, Machine Learning, 2018, 107 (1), 285-311. Chemical compounds were intrinsically described using a standard fingerprint representation (the most commonly used in QSAR learning), in which the presence or absence of a particular molecular substructure in a molecule (e.g. methyl group, benzene ring) is indicated by a Boolean variable. Specifically, we used RDKit to calculate the 1024-bit FCFP4 fingerprint, which is one of the extended-connectivity fingerprints (Rogers and Hahn, 2010) for molecular characterisation. Each dataset consists of 1,024 binary input variables, one for each fingerprint bit, and one floating-point output variable representing the compound's activity against the target. Activity is based on IC50 values: the IC50 is the concentration of the drug compound required to block or inhibit 50% of the proteins. This response data has been normalised by taking the negative log of the IC50 concentration (pXC50).
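The normalisation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a base-10 logarithm and a concentration expressed in mol/L (the usual convention for pIC50-style values), and it assumes the raw IC50 arrives in nanomolar. The description states neither the log base nor the units, and the function name ic50_to_pxc50 is hypothetical.

```python
import math


def ic50_to_pxc50(ic50_nm: float) -> float:
    """Return -log10 of an IC50 value after converting it from nM to mol/L.

    Assumptions (not stated in the dataset description): the input is in
    nanomolar and the logarithm is base 10.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    molar = ic50_nm * 1e-9  # nanomolar -> mol/L
    return -math.log10(molar)


if __name__ == "__main__":
    # A more potent compound (lower IC50) gets a higher pXC50.
    for ic50_nm in (1.0, 100.0, 10_000.0):
        print(f"IC50 = {ic50_nm:g} nM -> pXC50 = {ic50_to_pxc50(ic50_nm):.2f}")
```

Under these assumptions an IC50 of 1 nM maps to a pXC50 of about 9 and an IC50 of 10 µM to about 5, so the transformed output variable grows with potency and compresses concentrations spanning several orders of magnitude into a narrow numeric range.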

Provider:
Mendeley
Created:
2020-10-30
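Background and challenges
Background overview
This dataset collection covers drug-like compounds for 2,219 protein targets. Each compound's molecular structure is represented by a 1024-bit FCFP4 fingerprint, and its activity is measured by IC50 values. The data are suited to quantitative structure-activity relationship (QSAR) research in drug design and discovery.
The summary above was collected and generated by 遇见数据集.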