TB - 3 combined datasets (TB MLSMR, CB2 and Kinase)

NIAID Data Ecosystem, indexed 2026-03-08

Download link:

https://figshare.com/articles/dataset/TB_3_combined_datasets_TB_MLSMR_CB2_and_Kinase_/880644

Description:

Combined dataset from the TB paper:

Ekins S, Freundlich JS, Reynolds RC. Fusing Dual-Event Data Sets for Mycobacterium tuberculosis Machine Learning Models and Their Evaluation. J Chem Inf Model. 2013 Nov 25;53(11):3054-63. doi: 10.1021/ci400480s. Epub 2013 Oct 30.

Abstract: The search for new tuberculosis treatments continues as we need to find molecules that can act more quickly, be accommodated in multidrug regimens, and overcome ever increasing levels of drug resistance. Multiple large-scale phenotypic high-throughput screens against Mycobacterium tuberculosis (Mtb) have generated dose-response data, enabling the generation of machine learning models. These models also incorporated cytotoxicity data and were recently validated with a large external data set. A cheminformatics data-fusion approach followed by Bayesian machine learning, Support Vector Machine, or Recursive Partitioning model development (based on publicly available Mtb screening data) was used to compare individual data sets and subsequent combined models. A set of 1924 commercially available molecules with promising antitubercular activity (and lack of relative cytotoxicity to Vero cells) was used to evaluate the predictive nature of the models. We demonstrate that combining three data sets incorporating antitubercular and cytotoxicity data in Vero cells from our previous screens results in an external validation receiver operating characteristic (ROC) score of 0.83 (Bayesian or RP Forest). Models that do not have the highest 5-fold cross-validation ROC scores can outperform other models in a test-set-dependent manner. We demonstrate with predictions for a recently published set of Mtb leads from GlaxoSmithKline that no single machine learning model may be enough to identify compounds of interest. Data set fusion represents a further useful strategy for machine learning construction as illustrated with Mtb. Coverage of chemistry and Mtb target spaces may also be limiting factors for the whole-cell screening data generated to date.
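
The two ideas the abstract leans on, fusing several labelled screening sets and scoring predictions with a ROC value, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout (compound identifier plus a 0/1 "active and non-cytotoxic" label), the rule for resolving conflicting labels, and the names fuse_datasets and roc_auc are all assumptions made for illustration, and the toy data is made up.

```python
from typing import Dict, Iterable, Sequence, Tuple

# Hypothetical record layout: (compound identifier, label),
# where 1 = active and non-cytotoxic, 0 = otherwise.
Record = Tuple[str, int]


def fuse_datasets(datasets: Iterable[Sequence[Record]]) -> Dict[str, int]:
    """Merge several labelled sets, keeping one label per compound.

    Assumption: if a compound appears with conflicting labels, it is
    kept as active when any set marks it active.
    """
    fused: Dict[str, int] = {}
    for dataset in datasets:
        for compound_id, label in dataset:
            fused[compound_id] = max(fused.get(compound_id, 0), label)
    return fused


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy stand-ins for the three sets named in the title (made-up records).
tb_mlsmr = [("cmpd-1", 1), ("cmpd-2", 0)]
cb2 = [("cmpd-2", 0), ("cmpd-3", 1)]
kinase = [("cmpd-3", 0), ("cmpd-4", 0)]
fused = fuse_datasets([tb_mlsmr, cb2, kinase])
```

The pairwise form of the ROC calculation is quadratic in the number of compounds, which is fine for a small test set; a rank-based implementation would be the usual choice for screening-scale data.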

Created:

2013-12-18
