High performance, large chemical coverage or both: DanishQSAR and hierarchies of post-hoc ensemble models optimized for sensitivity, specificity or balanced accuracy

Name: High performance, large chemical coverage or both: DanishQSAR and hierarchies of post-hoc ensemble models optimized for sensitivity, specificity or balanced accuracy
Creator: Taylor & Francis
Published: 2025-06-17 08:40:25
License: No description available

Source: DataCite Commons. Updated: 2025-06-17. Indexed: 2025-09-08.

Download link:

https://tandf.figshare.com/articles/dataset/High_performance_large_chemical_coverage_or_both_DanishQSAR_and_hierarchies_of_post-hoc_ensemble_models_optimized_for_sensitivity_specificity_or_balanced_accuracy/29234502/1

Description:

The trade-off between applicability domain size and prediction accuracy is a well-known phenomenon in QSAR. We have developed a modelling approach where multiple models with different applicability domain sizes and with different prediction accuracy are selected instead of a single best model. This approach is implemented in DanishQSAR, a new software for binary classification QSAR modelling, integrating descriptor calculation, descriptor selection, model development, validation and application. The various methods and options available in the software are automatically tested and efficiently combined during model development using a version of cross-validation-based grid search and post-hoc ensemble modelling. The resulting large and diverse pool of model candidates is then analysed to generate three hierarchies of models, optimized for sensitivity, specificity or balanced accuracy, respectively, for minimum to maximum coverage levels. When predicting a query compound, the system provides predictions from all models in the three hierarchies, at all coverage levels with user-defined steps, together with the individual model predictivity performances, producing a prediction profile rather than one prediction from a single model. Twenty data sets from the Danish (Q)SAR Database (https://qsar.food.dtu.dk) are used to demonstrate the performance. The developed binary classification models are highly accurate by cross- and external validation.
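The idea of picking one model per coverage level, rather than a single best model, can be illustrated with a minimal sketch. It is not the DanishQSAR implementation or its API: the `Candidate` class, the `build_hierarchy` function, the selection rule (best metric among candidates whose coverage meets the level) and all numbers are hypothetical and chosen only to show how three hierarchies over a shared candidate pool might be derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one candidate model from the pool."""
    name: str
    coverage: float      # fraction of compounds inside the applicability domain
    sensitivity: float   # cross-validated TP / (TP + FN)
    specificity: float   # cross-validated TN / (TN + FP)

    @property
    def balanced_accuracy(self) -> float:
        # Balanced accuracy is the mean of sensitivity and specificity.
        return (self.sensitivity + self.specificity) / 2


def build_hierarchy(pool, metric, step=0.25):
    """Map each coverage level to the best candidate reaching that level.

    Assumed rule (not taken from the source): at each level, keep the
    candidate with the highest `metric` among those whose coverage is at
    least the level. Levels that no candidate reaches are omitted.
    """
    hierarchy = {}
    n_steps = round(1 / step)
    for i in range(1, n_steps + 1):
        level = round(i * step, 10)
        eligible = [c for c in pool if c.coverage >= level]
        if eligible:
            hierarchy[level] = max(eligible, key=lambda c: getattr(c, metric))
    return hierarchy


# Made-up candidate pool illustrating the coverage/accuracy trade-off.
pool = [
    Candidate("m_narrow", coverage=0.35, sensitivity=0.95, specificity=0.93),
    Candidate("m_mid", coverage=0.60, sensitivity=0.88, specificity=0.90),
    Candidate("m_sens", coverage=0.70, sensitivity=0.92, specificity=0.70),
    Candidate("m_wide", coverage=0.95, sensitivity=0.80, specificity=0.78),
]

# One hierarchy per optimization target, as described in the abstract.
hierarchies = {
    metric: build_hierarchy(pool, metric, step=0.25)
    for metric in ("sensitivity", "specificity", "balanced_accuracy")
}

if __name__ == "__main__":
    for metric, hierarchy in hierarchies.items():
        for level, model in hierarchy.items():
            print(f"{metric:>17} | coverage >= {level:.2f} | {model.name}")
```

In this toy pool the narrow model wins at the lowest coverage level in all three hierarchies, while at higher levels the chosen model differs by target. Reporting every selected model's prediction together with its performance would give a prediction profile of the kind the description mentions. A real system would also need to decide, per query compound, which models' applicability domains contain it; that step is not sketched here.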

Provider:

Taylor & Francis

Created:

2025-06-04