Combining Structural Modeling with Ensemble Machine Learning to Accurately Predict Protein Fold Stability and Binding Affinity Effects upon Mutation

Indexed from NIAID Data Ecosystem on 2026-03-09

Download link:

https://figshare.com/articles/dataset/_Combining_Structural_Modeling_with_Ensemble_Machine_Learning_to_Accurately_Predict_Protein_Fold_Stability_and_Binding_Affinity_Effects_upon_Mutation_/1177728

Description:

Advances in sequencing have led to a rapid accumulation of mutations, some of which are associated with diseases. However, to draw mechanistic conclusions, a biochemical understanding of these mutations is necessary. For coding mutations, accurate prediction of significant changes in either the stability of proteins or their affinity to their binding partners is required. Traditional methods have used semi-empirical force fields, while newer methods employ machine learning of sequence and structural features. Here, we show how combining both of these approaches leads to a marked boost in accuracy. We introduce ELASPIC, a novel ensemble machine learning approach that is able to predict stability effects upon mutation in both domain cores and domain-domain interfaces. We combine semi-empirical energy terms, sequence conservation, and a wide variety of molecular details with a Stochastic Gradient Boosting of Decision Trees (SGB-DT) algorithm. The accuracy of our predictions surpasses existing methods by a considerable margin, achieving correlation coefficients of 0.77 for stability and 0.75 for affinity predictions. Notably, we integrated homology modeling to enable proteome-wide prediction and show that accurate prediction on modeled structures is possible. Lastly, ELASPIC showed significant differences between various types of disease-associated mutations, as well as between disease and common neutral mutations. Unlike pure sequence-based prediction methods that try to predict phenotypic effects of mutations, our predictions unravel the molecular details governing protein instability and help us better understand the molecular causes of diseases.
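To make the SGB-DT idea in the description concrete, here is a minimal sketch of stochastic gradient boosting for regression, scored with a Pearson correlation coefficient on held-out data. It is not ELASPIC's implementation: the three features (an energy term, a conservation score, a burial fraction), the synthetic target, the use of depth-1 trees (stumps), and every function name are assumptions made purely for illustration.

```python
import random
from statistics import mean


def fit_stump(X, residuals):
    """Fit a depth-1 regression tree: the single split minimising squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = mean(left), mean(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_sgb(X, y, n_rounds=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with squared-error loss.

    Each round fits a stump to the residuals of a random subsample
    (the "stochastic" part), then adds a shrunken copy to the ensemble.
    """
    rng = random.Random(seed)
    base = mean(y)
    pred = [base] * len(y)
    stumps = []
    k = max(2, int(subsample * len(y)))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(y)), k)
        stump = fit_stump([X[i] for i in idx], [y[i] - pred[i] for i in idx])
        stumps.append(stump)
        pred = [p + learning_rate * predict_stump(stump, row) for p, row in zip(pred, X)]
    return base, learning_rate, stumps


def predict(model, X):
    base, learning_rate, stumps = model
    return [base + learning_rate * sum(predict_stump(s, row) for s in stumps) for row in X]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


# Synthetic stand-in data (assumption): columns are a hypothetical energy term,
# conservation score, and burial fraction; the target mimics a stability change.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1), data_rng.random(), data_rng.random()] for _ in range(160)]
y = [1.5 * e + c * (b > 0.5) + data_rng.gauss(0, 0.2) for e, c, b in X]

X_train, y_train, X_test, y_test = X[:120], y[:120], X[120:], y[120:]
model = fit_sgb(X_train, y_train)
r_test = pearson(predict(model, X_test), y_test)
print(f"held-out Pearson r = {r_test:.2f}")
```

Because the model is additive over trees, heterogeneous inputs such as energy terms and conservation scores can be mixed without rescaling, and the per-round subsampling tends to reduce overfitting. A library implementation with deeper trees would normally replace the hand-written stumps used here.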

Created:

2016-01-15
