PHACE: Phylogeny-Aware Co-Evolution

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14038142

Description:

The co-evolution trends of amino acids within or between genes offer valuable insights into protein structure and function. Existing tools for uncovering co-evolutionary signals primarily rely on multiple sequence alignments (MSAs), often neglecting considerations of phylogenetic relatedness and shared evolutionary history. Here, we present a novel approach based on the substitution mapping of amino acid changes onto the phylogenetic tree. We categorize amino acids into two groups: 'tolerable' and 'intolerable,' and assign them to each position based on the position dynamics concerning the observed amino acids. Amino acids deemed 'tolerable' are those observed phylogenetically independently and multiple times at a specific position, signifying the position's tolerance to that alteration. Gaps are regarded as a third character type, and we only consider phylogenetically independent altered gap characters. Our algorithm is based on a tree traversal process through the nodes and computes the total amount of substitution per branch based on the probability differences of two groups of amino acids and gaps between neighboring nodes. We employ an MSA-masking approach to mitigate misleading artifacts from poorly aligned regions. When compared to tools utilizing phylogeny (e.g., CAPS and CoMap) and state-of-the-art MSA-based approaches (DCA, GaussDCA, PSICOV, and MIp), our method exhibits significantly superior accuracy in identifying co-evolving position pairs, as measured by statistical metrics including MCC, AUC, and F1 score. PHACE's success arises from its ability to consider the frequently neglected phylogenetic dependency.
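
The tree traversal described above can be illustrated with a minimal sketch. It is not the PHACE implementation: the `Node` class, the `branch_changes` function, the example probabilities, and the choice of half the summed absolute difference as the per-branch change are all assumptions made for illustration. The description only states that the change per branch is derived from probability differences of the two amino acid groups and gaps between neighboring nodes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Hypothetical state probabilities for ONE alignment position at this node:
    # (tolerable, intolerable, gap), assumed to sum to 1.
    probs: tuple[float, float, float]
    children: list[Node] = field(default_factory=list)


def branch_changes(root: Node) -> dict[tuple[str, str], float]:
    """Walk the tree and score every parent-to-child branch.

    Assumption: the change on a branch is half the summed absolute
    difference between the parent's and the child's state probabilities.
    """
    changes: dict[tuple[str, str], float] = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            diff = sum(abs(p - c) for p, c in zip(parent.probs, child.probs)) / 2
            changes[(parent.name, child.name)] = diff
            stack.append(child)
    return changes


# Toy tree: ((A, B) N1, C) root, with made-up probabilities for one position.
leaf_a = Node("A", (1.0, 0.0, 0.0))
leaf_b = Node("B", (0.0, 1.0, 0.0))
leaf_c = Node("C", (1.0, 0.0, 0.0))
inner = Node("N1", (0.5, 0.5, 0.0), [leaf_a, leaf_b])
root = Node("root", (0.75, 0.25, 0.0), [inner, leaf_c])

changes = branch_changes(root)
for branch, amount in sorted(changes.items()):
    print(branch, amount)
```

Running the same traversal for two positions gives two per-branch vectors over the same set of branches. How those vectors are compared to score a position pair is not specified in this description, so it is left out of the sketch.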

Created:

2024-11-06
