A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences

NIAID Data Ecosystem, indexed 2026-03-07

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.7h66k


Description:

Most models of nucleotide or amino acid substitution used in phylogenetic studies assume that the evolutionary process has been homogeneous across lineages and that the composition of nucleotides or amino acids has remained the same throughout the tree. These oversimplified assumptions are refuted by the observation that compositional variability characterizes extant biological sequences. Branch-heterogeneous models of protein evolution that account for compositional variability have been developed, but they are not yet in common use because of the large number of parameters they require, which leads to high computational costs and potential overparameterization. Here, we present a new branch-nonhomogeneous and nonstationary model of protein evolution that more accurately captures the high complexity of sequence evolution. This model, henceforth called Correspondence and likelihood analysis (COaLA), uses a correspondence analysis to reduce the number of parameters to be optimized through maximum likelihood, focusing on most of the compositional variation observed in the data. The model was thoroughly tested on both simulated and biological data sets to show its high performance in terms of data fitting and CPU time. COaLA efficiently estimates ancestral amino acid frequencies and sequences, making it relevant for studies that aim to reconstruct and resurrect ancestral amino acid sequences. Finally, we applied COaLA to a concatenation of universal amino acid sequences to confirm previous results obtained with a nonhomogeneous Bayesian model regarding the early pattern of adaptation to optimal growth temperature, supporting the mesophilic nature of the Last Universal Common Ancestor.
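The parameter-reduction idea in the description can be illustrated with a small sketch. This is not the authors' implementation: it assumes that a correspondence analysis is run on a table of amino acid counts per sequence, and that a branch's equilibrium frequencies are then described by its coordinates on a few leading axes instead of by 19 free frequencies. The function names `correspondence_analysis` and `frequencies_from_coords`, and the random count table, are hypothetical and exist only for illustration.

```python
import numpy as np

def correspondence_analysis(counts, n_axes):
    """Correspondence analysis of a (sequences x amino acids) count table.

    Returns the column masses, the principal row coordinates on the first
    n_axes axes, and the matching right singular vectors.
    """
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()          # correspondence matrix
    r = P.sum(axis=1)                  # row masses (one per sequence)
    c = P.sum(axis=0)                  # column masses (mean composition)
    # Standardized residuals from the independence model r * c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    U, sv, Vt = U[:, :n_axes], sv[:n_axes], Vt[:n_axes]
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    return c, row_coords, Vt

def frequencies_from_coords(coords, c, Vt):
    """Map axis coordinates back to amino acid frequency vectors.

    With all non-trivial axes this reproduces the observed row profiles;
    with fewer axes it gives a low-dimensional approximation. Arbitrary
    coordinates can yield negative entries, so an optimizer using this
    mapping would need to constrain them (an assumption of this sketch).
    """
    return c + (coords @ Vt) * np.sqrt(c)

# Hypothetical data: 6 sequences, 20 amino acid counts each.
rng = np.random.default_rng(0)
counts = rng.integers(5, 50, size=(6, 20))

# Two coordinates per sequence stand in for 19 free frequencies.
c, coords, Vt = correspondence_analysis(counts, n_axes=2)
approx_freqs = frequencies_from_coords(coords, c, Vt)
```

In this sketch each frequency vector still sums to one, because the retained axes are orthogonal to the square root of the column masses. How many axes to keep, and how the coordinates are optimized by maximum likelihood along the tree, is left open here.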

Created:

2013-03-04
