Data from: Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation

Mendeley Data. Updated 2024-06-25; indexed 2024-06-27

Download link:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.gv1q5

Description:

Proteins have distinct structural and functional constraints at different sites that lead to site-specific preferences for particular amino acid residues as the sequences evolve. Heterogeneity in the amino acid substitution process between sites is not modeled by commonly used empirical amino acid exchange matrices. Such model misspecification can lead to artefacts in phylogenetic estimation such as long-branch attraction. Although sophisticated site-heterogeneous mixture models have been developed to address this problem in both Bayesian and maximum likelihood (ML) frameworks, their formidable computational time and memory usage severely limit their use in large phylogenomic analyses. Here we propose a posterior mean site frequency (PMSF) method as a rapid and efficient approximation to full empirical profile mixture models for ML analysis. The PMSF approach assigns to each site a conditional mean amino acid frequency profile, calculated from a mixture model fitted to the data using a preliminary guide tree. These PMSF profiles can then be used for in-depth tree-searching in place of the full mixture model. Compared with widely used empirical mixture models with k classes, our implementation of PMSF in IQ-TREE (http://www.iqtree.org) speeds up the computation by approximately k/1.5-fold and requires a small fraction of the RAM. Furthermore, this speedup allows, for the first time, full nonparametric bootstrap analyses to be conducted under complex site-heterogeneous models on large concatenated data matrices. Our simulations and empirical data analyses demonstrate that PMSF can effectively ameliorate long-branch attraction artefacts. In some empirical and simulation settings PMSF provided more accurate estimates of phylogenies than the mixture models from which they derive.
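
The "conditional mean amino acid frequency profile" described above can be illustrated with a minimal sketch. It assumes that the per-site log-likelihoods under each mixture class have already been computed on the guide tree, and it takes the site profile to be the class profiles averaged by the posterior class probabilities. The function name, argument names, and the toy three-state alphabet are hypothetical and chosen for illustration only; this is not the IQ-TREE implementation.

```python
import math


def pmsf_profile(site_log_liks, class_weights, class_profiles):
    """Posterior mean frequency profile for a single alignment site.

    site_log_liks[k]  : log-likelihood of the site under mixture class k
                        (assumed to be precomputed on the guide tree)
    class_weights[k]  : mixture weight of class k
    class_profiles[k] : frequency vector of class k (sums to 1)
    """
    # Log of (weight * likelihood) for each class.
    log_joint = [math.log(w) + ll for w, ll in zip(class_weights, site_log_liks)]
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_joint)
    unnorm = [math.exp(x - m) for x in log_joint]
    total = sum(unnorm)
    posterior = [u / total for u in unnorm]
    # Average the class profiles, weighted by posterior class probabilities.
    n_states = len(class_profiles[0])
    return [
        sum(p * prof[a] for p, prof in zip(posterior, class_profiles))
        for a in range(n_states)
    ]


# Toy example (hypothetical numbers): two classes over a 3-state alphabet.
weights = [0.5, 0.5]
log_liks = [math.log(0.03), math.log(0.01)]  # class 0 fits this site better
profiles = [[0.8, 0.1, 0.1], [0.2, 0.4, 0.4]]
profile = pmsf_profile(log_liks, weights, profiles)
print(profile)
```

In this toy case the posterior class probabilities are 0.75 and 0.25, so the site profile lies closer to the first class profile. Because each site then carries a single fixed profile rather than k classes, the tree search no longer evaluates every class at every site, which is consistent with the approximately k/1.5-fold speedup reported above (about 40-fold for a 60-class mixture).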

Created:

2023-06-28