Data from: Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages

Source: DataONE. Updated 2014-06-11; indexed 2024-06-27

Download link:

https://search.dataone.org/view/null

Description:

Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the underlying evolutionary process was globally stationary, reversible and homogeneous (SRH), and that a model of evolution with one or more site-specific and time-reversible rate matrices (e.g., the GTR rate matrix) is enough to accurately model the evolution of data over the whole tree. However, an increasing body of data suggests that evolution under these conditions is an exception, rather than the norm. To address this issue, several non-SRH models of molecular evolution have been proposed, but they either ignore heterogeneity in the substitution process across sites (HAS) or assume it can be modelled accurately using the Γ distribution. As an alternative to these models of evolution, we introduce a family of mixture models that approximate HAS without the assumption of an underlying predefined statistical distribution. This family of mixture models is combined with non-SRH models of evolution that account for heterogeneity in the substitution process across lineages (HAL). We also present two algorithms for searching model space and identifying an optimal model of evolution that is less likely to over- or under-parameterize the data. The performance of the two new algorithms was evaluated using alignments of nucleotides with 10,000 sites simulated under complex non-SRH conditions on a 25-tipped tree. The algorithms were found to be very successful, identifying the correct HAL model with a 75% success rate (the average success rate for assigning rate matrices to the tree's 48 edges was 99.25%) and, for the correct HAL model, identifying the correct HAS model with a 98% success rate. Finally, parameter estimates obtained under the correct HAL-HAS model were found to be accurate and precise. 
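As a rough illustration of the mixture idea described above, the sketch below computes an alignment log-likelihood as a sum over sites of the log of a weighted sum of per-class site likelihoods. It is a generic finite-mixture form, not necessarily the parameterization used by the authors; the function name, the three-class layout, and all numbers are hypothetical toy values chosen for illustration only.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a finite mixture over site classes.

    site_likelihoods[i][k] is the likelihood of site i under class k
    (hypothetical inputs; in practice each would come from a tree-based
    likelihood calculation). weights[k] is the proportion of class k.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    total = 0.0
    for per_class in site_likelihoods:
        # Weighted sum over classes gives the marginal likelihood of the site.
        site_likelihood = sum(w * l for w, l in zip(weights, per_class))
        total += math.log(site_likelihood)
    return total


# Toy example: three classes (invariable, V1, V2) and two sites.
weights = [0.5, 0.4, 0.1]
site_likelihoods = [
    [0.25, 0.10, 0.05],  # constant site: the invariable class can explain it
    [0.00, 0.02, 0.04],  # variable site: the invariable class contributes 0
]
log_l = mixture_log_likelihood(site_likelihoods, weights)
```

Because the class weights are free parameters rather than values derived from a Γ distribution, a mixture of this kind does not presuppose any particular shape for the rate heterogeneity across sites.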
The merits of our new algorithms were illustrated with an analysis of 42,337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear genomes of Saccharomyces cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. castellii, S. kluyveri, S. bayanus, and Candida albicans. Our results show that second codon sites in the ancestral genome of these species contained 49.1% invariable sites, 39.6% variable sites belonging to one rate category (V1), and 11.3% variable sites belonging to a second rate category (V2). The ancestral nucleotide content was found to differ markedly across these 3 sets of sites, and the evolutionary processes operating at the variable sites were found to be non-SRH and best modelled by a combination of 8 edge-specific rate matrices (4 for V1 and 4 for V2). The number of substitutions per site at the variable sites also differed markedly, with sites belonging to V1 evolving slower than those belonging to V2 along the lineages separating the 7 species of Saccharomyces. Finally, sites belonging to V1 appeared to have ceased evolving along the lineages separating S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, implying that they might have become so selectively constrained that they could be considered invariable sites in these species.
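The reported proportions can be sanity-checked with a few lines of arithmetic. The sketch below converts the three percentages into the approximate number of sites each class would account for among the 42,337 second codon sites. These counts are expectations implied by the stated proportions, not figures reported in the description, and the variable names are illustrative.

```python
TOTAL_SITES = 42_337  # second codon sites analysed, as stated above

# Proportions of the three site classes, as stated above.
proportions = {"invariable": 0.491, "V1": 0.396, "V2": 0.113}

# The three classes should partition the sites completely.
assert abs(sum(proportions.values()) - 1.0) < 1e-9

# Expected number of sites per class implied by the proportions
# (an approximation, since each site belongs to a class only probabilistically).
expected_sites = {name: p * TOTAL_SITES for name, p in proportions.items()}

# Edge-specific rate matrices: 4 for V1 plus 4 for V2.
rate_matrices = {"V1": 4, "V2": 4}
total_rate_matrices = sum(rate_matrices.values())

for name, count in expected_sites.items():
    print(f"{name}: about {count:.0f} sites")
```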

Created:

2014-06-11