Data from: Mitochondrial phylogenomics of early land plants: mitigating the effects of saturation, compositional heterogeneity, and codon-usage bias

DataONE, updated 2014-07-29, indexed 2024-06-27

Download link:

https://search.dataone.org/view/null

Description:

Phylogenetic analyses using concatenation of genomic-scale data have been seen as a panacea for resolving incongruences among inferences from few or single genes. However, phylogenomics may also suffer from systematic errors due to the perhaps cumulative effects of saturation, among-taxa compositional (GC content) heterogeneity, or codon-usage bias plaguing the individual nucleotide loci that are concatenated. Here we provide an example of how these factors affect inferences of the phylogeny of early land plants based on mitochondrial genomic data. Mitochondrial sequences evolve slowly in plants and hence are thought to be suitable for resolving deep relationships. We newly assembled mitochondrial genomes from 20 bryophytes and complemented these with 40 other streptophytes (land plants plus algal outgroups), compiling a data matrix of 60 taxa and 41 mitochondrial genes. Homogeneous analyses of the concatenated nucleotide data resolve mosses as the sister group to the remaining land plants. However, the corresponding translated amino acid data support the liverwort lineage in this position. Both results receive weak to moderate support in maximum likelihood analyses, but strong support in Bayesian inferences. Tests of alternative hypotheses using either nucleotide or amino acid data provide implicit support for the respective optimal topologies. By analyzing the nucleotide data, we found that the 3rd codon positions are more saturated than the 1st and 2nd codon positions, and that excluding them from the analyses leads to a topology congruent with that obtained using amino acid data. Further, we determined that land plant lineages differ in their nucleotide composition and in their usage of synonymous codon variants. Composition-heterogeneous Bayesian analyses employing a non-stationary model that accounts for among-lineage variation in composition, and inferences from degenerated nucleotide data that avoid the effects of the synonymous mutations underlying codon-usage bias, again recovered liverworts as sister to the remaining land plants. These analyses indicate that the discrepancy between the nucleotide-based and the amino acid-based trees is caused by lineage-specific, parallel compositional bias or synonymous mutations driving codon-usage bias, as well as by saturation in the 3rd codon positions. While genomic data may generate highly supported phylogenetic trees, these inferences may be artifacts. We suggest that phylogenomic analyses should assess the possible impact of potential biases by comparing protein-coding gene data with their amino acid translations, by analyzing data under models that account for compositional bias, and by excluding noisy nucleotide signals due to saturation or codon-usage bias. We caution against relying on any one presentation of the data (nucleotide or amino acid) or any one type of analysis, even when analyzing large-scale data sets and no matter how well supported the result, without fully exploring the effects of substitution models.
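
Two of the checks named in the abstract, excluding 3rd codon positions and comparing GC content among taxa, can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes in-frame, aligned coding sequences held as plain strings, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def strip_third_positions(seq: str) -> str:
    """Keep only the 1st and 2nd codon positions of an in-frame coding sequence."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(seq[i:i + 2] for i in range(0, len(seq), 3))


def gc_content(seq: str) -> float:
    """GC fraction over A/C/G/T only; gaps and ambiguity codes are ignored."""
    counts = Counter(seq.upper())
    gc = counts["G"] + counts["C"]
    total = gc + counts["A"] + counts["T"]
    return gc / total if total else float("nan")


def gc_by_codon_position(seq: str) -> tuple:
    """GC fraction computed separately for codon positions 1, 2 and 3."""
    return tuple(gc_content(seq[pos::3]) for pos in range(3))


if __name__ == "__main__":
    # Toy alignment with made-up taxon labels and sequences (illustration only).
    alignment = {
        "taxon_A": "ATGGCCTAA",
        "taxon_B": "ATGGCGTAG",
    }
    for name, seq in alignment.items():
        reduced = strip_third_positions(seq)
        gc1, gc2, gc3 = gc_by_codon_position(seq)
        print(name, reduced, f"GC1={gc1:.2f} GC2={gc2:.2f} GC3={gc3:.2f}")
```

Large differences among taxa in per-position GC content, particularly at 3rd positions, are the kind of compositional heterogeneity the abstract warns about. The reduced alignment would then be passed to whatever tree-inference software is in use.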

Created:

2014-07-29