Data from: More on the best evolutionary rate for phylogenetic analysis

Source: DataONE, updated 2017-05-25, indexed 2024-06-26
Download link:
https://search.dataone.org/view/null
Description:
The accumulation of genome-scale molecular datasets for non-model taxa brings us ever closer to resolving the tree of life of all living organisms. However, despite the depth of data available, a number of studies that each used thousands of genes have reported conflicting results. The focus of phylogenomic projects must thus shift to more careful experimental design. Even though we still have a limited understanding of what are the best predictors of the phylogenetic informativeness of a gene, there is wide agreement that one key factor is its evolutionary rate; but there is no consensus as to whether the rates derived as optimal in various analytical, empirical, and simulation approaches have any general applicability. We here use simulations to infer optimal rates in a set of realistic phylogenetic scenarios with varying tree sizes, numbers of terminals, and tree shapes. Furthermore, we study the relationship between the optimal rate and rate-variation among sites and among lineages. Finally, we examine how well the predictions made by a range of experimental-design methods correlate with the observed performance in our simulations. We find that the optimal level of divergence is surprisingly robust to differences in taxon sampling and even to among-site and among-lineage rate variation as often encountered in empirical datasets. This finding encourages the use of methods that rely on a single optimal rate to predict a gene’s utility. 
Focusing on correct recovery either of the most basal node in the phylogeny or of the entire topology, the optimal rate is about 0.45 substitutions from root to tip in average Yule trees and about 0.2 in difficult trees with short basal and long apical branches, but all rates leading to divergence levels between about 0.1 and 0.5 perform reasonably well. Testing the performance of six methods that can be used to predict a gene’s utility against our simulation results, we find that the probability of resolution, signal-noise analysis, and Fisher information are good predictors of phylogenetic informativeness, but they require specification of at least part of a model tree. Likelihood quartet mapping also shows very good performance, but only requires sequence alignments and is thus applicable without making assumptions about the phylogeny. Although they are the most commonly used methods for experimental design, geometric quartet mapping and the integration of phylogenetic informativeness curves perform rather poorly in our comparison. Instead of derived predictors of phylogenetic informativeness, we suggest that the number of sites in a gene that evolve at near-optimal rates (as inferred here) could be used directly to prioritize genes for phylogenetic inference. In combination with measures of model fit, especially with respect to compositional biases and among-site and among-lineage rate variation, such an approach has the potential to greatly improve marker choice and should be tested on empirical data.
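
The closing suggestion of the abstract, counting the sites in a gene that evolve at near-optimal rates, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that per-site relative rates have already been estimated by some external tool, and that multiplying a site's rate by the root-to-tip tree height gives its expected root-to-tip divergence. The function names, the `tree_height` parameter, and the example values are all hypothetical; only the 0.1 to 0.5 window comes from the description above.

```python
from typing import Dict, List, Tuple

# Divergence window (substitutions per site, root to tip) that the abstract
# reports as performing reasonably well. Treating it as a hard cutoff is an
# assumption made for this illustration.
LOW, HIGH = 0.1, 0.5


def count_near_optimal_sites(
    site_rates: List[float],
    tree_height: float = 1.0,
    low: float = LOW,
    high: float = HIGH,
) -> int:
    """Count sites whose expected root-to-tip divergence lies in [low, high].

    site_rates: hypothetical per-site rates (substitutions per site per unit time).
    tree_height: hypothetical root-to-tip time of the tree of interest.
    """
    return sum(1 for rate in site_rates if low <= rate * tree_height <= high)


def rank_genes(
    gene_site_rates: Dict[str, List[float]],
    tree_height: float = 1.0,
) -> List[Tuple[str, int]]:
    """Rank genes by their number of near-optimal sites, highest first."""
    counts = [
        (gene, count_near_optimal_sites(rates, tree_height))
        for gene, rates in gene_site_rates.items()
    ]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Made-up rates for three hypothetical genes, purely to show the call pattern.
example = {
    "geneA": [0.05, 0.2, 0.45, 0.9],
    "geneB": [0.3, 0.4, 0.15, 0.25],
    "geneC": [0.01, 0.02, 1.5, 2.0],
}
ranking = rank_genes(example)
for gene, n_sites in ranking:
    print(gene, n_sites)
```

A raw count favours longer genes. Whether to normalize by gene length, or to combine the count with measures of model fit as the description proposes, is left open here.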

Created: 2017-05-25