Data from: The multispecies coalescent model outperforms concatenation across diverse phylogenomic
Source: DataCite Commons. Updated: 2025-04-01. Indexed: 2025-04-09.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.7q6q3s0
Description:
A statistical framework of model comparison and model validation is
essential to resolving the debates over concatenation and coalescent
models in phylogenomic data analysis. A set of statistical tests are here
applied and developed to evaluate and compare the adequacy of
substitution, concatenation, and multispecies coalescent (MSC) models
across 47 phylogenomic data sets collected across the tree of life. Tests for
substitution models and the concatenation assumption of topologically
concordant gene trees suggest that a poor fit of substitution models (44%
of loci rejecting the substitution model) and concatenation models (38% of
loci rejecting the hypothesis of topologically congruent gene trees) is
widespread. Logistic regression shows that the proportions of GC
content and informative sites are both negatively correlated with the fit
of substitution models across loci. Moreover, a substantial violation of
the concatenation assumption of congruent gene trees is consistently
observed across 6 major groups (birds, mammals, fish, insects, reptiles,
and others, including other invertebrates). In contrast, Bayesian model
validation and comparison analyses conducted even on data sets reduced for
computational efficiency suggest that, among those loci adequately
described by a given substitution model, the proportion of loci rejecting
the MSC model is 11%, significantly lower than those rejecting the
substitution and concatenation models, and Bayesian model comparison
strongly favors the MSC over concatenation across all data sets. Species
tree inference suggests that loci rejecting the MSC have little effect on
species tree estimation. Our analysis reveals the value of model
validation and comparison in phylogenomic data analysis, as well as the
need for further improvements of multilocus models and computational tools
for phylogenetic inference.
Provider:
Dryad
Created:
2020-01-06



