Data from: Who let the CAT out of the bag? Accurately dealing with substitutional heterogeneity in phylogenomic analyses

DataONE · Updated 2016-09-07 · Indexed 2024-06-26

Download link:

https://search.dataone.org/view/null

Description:

As phylogenetic datasets have increased in size, site-heterogeneous substitution models such as CAT-F81 and CAT-GTR have been advocated in favor of other models because they purportedly suppress long-branch attraction (LBA). These models are two of the most commonly used models in phylogenomics, and they have been applied to a variety of taxa ranging from Drosophila to land plants. However, many arguments in favor of CAT models have been based on tenuous assumptions about the true phylogeny rather than rigorous testing with known trees via simulation. Moreover, CAT models have not been compared to other approaches for handling substitutional heterogeneity such as data partitioning with site-homogeneous substitution models. We simulated amino acid sequence datasets with substitutional heterogeneity on a variety of tree shapes including those susceptible to LBA. Data were analyzed with both CAT models and partitioning to explore model performance; in total over 670,000 CPU hours were used, of which over 97% was spent running analyses with CAT models. In many cases, all models recovered branching patterns that were identical to the known tree. However, CAT-F81 consistently performed worse than other models in inferring the correct branching patterns, and both CAT models often overestimated substitutional heterogeneity. Additionally, reanalysis of two empirical metazoan datasets supports the notion that CAT-F81 tends to recover less accurate trees than data partitioning and CAT-GTR. Given these results, we conclude that partitioning and CAT-GTR perform similarly in recovering accurate branching patterns. However, computation time can be orders of magnitude less for data partitioning, with commonly used implementations of CAT-GTR often failing to reach completion in a reasonable time frame (i.e., for Bayesian analyses to converge). 
Practices such as removing constant sites and parsimony uninformative characters, or using CAT-F81 when CAT-GTR is deemed too computationally expensive, cannot be logically justified. Given clear problems with CAT-F81, phylogenies previously inferred with this model should be reassessed.

Created:

2016-09-07