Data from: Who let the CAT out of the bag? accurately dealing with substitutional heterogeneity in phylogenomic analyses
Source: DataCite Commons | Updated: 2025-05-01 | Indexed: 2025-05-10
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.85b2m
Description:
As phylogenetic datasets have increased in size, site-heterogeneous
substitution models such as CAT-F81 and CAT-GTR have been advocated over
other models because they purportedly suppress long-branch
attraction (LBA). These models are two of the most commonly used models in
phylogenomics, and they have been applied to a variety of taxa ranging
from Drosophila to land plants. However, many arguments in favor of CAT
models have been based on tenuous assumptions about the true phylogeny
rather than rigorous testing with known trees via simulation. Moreover,
CAT models have not been compared to other approaches for handling
substitutional heterogeneity such as data partitioning with
site-homogeneous substitution models. We simulated amino acid sequence
datasets with substitutional heterogeneity on a variety of tree shapes
including those susceptible to LBA. Data were analyzed with both CAT
models and partitioning to explore model performance; in total over
670,000 CPU hours were used, of which over 97% was spent running analyses
with CAT models. In many cases, all models recovered branching patterns
that were identical to the known tree. However, CAT-F81 consistently
performed worse than other models in inferring the correct branching
patterns, and both CAT models often overestimated substitutional
heterogeneity. Additionally, reanalysis of two empirical metazoan datasets
supports the notion that CAT-F81 tends to recover less accurate trees than
data partitioning and CAT-GTR. Given these results, we conclude that
partitioning and CAT-GTR perform similarly in recovering accurate
branching patterns. However, computation time can be orders of magnitude
less for data partitioning, with commonly used implementations of CAT-GTR
often failing to reach completion in a reasonable time frame (i.e., for
Bayesian analyses to converge). Practices such as removing constant sites
and parsimony-uninformative characters, or using CAT-F81 when CAT-GTR is
deemed too computationally expensive, cannot be logically justified. Given
clear problems with CAT-F81, phylogenies previously inferred with this
model should be reassessed.
Provider:
Dryad
Created:
2016-09-07



