Recoding amino acids to a reduced alphabet may increase or decrease phylogenetic accuracy
Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2025-06-15.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.6djh9w11s
Description:
Common molecular phylogenetic characteristics such as long branches and
compositional heterogeneity can be problematic for phylogenetic
reconstruction when using amino acid data. Recoding alignments to reduced
alphabets before phylogenetic analysis has often been used both to explore
and potentially decrease the effect of such problems. We tested the
effectiveness of this strategy on topological accuracy using simulated
data on four-taxon trees. We simulated alignments in phylogenetically
challenging ways to test the phylogenetic accuracy of analyses using
various recoding strategies together with commonly used homogeneous
models. We tested three recoding methods based on amino acid
exchangeability, and another recoding method based on lowering the
compositional heterogeneity among alignment sequences as measured by the
Chi-squared statistic. Our simulation results show that on trees with long
branches where sequences approach saturation, accuracy was not greatly
affected by exchangeability-based recoding, but Chi-squared-based recoding
decreased accuracy. We then simulated sequences with different kinds of
compositional heterogeneity over the tree. Recoding often increased
accuracy on such alignments. Exchangeability-based recoding was rarely
worse than not recoding, and often considerably better. Recoding based on
lowering the Chi-squared value improved accuracy in some cases but not in
others, suggesting that low compositional heterogeneity by itself is not
sufficient to increase accuracy in the analysis of these alignments. We
also simulated alignments using site-specific amino acid profiles, making
sequences that had compositional heterogeneity over alignment sites.
Exchangeability-based recoding coupled with site-homogeneous models had
poor accuracy for these datasets but Chi-squared-based recoding on these
alignments increased accuracy. We then simulated datasets that were
compositionally both site- and tree-heterogeneous, like many real
datasets. The effect on accuracy of recoding such doubly problematic
datasets varied widely, depending on the type of compositional
tree-heterogeneity and on the recoding scheme. Interestingly, analysis of
unrecoded compositionally heterogeneous alignments with the NDCH or CAT
models was generally more accurate than analysis with homogeneous models,
whether the alignments were recoded or not. Overall, our results suggest that
making trees for recoded
amino acid datasets can be useful, but they need to be interpreted
cautiously as part of a more comprehensive analysis. The use of
better-fitting models like NDCH and CAT, which directly account for the patterns
in the data, may offer a more promising long-term solution for analysing
empirical data.
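
The two ideas the description relies on, recoding to a reduced alphabet and measuring compositional heterogeneity with a Chi-squared statistic, can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the dataset: the description does not name its recoding schemes, so the widely used Dayhoff 6-state grouping stands in as an example of an exchangeability-based scheme, and the statistic is the standard Chi-squared computed on the sequences-by-states count table, which may differ in detail from the authors' calculation. The function names `recode` and `chi_squared` are hypothetical.

```python
from collections import Counter

# Dayhoff 6-state grouping, used here only as an illustrative
# exchangeability-based scheme (an assumption, not the dataset's own choice).
DAYHOFF6 = {
    "0": "AGPST",
    "1": "DENQ",
    "2": "HKR",
    "3": "ILMV",
    "4": "FWY",
    "5": "C",
}
RECODE = {aa: state for state, group in DAYHOFF6.items() for aa in group}

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
RECODED_STATES = "012345"


def recode(seq):
    """Map each amino acid to its group label; gaps and unknowns pass through."""
    return "".join(RECODE.get(ch, ch) for ch in seq.upper())


def chi_squared(seqs, alphabet):
    """Chi-squared statistic of the sequences-by-states count table.

    Expected counts assume every sequence shares the pooled composition.
    Characters outside `alphabet` (gaps, ambiguity codes) are ignored.
    """
    counts = [Counter(ch for ch in s if ch in alphabet) for s in seqs]
    row_totals = [sum(c.values()) for c in counts]
    grand = sum(row_totals)
    if grand == 0:
        return 0.0
    stat = 0.0
    for state in alphabet:
        col_total = sum(c[state] for c in counts)
        for c, row_total in zip(counts, row_totals):
            expected = row_total * col_total / grand
            if expected == 0:
                continue  # state or sequence contributes no counts
            stat += (c[state] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    alignment = ["ACDEFGHIKL", "ACDEFGHIKV", "MCDQYGRIKL"]
    recoded = [recode(s) for s in alignment]
    print(recoded)
    print(chi_squared(alignment, AMINO_ACIDS))
    print(chi_squared(recoded, RECODED_STATES))
```

Comparing the statistic before and after recoding shows how merging exchangeable amino acids changes the measured heterogeneity. As the description notes, a lower value by itself does not guarantee a more accurate tree.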
Provider:
Dryad
Created:
2022-03-01



