Reference genome choice and filtering thresholds jointly influence phylogenomic analyses
Source: DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-04-10
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.djh9w0w2g
Description:
Molecular phylogenies are a cornerstone of modern comparative biology and
are commonly employed to investigate a range of biological phenomena, such
as diversification rates, patterns in trait evolution, biogeography, and
community assembly. Recent work has demonstrated that significant biases
may be introduced into downstream phylogenetic analyses from processing
genomic data; however, it remains unclear whether there are interactions
among bioinformatic parameters or biases introduced through the choice of
reference genome for sequence alignment and variant-calling. We address
these knowledge gaps by employing a combination of simulated and empirical
data sets to investigate to what extent the choice of reference genome in
upstream bioinformatic processing of genomic data influences phylogenetic
inference, as well as the way that reference genome choice interacts with
bioinformatic filtering choices and phylogenetic inference method. We
demonstrate that more stringent minor allele filters bias inferred trees
away from the true species tree topology, and that these biased trees tend
to be more imbalanced and have a higher center of gravity than the true
trees. We found the greatest topological accuracy when filtering sites for
minor allele count > 3–4 in our 51-taxon data sets, while tree
center of gravity was closest to the true value when filtering for sites
with minor allele count > 1–2. In contrast, filtering for missing
data increased accuracy in the inferred topologies; however, this effect
was small in comparison to the effect of minor allele filters and may be
undesirable because it distorts the mutation spectrum. The bias
introduced by these filters differs based on the reference genome used in
short read alignment, providing further support that choosing a reference
genome for alignment is an important bioinformatic decision with
implications for downstream analyses. These results demonstrate that
attributes of the study system and data set (and their interaction) add
important nuance to how best to assemble and filter short read genomic
data for phylogenetic inference.
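
To make the two filter types discussed above concrete, the following is a minimal sketch of a minor allele count (MAC) filter combined with a missing-data filter. It is not the authors' pipeline. It assumes biallelic sites in diploid samples, with each genotype coded as the number of alternate alleles (0, 1, or 2) or `None` when missing. The function names and the parameters `min_mac_exclusive` and `max_missing` are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Sequence

# Assumption: diploid, biallelic genotypes coded as alternate-allele counts
# (0, 1, 2), with None marking a missing call.
Genotype = Optional[int]


def minor_allele_count(site: Sequence[Genotype]) -> int:
    """Count copies of the rarer allele among called genotypes at one site."""
    called = [g for g in site if g is not None]
    alt_copies = sum(called)
    total_copies = 2 * len(called)  # two allele copies per diploid sample
    return min(alt_copies, total_copies - alt_copies)


def missing_fraction(site: Sequence[Genotype]) -> float:
    """Fraction of samples with no genotype call at one site."""
    if not site:
        return 1.0  # treat a site with no samples as entirely missing
    return sum(g is None for g in site) / len(site)


def filter_sites(
    sites: Sequence[Sequence[Genotype]],
    min_mac_exclusive: int,
    max_missing: float,
) -> List[int]:
    """Return indices of sites with MAC > min_mac_exclusive and
    missing fraction <= max_missing."""
    return [
        i
        for i, site in enumerate(sites)
        if minor_allele_count(site) > min_mac_exclusive
        and missing_fraction(site) <= max_missing
    ]


if __name__ == "__main__":
    # Four sites (rows) genotyped in four samples (columns).
    example = [
        [0, 0, 0, 1],        # MAC 1, no missing data
        [1, 1, 0, 2],        # MAC 4, no missing data
        [None, None, 2, 0],  # MAC 2, half the calls missing
        [2, 2, 2, 2],        # MAC 0 (invariant among samples)
    ]
    print(filter_sites(example, min_mac_exclusive=1, max_missing=0.25))
```

Raising `min_mac_exclusive` removes more rare variants, and lowering `max_missing` removes more sparsely genotyped sites. Because the abstract reports that these filters interact with reference genome choice, any threshold used with a sketch like this should be treated as dataset-specific rather than as a general recommendation.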
Provider:
Dryad
Created:
2023-11-08



