Accuracy in near-perfect virus phylogenies
Source: DataCite Commons. Updated: 2026-03-05. Indexed: 2026-04-25.
Download link:
https://datadryad.org/dataset/doi:10.6076/D12S3M
Description:
Phylogenetic trees from real-world data often include short edges with
very few substitutions per site, which can lead to partially resolved
trees and poor accuracy. Theory indicates that the number of sites needed
to accurately reconstruct a fully resolved tree grows at a rate
proportional to the inverse square of the length of the shortest edge.
However, when inferred trees are partially resolved due to short edges,
"accuracy" should be defined as the rate of discovering
false splits (clades on a rooted tree) relative to the number of splits
actually found. Thus, accuracy can be high even if short edges are common.
Specifically, in a "near-perfect" parameter space in which trees
are large, the tree length ξ (the sum of all edge lengths) is small, and
rate variation is minimal, the expected false positive rate is less than
ξ/3; the exact value depends on tree shape and sequence length. This
expected false positive rate is far below the false negative rate for
small ξ and often well below 5% even when some assumptions are
relaxed. We show this result analytically for maximum parsimony and
explore its extension to maximum likelihood using theory and simulations.
For hypothesis testing, we show that measures of split
"support" that rely on bootstrap resampling
consistently imply weaker support than that implied by the false positive
rates in near-perfect trees. The near-perfect parameter space closely fits
several empirical studies of human virus diversification during outbreaks
and epidemics, including Ebolavirus, Zika virus, and SARS-CoV-2,
reflecting low substitution rates relative to high transmission/sampling
rates in these viruses.
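
The two quantitative statements in the description (the ξ/3 bound on the expected false positive rate, and the inverse-square scaling of the number of sites needed for full resolution) can be illustrated with a minimal Python sketch. The edge lengths, the 100-tip tree size, and the function names below are hypothetical and are not taken from the dataset. The bound is only meaningful inside the near-perfect parameter space described above, and the proportionality constant for the number of sites is not given here, so only a ratio is computed.

```python
import math


def tree_length(edge_lengths):
    """Tree length xi: the sum of all edge lengths (substitutions per site)."""
    return math.fsum(edge_lengths)


def fpr_upper_bound(edge_lengths):
    """Upper bound xi/3 on the expected false positive rate.

    Assumes the near-perfect setting from the description: a large tree,
    small xi, and minimal rate variation. The exact expected rate is lower
    and depends on tree shape and sequence length.
    """
    return tree_length(edge_lengths) / 3.0


def relative_sites_needed(shortest_edge_a, shortest_edge_b):
    """Ratio (tree A : tree B) of sites needed for a fully resolved tree.

    Assumes only the stated inverse-square scaling with the shortest edge;
    the unknown proportionality constant cancels in the ratio.
    """
    return (shortest_edge_b / shortest_edge_a) ** 2


# Hypothetical example: an unrooted binary tree with 100 tips has
# 2 * 100 - 3 = 197 edges; every edge is given the same short length.
edges = [0.0005] * 197

print("tree length xi:", tree_length(edges))
print("false positive rate bound xi/3:", fpr_upper_bound(edges))
print("sites ratio when the shortest edge is halved:",
      relative_sites_needed(0.0005, 0.001))
```

Under these assumed inputs the bound falls below the 5% level mentioned in the description, while halving the shortest edge quadruples the number of sites needed to resolve the tree fully.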
Provider:
Dryad
Created:
2021-08-06



