Data from: To include or not to include: the impact of gene filtering on species tree estimation methods
Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2025-06-15.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.km24v
Description:
With the increasing availability of whole genome data, many species trees
are being constructed from hundreds to thousands of loci. Although
concatenation analysis using maximum likelihood is a standard approach for
estimating species trees, it does not account for gene tree heterogeneity,
which can occur due to many biological processes, such as incomplete
lineage sorting. Coalescent species tree estimation methods, many of which
are statistically consistent in the presence of incomplete lineage
sorting, include Bayesian methods that co-estimate the gene trees and the
species tree, summary methods that compute the species tree by combining
estimated gene trees, and site-based methods that infer the species tree
from site patterns in the alignments of different loci. Due to concerns
that poor quality loci will reduce the accuracy of estimated species
trees, many recent phylogenomic studies have removed or filtered genes on
the basis of phylogenetic signal and/or missing data prior to inferring
species trees; little is known about the performance of species tree
estimation methods when gene filtering is performed. We examine how
incomplete lineage sorting, phylogenetic signal of individual loci, and
missing data affect the absolute and the relative accuracy of species tree
estimation methods and show how these properties affect methods'
responses to gene filtering strategies. In particular, summary methods
(ASTRAL-II, ASTRID, and MP-EST), a site-based coalescent method
(SVDquartets within PAUP), and an unpartitioned concatenation analysis
using maximum likelihood (RAxML) were evaluated on a heterogeneous
collection of simulated multi-locus datasets, and the following trends
were observed. Filtering genes based on gene tree estimation error
improved the accuracy of the summary methods when levels of incomplete
lineage sorting were low to moderate but did not benefit the summary
methods under higher levels of incomplete lineage sorting, unless gene
tree estimation error was also extremely high (a model condition with few
replicates). Neither SVDquartets nor concatenation analysis using RAxML
benefited from filtering genes on the basis of gene tree estimation error.
Finally, filtering genes based on missing data was either neutral (i.e.,
did not impact accuracy) or else reduced the accuracy of all five methods.
By providing insight into the consequences of gene filtering, we offer
recommendations for estimating species trees in the presence of incomplete
lineage sorting and reconcile seemingly conflicting observations made in
prior studies regarding the impact of gene filtering.
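
As an illustration of what filtering genes on the basis of missing data can look like in practice, here is a minimal Python sketch. It is not the authors' pipeline: it assumes that "missing data" for a gene means the fraction of the study's taxa absent from that gene, and the function names, the toy input, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Set


def missing_taxon_fraction(gene_taxa: Set[str], all_taxa: Set[str]) -> float:
    """Fraction of the study's taxa that are absent from one gene.

    Assumption: missing data is measured as missing taxa per gene.
    """
    if not all_taxa:
        raise ValueError("all_taxa must not be empty")
    return len(all_taxa - gene_taxa) / len(all_taxa)


def filter_genes_by_missing_data(
    genes: Dict[str, Set[str]],
    all_taxa: Set[str],
    max_missing: float,
) -> List[str]:
    """Return the names of genes whose missing-taxon fraction is at most max_missing."""
    return [
        name
        for name, taxa in genes.items()
        if missing_taxon_fraction(taxa, all_taxa) <= max_missing
    ]


# Hypothetical toy input: four taxa and three genes with varying taxon coverage.
all_taxa = {"A", "B", "C", "D"}
genes = {
    "gene1": {"A", "B", "C", "D"},  # no taxa missing
    "gene2": {"A", "B", "C"},       # one of four taxa missing
    "gene3": {"A"},                 # three of four taxa missing
}

kept = filter_genes_by_missing_data(genes, all_taxa, max_missing=0.5)
print(kept)
```

The retained genes would then be passed to a species tree estimation method. Note that, per the description above, this kind of filtering was either neutral or reduced accuracy for all five methods evaluated, so the sketch shows the mechanics only and is not a recommendation.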
Provider:
Dryad
Created:
2017-09-25



