Completing gene trees without species trees in sub-quadratic time

Mendeley Data | Updated 2024-03-27 | Indexed 2024-06-29

Download link:

https://datadryad.org/stash/dataset/doi:10.6076/D1N30V

Description:

We test tripVote on published simulated datasets and a real plant dataset from the OneKP Initiative. Both simulated datasets were created using SimPhy to generate gene and species trees under the multi-species coalescent model with heterogeneous parameters, and INDELible to simulate nucleotide sequences on the gene trees according to the GTR + Γ model with varying sequence lengths and sequence evolution parameters. FastTree2 was used to estimate gene trees under the GTR + Γ model. The 201-taxon dataset by Mirarab and Warnow (2015) includes incomplete lineage sorting (ILS). We use three model conditions with 200 taxa where tree length is 2M and speciation rate is 1e-6, and use the first 20 of the 50 replicates of the original dataset (to save computational time). In each replicate, we use the first 500 (out of 1000) estimated gene trees that are fully resolved. The three model conditions have medium, high, and very high levels of ILS, resulting in 21%, 33%, and 69% RF distance between true gene trees and the species tree. They also have high levels of gene tree estimation error (15%, 22%, and 34% RF between estimated and true gene trees). The 31-taxon dataset by Mai et al. (2017) is used to examine the effect of clock deviations. We use the three model conditions with the root-to-crown ratio equal to 1.0 but varying levels of deviation from the clock (low, medium, high). We use only the first 20 of the 100 replicates of the original dataset because our experiments are computationally intensive. The average coefficients of variation of root-to-tip distances for the low, medium, and high deviation conditions are 0.04, 0.30, and 0.69, respectively. These replicates have a moderately high level of ILS (24% mean RF distance between true gene trees and the species tree). The amount of gene tree estimation error increases with deviation from the clock (RF errors are 30%, 41%, and 52%).
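The clock-deviation measure mentioned above, the coefficient of variation of root-to-tip distances, can be illustrated with a minimal sketch. The tree encoding (nested `(branch_length, children_or_leaf_name)` tuples), the function names, and the use of the population standard deviation are assumptions made for this illustration; they are not taken from the dataset or from the tripVote software.

```python
import statistics

# Hypothetical encoding: a node is (branch_length, payload), where payload is
# a leaf name (str) or a list of child nodes.

def root_to_tip_distances(node, dist_above=0.0):
    """Map each leaf name to the sum of branch lengths from the root."""
    length, payload = node
    dist = dist_above + length
    if isinstance(payload, str):
        return {payload: dist}
    out = {}
    for child in payload:
        out.update(root_to_tip_distances(child, dist))
    return out

def clock_deviation_cv(tree):
    """Coefficient of variation (population std / mean) of root-to-tip distances."""
    dists = list(root_to_tip_distances(tree).values())
    return statistics.pstdev(dists) / statistics.mean(dists)

# An ultrametric (perfectly clock-like) tree and one that deviates from the clock.
ultrametric = (0.0, [(1.0, [(1.0, "A"), (1.0, "B")]), (2.0, "C")])
non_clock = (0.0, [(1.0, [(1.0, "A"), (3.0, "B")]), (2.0, "C")])

print(clock_deviation_cv(ultrametric))
print(clock_deviation_cv(non_clock))
```

A strictly clock-like tree gives a value of zero, and the value grows as the tips drift away from a common root-to-tip distance.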
The real OneKP biological dataset of 1178 plants from the OneKP Initiative (2019) has 384 gene trees, all of which miss some of the species. The original study provides an ASTRAL species tree inferred from the 384 gene trees, each estimated using RAxML and containing at least 1178/2 = 589 species. We compare tripVote with two alternative tree completion algorithms: ASTRAL-completion, the method used in ASTRAL, and OCTAL, which minimizes the RF distance of each gene tree to the species tree. ASTRAL-completion is run using the ASTRAL software, and OCTAL is run using the TRACTION-RF software. In addition, to guide visualization and interpretation, we add a lower-bound control by randomly completing the gene trees (repeated 100 times and averaged). In simulated datasets, we randomly remove m leaves from each estimated gene tree to create incomplete gene trees; m∈{0,1,2,20,50,100} for the 201-taxon dataset and m∈{0,1,2,3,8,15} for the 31-taxon dataset. We use tripVote, OCTAL, and ASTRAL-completion to complete each set. For the 201-taxon dataset with m = 1, we compare the accuracy of tripVote with and without sampling. We also test the ability of tripVote to improve species tree estimation. On the 201-taxon dataset, we compare five versions of ASTRAL for inferring species trees from incomplete gene trees. (i) The default ASTRAL uses ASTRAL-completion to construct the search space and the original trees to score. (ii) We use tripVote in place of ASTRAL-completion but continue to score trees using the incomplete trees. (iii) We use OCTAL in place of ASTRAL-completion. Since running OCTAL requires a species tree, we use the ASTRAL species tree inferred in (i) as input to OCTAL. Thus, in this setting, ASTRAL is run twice. (iv) We use the gene trees completed by ASTRAL-completion as input to ASTRAL, so that they are used for both search space creation and scoring. (v) Similarly, we use tripVote-completed trees as input.
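The step of randomly removing m leaves from a gene tree to create an incomplete tree can be sketched as follows. This is only an illustration under stated assumptions: trees are encoded as nested tuples of leaf names (topology only, no branch lengths), and the helper names `leaves`, `restrict`, and `drop_random_leaves` are hypothetical rather than part of tripVote, OCTAL, or ASTRAL.

```python
import random

# Hypothetical encoding: a leaf is a str, an internal node is a tuple of subtrees.

def leaves(tree):
    """Return the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

def restrict(tree, keep):
    """Restrict a tree to the leaves in `keep`, suppressing unary nodes.

    Returns None when no leaf of the subtree is kept.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]  # a node with a single surviving child is collapsed
    return tuple(kids)

def drop_random_leaves(tree, m, rng):
    """Remove m leaves chosen uniformly at random; return (pruned tree, removed set)."""
    names = leaves(tree)
    removed = set(rng.sample(names, m))
    return restrict(tree, set(names) - removed), removed

gene_tree = ((("A", "B"), "C"), (("D", "E"), "F"))
pruned, removed = drop_random_leaves(gene_tree, 2, random.Random(0))
print(sorted(removed), pruned)
```

Passing an explicit `random.Random` instance keeps the pruning reproducible across replicates, which matters when several completion methods are compared on the same incomplete trees.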
We measure the error of these ASTRAL trees by computing their RF distances to the true species tree. We also test tripVote and ASTRAL-completion on their ability to root an unrooted gene tree with respect to other rooted gene trees. On the simulated datasets, we remove the outgroup from a set of n − k gene trees (arbitrarily selected) and use the k remaining trees to infer the outgroup placement. We vary k in {1,10,50,100,250,500}. We compare tripVote to other rooting methods: outgroup rooting (rooting at the original placement of the outgroup before its removal), mid-point rooting, MinVar rooting, and random rooting (as a control). For the OneKP dataset, we set up two versions: one where the original gene trees are used directly and one with extra missing data, where we prune out an extra p% of the taxa from each gene tree (for p∈{5,10,15,20}).

Created:

2023-06-28
