Completing gene trees without species trees in sub-quadratic time

Mendeley Data. Updated 2024-04-13. Indexed 2024-06-27.

Download link:

https://datadryad.org/stash/dataset/doi:10.6076/D1N30V

Description:

We test tripVote on published simulated datasets and on a real plant dataset from the OneKP Initiative. Both simulated datasets were created using Simphy to generate gene and species trees under the multi-species coalescent model with heterogeneous parameters, and Indelible to simulate nucleotide sequences on the gene trees according to the GTR + Γ model with varying sequence lengths and sequence evolution parameters. FastTree2 was used to estimate gene trees under the GTR + Γ model. The 201-taxon dataset by Mirarab and Warnow (2015) includes incomplete lineage sorting (ILS). We use three model conditions with 200 taxa where tree length is 2M and speciation rate is 1e-6, and we use the first 20 of the 50 replicates of the original dataset (to save computational time). In each replicate, we use the first 500 (out of 1000) estimated gene trees that are fully resolved. The three model conditions have medium, high, and very high levels of ILS, resulting in 21, 33, and 69% RF distance between true gene trees and the species tree. They also have high levels of gene tree estimation error (15, 22, and 34% RF between estimated and true gene trees). The 31-taxon dataset by Mai et al. (2017) is used to examine the effect of clock deviations. We use the three model conditions with the root-to-crown ratio equal to 1.0 but with varying levels of deviation from the clock (low, medium, high). We use only the first 20 of the 100 replicates of the original dataset because our experiments are computationally intensive. The average coefficient of variation of root-to-tip distances is 0.04, 0.30, and 0.69 for the low, medium, and high deviation conditions, respectively. These replicates have a moderately high level of ILS (24% mean RF distance between true gene trees and the species tree). The amount of gene tree estimation error increases with deviation from the clock (RF errors are 30, 41, and 52%).
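The clock-deviation measure mentioned above, the coefficient of variation of root-to-tip distances, can be illustrated with a minimal Python sketch. The tree encoding (a node is a pair of either a leaf name or a list of children, plus a branch length) and the function names are assumptions made for this example and are not part of the dataset or of tripVote. The sketch uses the population standard deviation; the original study may have used a different estimator.

```python
from statistics import mean, pstdev


def root_to_tip_distances(node, depth=0.0):
    """Yield the root-to-tip path length of every leaf.

    Hypothetical encoding: node = (content, branch_length), where content
    is a leaf name (str) or a list of child nodes.
    """
    content, length = node
    depth += length
    if isinstance(content, str):
        yield depth
    else:
        for child in content:
            yield from root_to_tip_distances(child, depth)


def clock_deviation(tree):
    """Coefficient of variation (population std / mean) of root-to-tip distances."""
    dists = list(root_to_tip_distances(tree))
    return pstdev(dists) / mean(dists)


# An ultrametric (perfectly clock-like) tree: every tip is at distance 1.0.
clock_tree = ([("A", 1.0), ([("B", 0.5), ("C", 0.5)], 0.5)], 0.0)
# A tree whose two tips sit at distances 1.0 and 3.0 from the root.
skewed_tree = ([("A", 1.0), ("B", 3.0)], 0.0)
```

A clock-like tree gives a value of 0, and larger values indicate stronger deviation from the clock.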
The real OneKP biological dataset of 1178 plants from the OneKP Initiative (2019) has 384 gene trees, all of which miss some of the species. The original study provides an ASTRAL species tree inferred from the 384 gene trees, which were themselves inferred using RAxML and each include at least 1178/2 = 589 species. We compare tripVote with two alternative tree completion algorithms: ASTRAL-completion, the method used in ASTRAL, and OCTAL, which minimizes the RF distance of each gene tree to the species tree. ASTRAL-completion is run using the ASTRAL software, and OCTAL is run using the TRACTION-RF software. In addition, to guide visualization and interpretation, we add a lower-bound control by randomly completing the gene trees (repeated 100 times and averaged). In the simulated datasets, we randomly remove m leaves from each estimated gene tree to create incomplete gene trees; m∈{0,1,2,20,50,100} for the 201-taxon dataset and m∈{0,1,2,3,8,15} for the 31-taxon dataset. We use tripVote, OCTAL, and ASTRAL-completion to complete each set. For the 201-taxon dataset with m = 1, we compare the accuracy of tripVote with and without sampling. We also test the ability of tripVote to improve species tree estimation. On the 201-taxon dataset, we compare five versions of ASTRAL for inferring species trees from incomplete gene trees. (i) The default ASTRAL uses ASTRAL-completion to construct the search space and the original incomplete trees to score. (ii) We use tripVote in place of ASTRAL-completion but continue to score trees using the incomplete trees. (iii) We use OCTAL in place of ASTRAL-completion. Since running OCTAL requires a species tree, we use the ASTRAL species tree inferred in (i) as input to OCTAL. Thus, in this setting, ASTRAL is run twice. (iv) We use the gene trees completed by ASTRAL-completion as input to ASTRAL, so that they are used for both search space creation and scoring. (v) Similarly, we use tripVote-completed trees as input.
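The step of randomly removing m leaves from a gene tree can be sketched as follows. This is only an illustration under stated assumptions: trees are encoded as nested Python tuples of leaf names, and the helper names (`leaves`, `prune`, `remove_random_leaves`) are hypothetical, not taken from the authors' scripts.

```python
import random


def leaves(tree):
    """List the leaf names of a nested-tuple tree (a leaf is a str)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def prune(tree, drop):
    """Restrict the tree to leaves not in `drop`; return None if none remain."""
    if isinstance(tree, str):
        return None if tree in drop else tree
    kept = [p for p in (prune(child, drop) for child in tree) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # suppress the node left with a single child
    return tuple(kept)


def remove_random_leaves(tree, m, rng):
    """Remove m leaves chosen uniformly at random; return (pruned tree, removed set)."""
    drop = set(rng.sample(leaves(tree), m))
    return prune(tree, drop), drop


gene_tree = ((("A", "B"), "C"), ("D", "E"))
incomplete, removed = remove_random_leaves(gene_tree, 2, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the choice of removed leaves reproducible across runs.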
We measure the error of these ASTRAL trees by computing their RF distances to the true species tree. We also test tripVote and ASTRAL-completion on their ability to root an unrooted gene tree with respect to other rooted gene trees. On the simulated datasets, we remove the outgroup from a set of n − k gene trees (arbitrarily selected) and use the k remaining trees to infer the outgroup placement. We vary k in {1,10,50,100,250,500}. We compare tripVote to other rooting methods: outgroup rooting (rooting at the original placement of the outgroup before removing it), mid-point rooting, MinVar rooting, and random rooting (as a control). For the OneKP dataset, we set up two versions: one where the original gene trees are used directly and one with extra missing data, where we prune an extra p% of the taxa from each gene tree (for p∈{5,10,15,20}).
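The RF distances reported as percentages throughout this description can be illustrated with a small Python sketch of the normalized Robinson-Foulds distance between two trees on the same leaf set. The nested-tuple encoding and the function names are assumptions for this example. The normalization divides by the total number of non-trivial bipartitions in both trees, which equals 2(n − 3) for fully resolved trees; the original study may normalize differently.

```python
def leaf_set(tree):
    """Return the set of leaf names of a nested-tuple tree (a leaf is a str)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor leaf, so the result does not depend on where the tree is rooted.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:  # both sides have >= 2 leaves
            splits.add(side)
        return below

    visit(tree)
    return splits


def normalized_rf(t1, t2):
    """Symmetric difference of bipartition sets over their total count."""
    if leaf_set(t1) != leaf_set(t2):
        raise ValueError("trees must have the same leaf set")
    b1, b2 = bipartitions(t1), bipartitions(t2)
    total = len(b1) + len(b2)
    return len(b1 ^ b2) / total if total else 0.0


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t1_rerooted = (("A", "B"), ("C", ("D", "E")))  # same unrooted topology as t1
```

Because bipartitions are stored relative to a fixed anchor leaf, two different rootings of the same unrooted topology are at distance 0.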


Created:

2023-11-16
