Fragmentary gene sequences negatively impact gene tree and species tree reconstruction

Mendeley Data · Updated 2024-05-10 · Indexed 2024-06-27

Download link:

https://zenodo.org/records/8014826

Description:

Species tree reconstruction from genome-wide data is increasingly being attempted, in most cases using a two-step approach of first estimating individual gene trees and then summarizing them to obtain a species tree. The accuracy of this approach, which promises to account for gene tree discordance, depends on the quality of the inferred gene trees. At the same time, phylogenomic and phylotranscriptomic analyses typically use involved bioinformatics pipelines for data preparation. Errors and shortcomings resulting from these preprocessing steps may impact the species tree analyses at the other end of the pipeline. In this article, we first show that the presence of fragmentary data for some species in a gene alignment, as is often seen in real data, can result in substantial deterioration of the gene trees and, as a result, of the species tree. We then investigate a simple filtering strategy where individual fragmentary sequences are removed from individual genes but the rest of the gene is retained. Both in simulations and by reanalyzing a large insect phylotranscriptomic data set, we show the effectiveness of this simple filtering strategy.
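
As a rough illustration of the per-gene filtering described above, the sketch below drops sequences that cover too little of a gene alignment and keeps the remaining sequences of that gene. The function name `filter_fragmentary`, the choice of gap characters, and the 50% cutoff `min_fraction` are assumptions made for this example; the abstract does not state the criterion or threshold the authors used.

```python
from typing import Dict

# Characters treated as missing data (an assumption; adjust for your data type).
GAP_CHARS = frozenset("-?")


def filter_fragmentary(alignment: Dict[str, str],
                       min_fraction: float = 0.5) -> Dict[str, str]:
    """Return a copy of one gene alignment without fragmentary sequences.

    A sequence counts as fragmentary here when its share of non-gap
    characters, relative to the alignment length, is below min_fraction.
    The other sequences of the gene are kept unchanged.
    """
    if not alignment:
        return {}
    length = len(next(iter(alignment.values())))
    kept: Dict[str, str] = {}
    for taxon, seq in alignment.items():
        if len(seq) != length:
            raise ValueError(f"{taxon}: sequences must have equal length")
        non_gap = sum(1 for ch in seq if ch not in GAP_CHARS)
        if length > 0 and non_gap / length >= min_fraction:
            kept[taxon] = seq
    return kept


# Hypothetical toy alignment: sp_C covers only 2 of 10 columns.
gene = {
    "sp_A": "ACGTACGTAC",
    "sp_B": "ACGTACGTAA",
    "sp_C": "------GT--",
    "sp_D": "ACG-ACGTAC",
}
filtered = filter_fragmentary(gene, min_fraction=0.5)
print(sorted(filtered))
```

Filtering sequence by sequence, instead of discarding a whole gene or a whole species, keeps as much of each alignment as possible for gene tree estimation.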

Created:

2023-06-28