Data from: Minimizing the average distance to a closest leaf in a phylogenetic tree

DataONE2013-06-27 更新2024-06-27 收录

下载链接：

https://search.dataone.org/view/null

下载链接

链接失效反馈

官方服务：

资源简介：

When performing an analysis on a collection of molecular sequences, it can be convenient to reduce the number of sequences under consideration while maintaining some characteristic of a larger collection of sequences. For example, one may wish to select a subset of high-quality sequences that represent the diversity of a larger collection of sequences. One may also wish to specialize a large database of characterized “reference sequences” to a smaller subset that is as close as possible on average to a collection of “query sequences” of interest. Such a representative subset can be useful whenever one wishes to find a set of reference sequences that is appropriate to use for comparative analysis of environmentally-derived sequences, such as for selecting “reference tree” sequences for phylogenetic placement of metagenomic reads. In this paper we formalize these problems in terms of the minimization of the Average Distance to the Closest Leaf (ADCL) and investigate algorithms to perform the relevant minimization. We show that the greedy algorithm is not effective, show that a variant of the Partitioning Among Medoids (PAM) heuristic gets stuck in local minima, and develop an exact dynamic programming approach. Using this exact program we note that the performance of PAM appears to be good for simulated trees, and is faster than the exact algorithm for small trees. On the other hand, the exact program gives solutions for all numbers of leaves less than or equal to the given desired number of leaves, while PAM only gives a solution for the pre-specified number of leaves. Via application to real data, we show that the ADCL criterion chooses chimeric sequences less often than random subsets, while the maximization of phylogenetic diversity chooses them more often than random. These algorithms have been implemented in publicly available software.

在对分子序列集合开展分析工作时，若能在保留大规模序列集合部分关键特征的前提下缩减待考虑序列的数量，往往会带来诸多便捷。例如，研究人员可能希望选取能够代表大规模序列集合多样性的高质量序列子集；也可能需要将带有注释的"参考序列"大型数据库精简为规模更小的子集，使得该子集与目标"查询序列"集合的平均距离尽可能小。此类代表性子集在诸多场景中均可发挥作用：例如当需要寻找适用于环境来源序列比较分析的参考序列集时，比如为宏基因组读段（metagenomic reads）的系统发育放置（phylogenetic placement）选取"参考树"序列时。本文中，我们以"最邻近叶节点平均距离（Average Distance to the Closest Leaf, ADCL）"最小化为目标，对上述问题进行形式化建模，并研究用于实现相关最小化的算法。我们证明了贪心算法在此场景下效果不佳，同时发现围绕中心点划分算法（Partitioning Among Medoids, PAM）的启发式变体容易陷入局部最优，进而提出了一种精确动态规划方法。借助该精确求解程序，我们发现PAM在模拟树场景下表现良好，且对于小规模树而言，其运行速度快于精确算法。而精确程序可生成所有不超过指定目标叶节点数的叶节点数量对应的解，PAM则仅能针对预先指定的叶节点数生成单一解。通过对真实数据的应用测试，我们发现ADCL准则选取嵌合序列的概率低于随机子集，而系统发育多样性最大化准则选取嵌合序列的概率则高于随机子集。上述算法已在公开可用的软件中实现。

创建时间：

2013-06-27

5,000+

优质数据集

54 个

任务类型

进入经典数据集