TreeCluster: Clustering biological sequences using phylogenetic trees

Source: NIAID Data Ecosystem (indexed 2026-03-11)

Download link:

https://figshare.com/articles/dataset/TreeCluster_Clustering_biological_sequences_using_phylogenetic_trees/9718997

Description:

Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
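
To make constraint (1) concrete, the following is a minimal Python sketch of a bottom-up greedy rule for diameter-limited clustering. It is an illustration only and is not taken from the TreeCluster code base: the `Node` class, the function name `max_diameter_clusters`, and the example tree are all hypothetical. For readability it sorts child arms and copies leaf lists, so it does not achieve the linear running time mentioned in the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical minimal tree node (not TreeCluster's own data structure)."""
    name: Optional[str] = None      # leaf label; None for internal nodes
    length: float = 0.0             # length of the branch above this node
    children: List["Node"] = field(default_factory=list)


def max_diameter_clusters(root: Node, threshold: float) -> List[List[str]]:
    """Cut branches bottom-up so that no cluster has two leaves farther
    apart (in tree distance) than `threshold`."""
    clusters: List[List[str]] = []

    def visit(node: Node) -> Tuple[float, List[str]]:
        # Returns (distance to the deepest leaf still attached below `node`,
        #          labels of all leaves still attached below `node`).
        if not node.children:
            return 0.0, [node.name]

        # One "arm" per child: depth measured from `node`, plus its leaves.
        arms = []
        for child in node.children:
            depth, attached = visit(child)
            arms.append((depth + child.length, attached))
        arms.sort(key=lambda arm: arm[0])

        # The longest path through `node` joins its two deepest arms.
        # While that path exceeds the threshold, cut off the deepest arm
        # and emit its leaves as a finished cluster.
        while len(arms) >= 2 and arms[-1][0] + arms[-2][0] > threshold:
            clusters.append(arms.pop()[1])

        kept = [leaf for _, attached in arms for leaf in attached]
        return arms[-1][0], kept

    _, remaining = visit(root)
    clusters.append(remaining)      # whatever is still attached to the root
    return clusters


# Hypothetical example tree, in Newick notation: ((A:1,B:1):1,(C:1,D:1):1);
example = Node(children=[
    Node(length=1.0, children=[Node("A", 1.0), Node("B", 1.0)]),
    Node(length=1.0, children=[Node("C", 1.0), Node("D", 1.0)]),
])
print(max_diameter_clusters(example, threshold=2.5))
```

With a threshold of 2.5, the sibling pairs (distance 2) stay together while the root is split, because any path across the root has length 4; the sketch therefore reports two clusters, {A, B} and {C, D}. Raising the threshold to 4 keeps all four leaves in a single cluster.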

Created:

2019-08-22
