Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus

NIAID Data Ecosystem, indexed 2026-03-10

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.nc220

Description:

Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important but increasingly computationally expensive step, given the recent surge in DNA sequence data. Much of this sequence data is publicly available, but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected “by eye” prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear whether these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combine automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set composed exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach followed by manual editing by eye, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared the tree topologies resulting from these different alignment methods using multiple metrics. We further determined whether the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods, even in data sets that are extremely difficult to align. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need revision.
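
The description does not say which topology metrics were used or how monophyly was tested. As a rough illustration of the two kinds of comparison it mentions, the sketch below assumes rooted trees written as nested tuples, uses a clade-based symmetric difference (a rooted variant of the Robinson-Foulds idea, chosen here only as an example), and checks whether a group of taxa forms a single clade. The taxon labels and function names are hypothetical, and branch support thresholds are not modeled.

```python
from typing import FrozenSet, Iterable, Set, Tuple, Union

# A rooted tree is either a leaf label or a tuple of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the set of leaf labels found below every node, leaves included."""
    found: Set[FrozenSet[str]] = set()

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def is_monophyletic(tree: Tree, group: Iterable[str]) -> bool:
    """True if the taxa in `group` make up exactly one clade of `tree`."""
    return frozenset(group) in clades(tree)


def clade_distance(tree_a: Tree, tree_b: Tree) -> int:
    """Count clades present in one tree but not the other (symmetric difference)."""
    return len(clades(tree_a) ^ clades(tree_b))


# Two toy trees over the same six made-up taxa; "A", "B", "C" stand in for genotypes.
tree_a: Tree = ((("A1", "A2"), ("B1", "B2")), ("C1", "C2"))
tree_b: Tree = ((("A1", "B1"), ("A2", "B2")), ("C1", "C2"))

if __name__ == "__main__":
    print(clade_distance(tree_a, tree_b))
    print(is_monophyletic(tree_a, ["A1", "A2"]))
    print(is_monophyletic(tree_b, ["A1", "A2"]))
```

In the toy data, genotype "A" forms a clade in the first tree but is split in the second, which is the kind of disagreement a monophyly check is meant to flag. A real analysis would more likely work on unrooted bipartitions and collapse poorly supported branches before comparing trees.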

Created:

2019-01-30