Data from: EPA-ng: massively parallel evolutionary placement of genetic sequences
Source: DataONE. Updated: 2018-08-22. Indexed: 2024-06-08.
Download link:
https://search.dataone.org/view/null
Description:
Next Generation Sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. Phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the Evolutionary Placement Algorithm (EPA) included in RAxML, or pplacer, are increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Here we present EPA-ng, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both RAxML-EPA and pplacer. EPA-ng can be executed on standard shared memory systems as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-ng, we placed 1 billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3,748 taxa in just under 7 hours, using 2,048 cores. Our performance assessment shows that EPA-ng outperforms RAxML-EPA and pplacer by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-ng scales well up to 2,048 cores. EPA-ng is available under the AGPLv3 license: https://github.com/Pbdas/epa-ng
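To put the scalability figure in perspective, the reported run (1 billion reads, 2,048 cores, just under 7 hours) can be converted into an approximate throughput. The sketch below is plain arithmetic on the numbers quoted in the description; the function name `placement_throughput` is hypothetical and not part of EPA-ng, and rounding the runtime up to a full 7 hours is an assumption, so the results are lower bounds rather than measured values.

```python
# Back-of-the-envelope throughput from the figures quoted in the description.
# Assumption: the runtime is taken as a full 7 hours ("just under 7 hours"),
# so both results are lower bounds on the actual throughput.

def placement_throughput(reads, hours, cores):
    """Return (reads per second overall, reads per second per core)."""
    seconds = hours * 3600
    per_second = reads / seconds
    per_core = per_second / cores
    return per_second, per_core


per_second, per_core = placement_throughput(
    reads=1_000_000_000,  # 1 billion metagenetic reads
    hours=7.0,            # upper bound on the reported runtime
    cores=2048,           # cores used in the distributed run
)

print(f"at least {per_second:,.0f} reads/s overall")
print(f"at least {per_core:.1f} reads/s per core")
```

Under these assumptions the run corresponds to roughly 39,700 reads per second overall, or about 19 reads per second per core.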
Created:
2018-08-22



