
Supporting data for "VC@Scale: Scalable and High Performance Variant Calling on Cluster Environments"

Source: DataCite Commons. Updated 2025-05-26; indexed 2025-04-15.
Download link:
http://gigadb.org/dataset/100912
Description:
In the past couple of years, many deep-learning-based variant calling methods, such as DeepVariant, have emerged that are more accurate than conventional variant calling algorithms such as GATK HaplotypeCaller, Strelka2, and FreeBayes, albeit at higher computational cost. There is therefore a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scale variant calling workflows that use Apache Spark/Hadoop as big data frameworks only loosely integrate existing single-node pre-processing and variant calling applications. Using Apache Spark merely to distribute and schedule data among loosely coupled applications, or using I/O-based storage for intermediate application output, does not exploit the full benefit of Apache Spark's in-memory processing. To exploit that benefit, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between workflow stages. This approach benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations.

Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinate and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by more than 2x for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU+GPU clusters.

We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant calling analysis on HPC clusters using the standardized Apache Arrow data representations. All code, scripts, and configurations used to run our workflow are open source and publicly available at VC@Scale.
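The two pre-processing steps named in the description, sorting reads by coordinate and marking duplicates, can be illustrated with a minimal plain-Python sketch. This is not the authors' Spark/Arrow implementation: the `Read` record, its `base_qual_sum` field, and the rule "keep the highest-scoring read per (chromosome, position, strand)" are simplifying assumptions chosen only to show the shape of the computation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Read:
    """Hypothetical, heavily simplified aligned read (not a real SAM record)."""
    name: str
    chrom: str
    pos: int            # leftmost alignment position
    strand: str         # "+" or "-"
    base_qual_sum: int  # assumed score used to pick the read to keep
    is_duplicate: bool = False


def sort_by_coordinate(reads: List[Read]) -> List[Read]:
    # Sorts by (chromosome name, position). Chromosome names are compared
    # as strings here; real tools follow the reference sequence order.
    return sorted(reads, key=lambda r: (r.chrom, r.pos))


def mark_duplicates(reads: List[Read]) -> List[Read]:
    # Reads sharing chromosome, position, and strand are treated as
    # duplicates of each other; the highest-scoring one stays unmarked.
    # On a tie, the read seen first is the one kept.
    best: Dict[Tuple[str, int, str], Read] = {}
    for r in reads:
        key = (r.chrom, r.pos, r.strand)
        if key not in best or r.base_qual_sum > best[key].base_qual_sum:
            best[key] = r
    for r in reads:
        r.is_duplicate = best[(r.chrom, r.pos, r.strand)] is not r
    return reads


if __name__ == "__main__":
    reads = [
        Read("a", "chr1", 200, "+", 30),
        Read("b", "chr1", 100, "+", 40),
        Read("c", "chr1", 100, "+", 50),
        Read("d", "chr1", 100, "-", 10),
    ]
    for r in mark_duplicates(sort_by_coordinate(reads)):
        print(r.name, r.chrom, r.pos, r.strand, r.is_duplicate)
```

In a distributed setting both steps are expressed as a sort and a grouped aggregation over partitioned data, which is presumably why the description can rely on Spark built-in functions for them.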
Provider:
GigaScience Database
Created:
2021-07-26