Implementing a neural network interatomic model with performance portability for emerging exascale architectures
Source: doi.org | Indexed: 2025-03-27
Download link:
http://doi.org/10.17632/x948kyy7jh.1
Resource description:
The two main thrusts of computational science are increasingly accurate predictions and faster calculations; to this end, the zeitgeist in molecular dynamics (MD) simulations is pursuing machine-learned and data-driven interatomic models, e.g., neural network potentials, and novel hardware architectures, e.g., GPUs. Current implementations of neural network potentials are orders of magnitude slower than traditional interatomic models. While looming exascale computing offers the ability to run large, accurate simulations with these models, achieving portable performance for MD on new and varied exascale hardware requires rethinking traditional algorithms and using novel data structures and library solutions. We re-implement a neural network interatomic model in CabanaMD, an MD proxy application built on libraries developed for performance portability. Our implementation shows significantly improved thread scaling in this complex kernel compared to a current LAMMPS implementation, across both strong and weak scaling. Our single-source solution enables simulations of up to 20 million atoms on a single CPU node and 4 million atoms, with improved performance, on a single GPU. We also explore parallelism and data layout choices (using flexible data structures called AoSoAs) and their effect on performance, seeing up to ∼50% and ∼5% improvements in performance on a GPU by choosing the right level of parallelism and data layout, respectively.
Provided by:
Mendeley Data



