PittPack: An open-source Poisson’s equation solver for extreme-scale computing with accelerators

Source: doi.org (indexed 2025-03-22)
Download link:
http://doi.org/10.17632/59ktwdby4r.1
Description:
We present a parallel implementation of a direct solver for Poisson’s equation on extreme-scale supercomputers with accelerators. We introduce a chunked-pencil decomposition as the domain-decomposition strategy to distribute work among processing elements and to improve scalability at high accelerator counts. Chunked-pencil decomposition enables overlapping MPI communication with data transfer between the central processing units (CPUs) and the graphics processing units (GPUs). It also enables contiguous message transfer among the nodes and improves data locality by keeping neighboring elements in adjacent memory locations, while permitting the use of shared memory for certain segments of the algorithm when possible. We study two different communication patterns within the chunked-pencil decomposition. The first pattern fully overlaps the communication with data transfer and aims to reduce the overall turnaround time. The second pattern concentrates on low memory usage and is more network friendly than the first pattern for computations at extreme scale. In our parallel implementation, we interleave OpenACC with MPI to support computations on the GPU or the CPU. The numerical solution and its formal second order of accuracy are verified using the method of manufactured solutions for various combinations of boundary conditions. Additionally, we used PittPack within an incompressible flow solver to further validate its accuracy and to demonstrate its versatility as a software package. We performed weak scaling analysis with up to 1.1 trillion Cartesian mesh points distributed over 16384 GPUs on a petascale leadership class supercomputer.
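The verification step mentioned in the description, the method of manufactured solutions, can be illustrated with a minimal sketch. This is not PittPack code: it is a one-dimensional toy problem that assumes a standard three-point central difference and homogeneous Dirichlet boundaries, and the names `solve_poisson_1d`, `max_error`, and `observed_order` are hypothetical. A solution u(x) = sin(πx) is chosen in advance, the matching right-hand side f = u'' is derived from it, and the error is measured on two grids to estimate the observed order of accuracy.

```python
import math


def solve_poisson_1d(n):
    """Solve u'' = f on (0, 1) with u(0) = u(1) = 0 on n interior points.

    Assumption: three-point central difference, solved directly with the
    Thomas algorithm. This is an illustrative stand-in, not PittPack's solver.
    """
    h = 1.0 / (n + 1)
    x = [(i + 1) * h for i in range(n)]
    # Manufactured solution u(x) = sin(pi x) implies f(x) = -pi^2 sin(pi x).
    # The system is u[i-1] - 2 u[i] + u[i+1] = h^2 f[i].
    rhs = [-(math.pi ** 2) * math.sin(math.pi * xi) * h * h for xi in x]

    # Forward sweep of the Thomas algorithm for the (1, -2, 1) stencil.
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = 1.0 / -2.0
    d_prime[0] = rhs[0] / -2.0
    for i in range(1, n):
        denom = -2.0 - c_prime[i - 1]
        c_prime[i] = 1.0 / denom
        d_prime[i] = (rhs[i] - d_prime[i - 1]) / denom

    # Back substitution.
    u = [0.0] * n
    u[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]
    return x, u


def max_error(n):
    """Maximum-norm error against the manufactured solution sin(pi x)."""
    x, u = solve_poisson_1d(n)
    return max(abs(ui - math.sin(math.pi * xi)) for xi, ui in zip(x, u))


def observed_order(n):
    """Estimate the convergence order by halving the grid spacing.

    Going from n to 2n + 1 interior points halves h = 1 / (n + 1).
    """
    return math.log2(max_error(n) / max_error(2 * n + 1))


if __name__ == "__main__":
    print(f"observed order of accuracy: {observed_order(31):.3f}")
```

For a second-order scheme, halving the grid spacing should reduce the error by roughly a factor of four, so the printed estimate should be close to 2. The same procedure extends to other boundary-condition combinations by choosing a manufactured solution that satisfies them.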

Provider:
Mendeley Data