PaScaL_TDMA 2.1: A register-resident multi-GPU tridiagonal matrix solver with optimized communication for large-scale CFD simulations

Name: PaScaL_TDMA 2.1: A register-resident multi-GPU tridiagonal matrix solver with optimized communication for large-scale CFD simulations
Creator: Mendeley Data
Published: 2026-03-20 05:31:29
License: No description available

DataCite Commons | Updated: 2026-03-20 | Indexed: 2026-05-04

Download link:

https://data.mendeley.com/datasets/49z6fh94z3

Description:

We present PaScaL_TDMA 2.1, a GPU-oriented release of the PaScaL_TDMA library [3] for efficiently solving large batches of distributed tridiagonal systems on modern multi-GPU platforms. Building on the original CPU-based PaScaL_TDMA formulation and the shared-memory buffering strategy introduced in PaScaL_TDMA 2.0 [2], version 2.1 reformulates the core kernels and the communication path to better match the GPU execution model. CUDA threads are mapped to contiguous tridiagonal lines to achieve coalesced global-memory access, and the elimination kernels are made fully register-resident to reduce memory traffic and synchronization. To lower inter-GPU overhead, the reduced system is assembled with a single consolidated MPI_Alltoall exchange, and the kernel interface is restructured to eliminate descriptor transfers at launch. Benchmarks on the NURION system show that PaScaL_TDMA 2.1 reduces wall time from 0.127 s on dual-socket Intel Skylake CPUs to 9.2 ms on an NVIDIA A100 and 6.1 ms on an H100, corresponding to speedups of 14.0× and 20.7×, respectively. Strong- and weak-scaling studies quantify the performance gains from each optimization stage and demonstrate sustained scalability on multi-GPU systems. Finally, PaScaL_TDMA 2.1 is integrated into an immersed-boundary LES solver and validated through large-scale CFD simulations, including an industrial-scale cleanroom configuration with up to 128 A100 GPUs and O(10^10) degrees of freedom.
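
The thread-to-line mapping described above can be illustrated with a minimal sketch. This is not the PaScaL_TDMA kernel or its API: it is a plain serial Thomas algorithm in Python, and the function name `solve_batched_tridiagonal` and the interleaved index `i * nsys + s` are assumptions made for illustration. Each value of `s` stands in for one GPU thread that owns one tridiagonal line. With the system index varying fastest in memory, neighbouring "threads" touch neighbouring addresses at every row `i`, which is the access pattern that coalesces on a GPU. The running values `c_prev`, `d_prev`, and `x_next` are kept in local scalars, loosely analogous to holding the recurrence state in registers.

```python
def solve_batched_tridiagonal(a, b, c, d, n, nsys):
    """Solve nsys independent tridiagonal systems of size n (Thomas algorithm).

    All arrays are flat lists in an interleaved layout (an assumption of this
    sketch): row i of system s is stored at index i * nsys + s.
    a: sub-diagonal (row 0 unused), b: diagonal,
    c: super-diagonal (row n-1 unused), d: right-hand side.
    Returns the solution as a new flat list in the same layout.
    """
    cp = [0.0] * (n * nsys)  # modified super-diagonal, needed for back substitution
    x = [0.0] * (n * nsys)   # holds the modified RHS first, then the solution

    for s in range(nsys):  # each s plays the role of one GPU thread
        # Forward elimination, carrying the recurrence in local scalars.
        c_prev = c[s] / b[s]
        d_prev = d[s] / b[s]
        cp[s] = c_prev
        x[s] = d_prev
        for i in range(1, n):
            k = i * nsys + s  # consecutive s -> consecutive addresses
            denom = b[k] - a[k] * c_prev
            c_prev = c[k] / denom
            d_prev = (d[k] - a[k] * d_prev) / denom
            cp[k] = c_prev
            x[k] = d_prev

        # Back substitution from the last row upward.
        x_next = x[(n - 1) * nsys + s]
        for i in range(n - 2, -1, -1):
            k = i * nsys + s
            x_next = x[k] - cp[k] * x_next
            x[k] = x_next
    return x
```

The sketch does no pivoting, so it assumes well-conditioned (for example, diagonally dominant) systems, and it does not cover the distributed case in which each line is split across ranks and a reduced system is exchanged between them.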

Provider:

Mendeley Data

Created:

2020-12-12
