
PaScaL_TDMA 2.1: A register-resident multi-GPU tridiagonal matrix solver with optimized communication for large-scale CFD simulations

Source: DataCite Commons. Updated: 2026-03-20. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/49z6fh94z3
Description:
We present PaScaL_TDMA 2.1, a GPU-oriented release of the PaScaL_TDMA library [3] for efficiently solving large batches of distributed tridiagonal systems on modern multi-GPU platforms. Building on the original CPU-based PaScaL_TDMA formulation and the shared-memory buffering strategy introduced in PaScaL_TDMA 2.0 [2], version 2.1 reformulates the core kernels and communication path to better match the GPU execution model. CUDA threads are mapped to contiguous tridiagonal lines to achieve coalesced global-memory access, and the elimination kernels are optimized to a fully register-resident implementation to reduce memory traffic and synchronization. To lower inter-GPU overhead, the reduced-system assembly is performed via a single consolidated MPI_Alltoall exchange, and the kernel interface is restructured to eliminate descriptor transfers at launch. Benchmarks on the NURION system show that PaScaL_TDMA 2.1 reduces wall time from 0.127 s on dual-socket Intel Skylake CPUs to 9.2 ms on an NVIDIA A100 and 6.1 ms on an H100, corresponding to speedups of 14.0× and 20.7×, respectively. Strong- and weak-scaling studies quantify the performance gains from the optimization stages and demonstrate sustained scalability on multi-GPU systems. Finally, PaScaL_TDMA 2.1 is integrated into an immersed-boundary LES solver and validated through large-scale CFD simulations, including an industrial-scale cleanroom configuration with up to 128 A100 GPUs and O(10^10) degrees of freedom.
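For readers unfamiliar with the underlying method, the following is a minimal sketch of the classical Thomas algorithm (TDMA) applied to a batch of independent tridiagonal systems. It is not the library's code: the function names `thomas_solve` and `solve_batch` and the list-based data layout are assumptions made for illustration, and the sketch is serial, single-process Python with no GPU kernels, register-resident optimization, or MPI exchange. The per-system loop in `solve_batch` only hints at the unit of work that the description above assigns to one CUDA thread.

```python
def thomas_solve(a, b, c, d):
    """Solve one tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    Assumes no pivoting is needed (e.g. a diagonally dominant matrix).
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution: recover the unknowns from the last row upward.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_batch(systems):
    """Solve many independent systems, one (a, b, c, d) tuple each.

    Hypothetical helper: each iteration is independent of the others,
    which is what makes a one-line-per-thread mapping possible on a GPU.
    """
    return [thomas_solve(a, b, c, d) for (a, b, c, d) in systems]


if __name__ == "__main__":
    # 4x4 system with diagonal 2 and off-diagonals -1.
    a = [0.0, -1.0, -1.0, -1.0]
    b = [2.0, 2.0, 2.0, 2.0]
    c = [-1.0, -1.0, -1.0, 0.0]
    d = [0.0, 0.0, 0.0, 5.0]
    print(thomas_solve(a, b, c, d))
```

In a distributed setting each process would hold only a slice of every line, so a reduced system coupling the slice boundaries has to be assembled and solved across processes; that step, which the description says is handled by a single MPI_Alltoall exchange, is outside the scope of this sketch.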
Provider:
Mendeley Data
Created:
2020-12-12