Metascalable hybrid message-passing and multithreading algorithms for n-tuple computation
Source: Mendeley Data | Updated: 2024-01-31 | Indexed: 2024-06-29
Download link:
https://digitallibrary.usc.edu/asset-management/2A3BF163PT56
Resource description:
The emergence of the multicore era has granted unprecedented computing capabilities. Widely available multicore clusters have made hybrid message-passing and multithreading parallel algorithms a standard parallelization approach for modern clusters. However, hybrid parallel applications with portable scalability on emerging high-end multicore clusters comprising millions of cores are yet to be accomplished. Achieving scalability on emerging multicore platforms is an enormous challenge, since we do not even know the architecture of future platforms, with new hardware features such as hardware transactional memory (HTM) constantly being deployed. Scalable implementation of molecular dynamics (MD) simulations on massively parallel computers has been one of the major driving forces of supercomputing technologies. In particular, recent advances in reactive MD simulations based on many-body interatomic potentials have necessitated efficient dynamic n-tuple computation. Hence, it is of great significance now to develop scalable hybrid n-tuple computation algorithms to provide a viable foundation for high-performance parallel-computing software on forthcoming architectures. ❧ This dissertation research develops a scalable hybrid message-passing and multithreading algorithm for n-tuple MD simulation, which will continue to scale on future architectures (i.e., achieving metascalability). The two major goals of this dissertation research are: (1) design a scalable hybrid message-passing and multithreading parallel algorithmic framework on multicore architectures and evaluate it on the most advanced parallel architectures; and (2) develop a computation-pattern algebraic framework to design scalable algorithms for general n-tuple computation and prove their optimality in a systematic and mathematically rigorous manner. 
❧ To achieve the first goal, we have developed and thoroughly analyzed algorithms for hybrid Message Passing Interface (MPI) + Open Multi-Processing (OpenMP) parallelization of n-tuple MD simulation, which are scalable on large multicore clusters. Two data-privatization thread scheduling algorithms via nucleation-growth allocation have been designed: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). These two algorithms combine fine-grain dynamic load balancing and minimal memory-footprint threading. Theoretical analysis has revealed good asymptotic memory efficiency for both algorithms, which reduce memory consumption by 75% compared to a naïve-threading algorithm. Furthermore, performance benchmarks have confirmed higher performance of the hybrid MD algorithm over a traditional algorithm on large multicore clusters, with a 2.58-fold speedup of the hybrid algorithm over the traditional algorithm observed on 32,768 nodes of IBM BlueGene/P. ❧ We have also investigated the performance characteristics of HTM on the IBM BlueGene/Q computer in comparison with conventional concurrency control mechanisms, using an MD application as an example. Benchmark tests, along with overhead-cost and scalability analysis, have quantified relative performance advantages of HTM over other mechanisms. We found that the bookkeeping cost of HTM is high but that the rollback cost is low. We have proposed transaction fusion and spatially compact scheduling techniques to reduce the overhead of HTM with minimal programming. A strong-scalability benchmark has shown that the fused HTM has the shortest runtime among various concurrency control mechanisms without extra memory. Based on the performance characterization, we have derived a decision tree in the concurrency-control design space for general multithreading applications. 
❧ To achieve the second goal, we have developed a computation-pattern algebraic framework to mathematically formulate general n-tuple computation. Based on translation/reflection-invariant properties of computation patterns within this framework, we have designed a shift-collapse (SC) algorithm for cell-based parallel MD. Theoretical analysis has quantified the compact n-tuple search space and small communication cost of SC-MD for arbitrary n, which reduce to those of the best pair-computation approaches (e.g., the eighth-shell method) for n = 2. Benchmark tests have shown that SC-MD outperforms our production MD code at the finest grain, with 9.7- and 5.1-fold speedups on Intel-Xeon and BlueGene/Q clusters, respectively. SC-MD has also exhibited excellent strong scalability. ❧ In addition, we have analyzed the computational and data-access patterns of MD, which led to the development of a performance prediction model for short-range pair-wise force computations in MD simulations. The analysis and performance model provide fundamental understanding of computation patterns and optimality of certain parameters in MD simulations, thus allowing scientists to determine the optimal cell dimension in a linked-list cell method. The model has accurately estimated the number of operations during the simulations, with a maximum error of 10.6% compared to actual measurements. Analysis and benchmarking of the model have revealed that the optimal cell dimension minimizing the computation time is determined by a trade-off between decreasing search space and increasing linked-list cell access for smaller cells. ❧ One difficulty with MD is that it is a dynamic, irregular application, which often suffers considerable performance deterioration during execution. To address this problem, an optimal data-reordering schedule has been developed for runtime memory-access optimization of MD simulations on parallel computers. 
Analysis of the memory-access penalty during MD simulations has shown that the performance improvement from computation and data reordering degrades gradually as data translation lookaside buffer misses increase. We have also found correlations of the performance degradation with physical properties such as the simulated temperature, as well as with computational parameters such as the spatial-decomposition granularity. Based on a performance model and pre-profiling of data fragmentation behaviors, we have developed an optimal runtime data-reordering schedule, thereby achieving speedups of 1.35, 1.36, and 1.28, respectively, for MD simulations of silica at temperatures of 300 K, 3,000 K, and 6,000 K. ❧ The main contributions of this dissertation research are two-fold: a metascalable hybrid message-passing and multithreading parallel algorithmic framework for emerging multicore parallel clusters, and a novel computation-pattern algebraic framework to design scalable algorithms for general n-tuple computation and prove their optimality in a mathematically rigorous manner. We expect that the proposed hybrid algorithms and mathematical approaches will provide a generic framework for a broad range of applications on future extreme-scale computing platforms.
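
The cell-dimension trade-off described in the abstract can be illustrated with a minimal serial sketch of a generic cell-based pair search (n = 2) in a cubic periodic box. This is not the dissertation's SC-MD algorithm or its performance model. All names here (`build_cells`, `pairs_cell_list`, `pairs_brute_force`) and all parameters (box length, cutoff, particle count) are hypothetical choices made for illustration, and a Python dictionary stands in for the linked-list cell structure. The sketch counts distance evaluations as a rough proxy for the search space. It assumes the cutoff is at most half the box length so that the minimum-image convention is valid.

```python
import itertools
import math
import random


def build_cells(positions, box, n_cells):
    """Bin particle indices into an n_cells^3 grid (hypothetical helper)."""
    cell_size = box / n_cells
    cells = {}
    for i, pos in enumerate(positions):
        # Clamp guards against a coordinate rounding up to the last cell edge.
        key = tuple(min(int(c / cell_size), n_cells - 1) for c in pos)
        cells.setdefault(key, []).append(i)
    return cells


def min_image_dist2(a, b, box):
    """Squared distance under the minimum-image convention."""
    d2 = 0.0
    for u, v in zip(a, b):
        d = u - v
        d -= box * round(d / box)
        d2 += d * d
    return d2


def pairs_cell_list(positions, box, cutoff, n_cells):
    """Return (pairs within cutoff, number of distance evaluations)."""
    cell_size = box / n_cells
    # Layers of neighbor cells needed to cover the cutoff sphere.
    reach = math.ceil(cutoff / cell_size)
    cells = build_cells(positions, box, n_cells)
    offsets = range(-reach, reach + 1)
    pairs, checked = set(), 0
    for (cx, cy, cz), members in cells.items():
        # A set removes duplicate cells produced by periodic wrap-around.
        neighbor_keys = {
            ((cx + dx) % n_cells, (cy + dy) % n_cells, (cz + dz) % n_cells)
            for dx, dy, dz in itertools.product(offsets, repeat=3)
        }
        for key in neighbor_keys:
            for i in members:
                for j in cells.get(key, ()):
                    if i < j:  # visit each unordered pair once
                        checked += 1
                        if min_image_dist2(positions[i], positions[j], box) < cutoff ** 2:
                            pairs.add((i, j))
    return pairs, checked


def pairs_brute_force(positions, box, cutoff):
    """Reference O(N^2) search used only to check the cell-based result."""
    n = len(positions)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if min_image_dist2(positions[i], positions[j], box) < cutoff ** 2
    }


if __name__ == "__main__":
    rng = random.Random(0)
    box, cutoff = 10.0, 2.5  # illustrative values, not from the source
    pts = [tuple(rng.random() * box for _ in range(3)) for _ in range(200)]
    for n_cells in (1, 2, 4, 8):
        found, checked = pairs_cell_list(pts, box, cutoff, n_cells)
        print(n_cells, len(found), checked, len(cells := build_cells(pts, box, n_cells)))
```

In this sketch, smaller cells approximate the cutoff sphere more tightly and so reduce the number of distance evaluations, but the number of cells that must be visited grows, which is the kind of trade-off the abstract attributes to the optimal cell dimension.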
Created:
2024-01-31



