HFlow: efficiently manage high-throughput applications on HPC systems
Source: 中国科学数据 (China Scientific Data) · Updated 2026-02-05 · Indexed 2026-04-25
Download link:
https://www.sciengine.com/AA/doi/10.1360/SSI-2025-0222
Resource description:
High-throughput computing (HTC) typically executes a vast number of small-scale, short-duration, and mutually independent computational tasks. Although high-performance computing (HPC) systems possess abundant computational resources, mainstream resource management systems and existing HTC-oriented solutions exhibit significant deficiencies in throughput, application compatibility, and fault tolerance, resulting in inefficient resource management for HTC applications on HPC systems. To address this challenge, this paper proposes HFlow, a resource management solution that integrates centralized and distributed resource management architectures. HFlow achieves high application compatibility through a hybrid job management mechanism, and improves both throughput and fault tolerance via a fine-grained task partitioning algorithm coupled with a multi-level fault tolerance framework. Experimental evaluations on the Tianhe-2A supercomputer demonstrate that HFlow maintains HPC application management efficiency while also supporting HTC resource management requirements. Specifically, HFlow delivers task throughput 2.1× to 108.3× higher than mainstream resource management systems and dedicated HTC solutions, alongside robust multi-level fault tolerance capabilities.
Created:
2025-11-18



