Running Kubernetes Workloads on Rootless HPC Systems using Slurm
GRO.data | Updated: 2024-01-01 | Indexed: 2026-04-17
Download link:
https://data.goettingen-research-online.de/citation?persistentId=doi:10.25625/GDFCFP
Description:
Kubernetes, the leading container orchestration platform in today’s market, enables the deployment, scaling, and management of containerized applications on cloud infrastructure. In High-Performance Computing (HPC), by contrast, users work with multi-node compute systems built from powerful bare-metal servers. To submit and manage workloads, these systems usually run workload managers such as Slurm. Combining the two worlds makes it possible to take advantage of both Slurm and Kubernetes, as well as HPC and cloud features, to create usable, high-performance systems. Because Kubernetes and Slurm follow different operating paradigms, their integration poses several challenges. One important factor is root privileges, which are generally not available for operational use on HPC machines. This limits the available system features and, consequently, the software that can be used, including integration approaches for Slurm and Kubernetes. Additional challenges arise with respect to compute, storage, and network performance, as well as the usability and maintainability of such approaches. Existing approaches such as Bridge-Operator by IBM, WLM-operator by Sylabs Inc., kube-slurm by Kalen Peterson, slurm-k8s-bridge by SchedMD, and HPK by the Foundation for Research and Technology Hellas explore solutions to these challenges. However, no solution supports all of the core Kubernetes features for use on a Slurm cluster. To address this gap, this thesis proposes a new approach called Kind Slurm Integration (KSI), which creates a temporary Kubernetes cluster in a rootless Podman container for workload execution. To compare the approaches and assess them for use cases involving several functional requirements, quality requirements, and constraints, we provide a systematic evaluation of all relevant integration approaches. The main results are significant differences in startup time, storage performance, and network performance.
Our findings show that each project excels in certain use cases. We conclude that there is no definitive solution yet, so more research on benchmarking, optimization, and integration is needed.
Created:
2024-01-01



