Generic and ML Workloads in an HPC Datacenter

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://zenodo.org/record/11028933

Description:

Updated version of the previous upload; it corrects node timestamps that lagged behind at the beginning of data collection.

This archive contains hardware and workload traces from SURF Lisa, a Dutch datacenter of 338 nodes used by universities and researchers for a variety of jobs. Around 85% of the nodes are equipped only with CPUs and handle generic compute-heavy workloads; the other 15% also carry GPUs, which serve as accelerators for Machine Learning (ML) jobs. Individual node hardware configurations are listed in `node_hardware_info.parquet`.

Jobs on Lisa are submitted through the SLURM scheduler, from which we logged job start and end time, resource allocation, and exit state for roughly 10 months (December 2021 to November 2022). This data is saved in `slurm_table_cleaned.parquet`. Additionally, we provide detailed Prometheus monitoring logs from all nodes over a timespan of 5 months (June 2022 to November 2022) in `prom_table_cleaned.parquet`. These logs contain over 90 attributes, including CPU/GPU power and temperature, network I/O, and memory and storage usage. The metrics are sampled at 30 s intervals, resulting in almost 130 million records across all nodes. Finally, job and node data are provided as a joined dataset in `prom_slurm_joined.parquet` for the 4 months in which the two traces overlap. This combined data can provide more insight into the resource consumption and performance patterns of jobs.

We conducted a detailed analysis of this data, looking specifically at the different characteristics of generic vs. ML workloads in a heterogeneous HPC environment. The pre-print of our analysis can be found on arXiv. The code used for evaluation can be found on GitHub.

| Dataset Name | Explanation |
| --- | --- |
| `slurm_table_cleaned.parquet` | Job data collected by SLURM |
| `prom_table_cleaned.parquet` | Node data collected by Prometheus |
| `prom_slurm_joined.parquet` | Joined job and node dataset |
| `node_hardware_info.parquet` | Hardware configuration of each node |
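As a starting point for exploring the archive, the sketch below inspects each of the four files without loading it into memory, which matters for a Prometheus table of this size. It is a minimal sketch under stated assumptions: the files are assumed to have been downloaded and unpacked into one local directory, the short labels (`jobs`, `nodes`, `joined`, `hardware`) are hypothetical names chosen for illustration, and `pyarrow` is assumed to be installed. No column names are assumed, since the description does not list them.

```python
from pathlib import Path

# File names exactly as listed in the dataset description.
# The dictionary keys are illustrative labels, not part of the dataset.
DATASET_FILES = {
    "jobs": "slurm_table_cleaned.parquet",
    "nodes": "prom_table_cleaned.parquet",
    "joined": "prom_slurm_joined.parquet",
    "hardware": "node_hardware_info.parquet",
}


def dataset_paths(root):
    """Map each label to the expected path of its file under `root`."""
    root = Path(root)
    return {label: root / name for label, name in DATASET_FILES.items()}


def summarize(root):
    """Print row count and column names for every file that is present.

    Only the Parquet footer is read, so no table data is loaded.
    pyarrow is imported here so that dataset_paths() works without it.
    """
    import pyarrow.parquet as pq

    for label, path in dataset_paths(root).items():
        if not path.exists():
            print(f"{label}: {path} not found, skipping")
            continue
        parquet_file = pq.ParquetFile(path)
        n_rows = parquet_file.metadata.num_rows
        columns = parquet_file.schema_arrow.names
        print(f"{label}: {n_rows} rows, {len(columns)} columns")
        print(f"  columns: {columns}")
```

Calling `summarize("lisa_data")`, where `lisa_data` stands for whatever directory holds the files, gives a quick view of the schema before deciding which columns to load for analysis.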

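The record count quoted in the description can be cross-checked with a little arithmetic. The sketch below uses only figures stated above (338 nodes, 30 s sampling, almost 130 million records) and assumes, purely for the calculation, that every node reports one record per sampling interval. The function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_node_per_day(interval_s):
    """Number of samples one node produces per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_s


def implied_full_coverage_days(total_records, nodes, interval_s):
    """Days of gap-free sampling on all nodes that would yield total_records."""
    return total_records / (nodes * samples_per_node_per_day(interval_s))


per_day = samples_per_node_per_day(30)  # 2880 samples per node per day
days = implied_full_coverage_days(130_000_000, 338, 30)
print(f"{per_day} samples per node per day")
print(f"about {days:.0f} days of full coverage implied")
```

Under that assumption the record count corresponds to somewhat less than five full calendar months of uninterrupted sampling on every node, which is in line with the stated 5-month window if some nodes have gaps. This reading is an inference from the published figures, not a statement by the dataset authors.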

Created:

2024-10-11
