Datasets and source code for a pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs

Name: Datasets and source code for a pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs
Creator: doi.org
License: No description available

Indexed from doi.org on 2025-01-22

Download link:

http://doi.org/10.17632/hgkv9cpnmn.2


Description:

This repository consists of two compressed files, whose contents are described below.

--- code.tar.gz ---
The source code that implements the pipeline, together with the code and scripts needed to retrieve the time series, create the plots, and run the experiments. More specifically:
+ prepare.py and main.py ⇨ The Python programs that implement the pipeline: the auxiliary and the main pipeline stages, respectively.
+ 'anomaly' and 'config' folders ⇨ Scripts and Python files containing the configuration and some basic functions used to retrieve the information needed to process the data, such as the actual resource time series from OpenTSDB or the job metadata from Slurm.
+ 'functions' folder ⇨ Several folders with the Python programs that implement all the stages of the pipeline, both for the Machine Learning processing (e.g., extractors, aggregators, models) and for the technical side of the pipeline (e.g., pipelines, transformer).
+ plotDF.py ⇨ A Python program used to create the different plots presented, from the resource time series to the evaluation plots.
+ several bash scripts ⇨ Used to run the experiments with a specific configuration, covering both which transformers are chosen and how they are parametrized, and more technical aspects of how the pipeline is executed.

--- data.tar.gz ---
The actual data and results, organized as follows:
+ jobs ⇨ The resource time series plots of all the jobs for all the experiments, with one folder per experiment. Inside each folder the jobs are separated by id, and each job contains the plots for the different system resources (e.g., User CPU, Cached memory).
+ plots ⇨ The prediction plots for all the experiments, in separate folders, mainly used for evaluation purposes (e.g., scatter plots, heatmaps, Andrews curves, dendrograms). These plots are available for all the predictors resulting from the pipeline execution. In addition, for each predictor it is also possible to visualize the resource time series grouped by cluster. Finally, the projections generated by the dimension reduction models, and the outliers detected, are also available for each experiment.
+ datasets ⇨ The datasets used for the experiments, which include the lists of job IDs to be processed (CSV files), the results of each stage of the pipeline (e.g., features, predictions), and the output text files generated by several pipeline stages. Among the latter, the evaluation files are worth noting, since they include all the prediction scores.
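To make the stages named above (extractors, aggregators, models) more concrete, the following is a minimal sketch of feature-based clustering on multivariate time series. It is not the code shipped in code.tar.gz: the function names, the toy job data, the choice of mean and standard deviation as features, and the small k-means routine are all assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def extract_features(job):
    """Summarise each resource series (extractor) and flatten the
    summaries into one feature vector per job (aggregator)."""
    features = []
    for resource in sorted(job):  # fixed order keeps vectors comparable
        series = job[resource]
        features.extend([mean(series), pstdev(series)])
    return features


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k vectors so the
    result is deterministic (an illustrative choice, not a recommendation)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels


# Hypothetical toy jobs: two resource series per job (values are invented).
jobs = {
    "job_a": {"user_cpu": [90, 95, 92, 94], "cached_memory": [10, 11, 10, 12]},
    "job_b": {"user_cpu": [5, 7, 6, 4], "cached_memory": [80, 82, 85, 81]},
    "job_c": {"user_cpu": [88, 91, 93, 90], "cached_memory": [12, 9, 11, 10]},
    "job_d": {"user_cpu": [6, 5, 8, 7], "cached_memory": [79, 84, 83, 80]},
}

job_ids = sorted(jobs)
vectors = [extract_features(jobs[j]) for j in job_ids]
labels = kmeans(vectors, k=2)
clusters = dict(zip(job_ids, labels))
print(clusters)
```

In this toy setup the CPU-heavy jobs and the memory-heavy jobs end up in separate clusters. A real run would replace the invented series with the resource time series retrieved for each job ID, and would likely use richer features and an established clustering library.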


Provider: doi.org
