Datasets and source code for a pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs

Mendeley Data. Updated 2024-03-27. Indexed 2024-06-26.

Download link:

https://data.mendeley.com/datasets/hgkv9cpnmn

Description:

This repository consists of two compressed files, described below.

--- code.tar.gz ---
The source code that implements the pipeline, together with the code and scripts needed to retrieve the time series, create the plots, and run the experiments. More specifically:
+ prepare.py and main.py ⇨ The Python programs that implement the auxiliary and the main pipeline stages, respectively.
+ 'anomaly' and 'config' folders ⇨ Scripts and Python files containing the configuration and some basic functions used to retrieve the information needed to process the data, such as the actual resource time series from OpenTSDB or the job metadata from Slurm.
+ 'functions' folder ⇨ Several folders with the Python programs that implement all the stages of the pipeline, covering both the Machine Learning processing (e.g., extractors, aggregators, models) and the technical side of the pipeline (e.g., pipelines, transformer).
+ plotDF.py ⇨ A Python program used to create the different plots presented, from the resource time series to the evaluation plots.
+ several bash scripts ⇨ Used to run the experiments with a specific configuration, covering which transformers are chosen and how they are parametrized, as well as more technical aspects of how the pipeline is executed.

--- data.tar.gz ---
The actual data and results, organized as follows:
+ jobs ⇨ The resource time series plots of all the jobs for all the experiments, with one folder per experiment. Inside each folder the jobs are separated by ID, and each job contains the plots for the different system resources (e.g., User CPU, Cached memory).
+ plots ⇨ The prediction plots for all the experiments, in separate folders, mainly used for evaluation purposes (e.g., scatter plots, heatmaps, Andrews curves, dendrograms). These plots are available for all the predictors resulting from the pipeline execution. In addition, for each predictor the resource time series can be visualized grouped by cluster. Finally, the projections generated by the dimension reduction models and the outliers detected are also available for each experiment.
+ datasets ⇨ The datasets used for the experiments, which include the lists of job IDs to be processed (CSV files), the results of each stage of the pipeline (e.g., features, predictions), and the output text files generated by several pipeline stages. Among the latter, the evaluation files are worth noting, since they include all the prediction scores.
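As a quick way to check a download against the layout described above, the following minimal Python sketch lists the top-level entries of each archive without extracting it. Only the archive names (code.tar.gz, data.tar.gz) come from the description. The helper name `top_level_entries` is hypothetical, and the sketch assumes both files sit in the current working directory and that entries may or may not be nested under a common root folder.

```python
import tarfile
from pathlib import Path


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a .tar.gz archive."""
    names = set()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # pathlib drops a leading "./", so "./jobs/a.png" yields "jobs".
            parts = Path(member.name).parts
            if parts:
                names.add(parts[0])
    return sorted(names)


if __name__ == "__main__":
    # Archive names are taken from the description; their location is assumed.
    for archive_name in ("code.tar.gz", "data.tar.gz"):
        path = Path(archive_name)
        if path.exists():
            print(archive_name, top_level_entries(path))
        else:
            print(archive_name, "not found in the current directory")
```

If the archives wrap their contents in a single root folder, the listing will show only that folder, and the entries named in the description (e.g., jobs, plots, datasets) would sit one level below it.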

Created:

2024-01-23
