Datasets and source code for a pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://data.mendeley.com/datasets/hgkv9cpnmn
Description:
This repository consists of two compressed files, whose contents are described below.
--- code.tar.gz ---
The source code that implements the pipeline, as well as code and scripts needed to retrieve time series,
create the plots or run the experiments. More specifically:
+ prepare.py and main.py ⇨
The Python programs that implement the pipeline: the auxiliary stages and the main
pipeline stages, respectively.
+ 'anomaly' and 'config' folders ⇨
Scripts and Python files containing the configuration and some basic functions that are
used to retrieve the information needed to process the data, like the actual resource
time series from OpenTSDB, or the job metadata from Slurm.
+ 'functions' folder ⇨
Several folders with the Python programs that implement all the stages of the pipeline,
either for the Machine Learning processing (e.g., extractors, aggregators, models), or
the technical aspect of the pipeline (e.g., pipelines, transformer).
+ plotDF.py ⇨
A Python program used to create the different plots presented, from the resource time
series to the evaluation plots.
+ several bash scripts ⇨
Used to run the experiments with a specific configuration, covering both which
transformers are chosen and how they are parametrized, and more technical aspects
of how the pipeline is executed.
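The 'functions' folder above mentions extractors and aggregators that turn resource time series into features. The following is a minimal sketch of that general idea, not the repository's actual code: the extractor set, the `extract_features` helper, the metric names and the sample values are all hypothetical and chosen only for illustration.

```python
import statistics
from typing import Callable, Dict, List

# Hypothetical extractors: each maps one univariate series to a single number.
# The real pipeline's extractors are not documented here.
EXTRACTORS: Dict[str, Callable[[List[float]], float]] = {
    "mean": statistics.fmean,
    "std": statistics.pstdev,
    "max": max,
}


def extract_features(job_series: Dict[str, List[float]]) -> Dict[str, float]:
    """Flatten one job's multivariate time series into a fixed-length feature dict."""
    features: Dict[str, float] = {}
    # Sort metrics so every job yields its features in the same order.
    for metric, values in sorted(job_series.items()):
        for name, func in EXTRACTORS.items():
            features[f"{metric}__{name}"] = float(func(values))
    return features


# Made-up sample: two resource metrics sampled four times for a single job.
job = {
    "cpu_user": [10.0, 20.0, 30.0, 40.0],
    "mem_cached": [1.0, 1.0, 1.0, 1.0],
}
features = extract_features(job)
print(features)
```

Because every job is reduced to the same set of named features, jobs of different durations become comparable rows in a table, which is the form a clustering model typically expects.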
--- data.tar.gz ---
The actual data and results, organized as follows:
+ jobs ⇨
All the jobs' resource time series plots for all the experiments, with a folder used
for each experiment. Inside each folder all the jobs are separated according to their
id, containing the plots for the different system resources (e.g., User CPU, Cached memory).
+ plots ⇨
The prediction plots for all the experiments, in separate folders, mainly used for
evaluation purposes (e.g., scatter plot, heatmaps, Andrews curves, dendrograms). These
plots are available for all the predictors resulting from the pipeline execution. In
addition, for each predictor it is also possible to visualize the resource time series
grouped by clusters. Finally, the projections as generated by the dimension reduction
models, and the outliers detected, are also available for each experiment.
+ datasets ⇨
The datasets used for the experiments, which include the lists of job IDs to be processed
(CSV files) and the results of each stage of the pipeline (e.g., features, predictions),
and the output text files generated by several pipeline stages. Among the latter,
the evaluation files are worth noting, as they include all the prediction scores.
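To check which top-level folders an archive contains before extracting it, something like the sketch below could be used. The `top_level_entries` helper is a hypothetical name, and because the real internal layout of data.tar.gz has not been verified here, the example builds a tiny stand-in archive that only mirrors the three folder names listed above.

```python
import io
import os
import tarfile
import tempfile
from pathlib import PurePosixPath
from typing import List


def top_level_entries(archive_path: str) -> List[str]:
    """Return the sorted top-level names stored in a .tar.gz archive."""
    tops = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for name in tar.getnames():
            parts = PurePosixPath(name).parts
            if parts:  # skip a bare "." entry, which has no parts
                tops.add(parts[0])
    return sorted(tops)


# Stand-in archive (assumption): placeholder files under the three folder names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "data.tar.gz")
    with tarfile.open(demo_path, "w:gz") as tar:
        for member in ("jobs/readme.txt", "plots/readme.txt", "datasets/readme.txt"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name=member)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    demo_entries = top_level_entries(demo_path)

print(demo_entries)
```

With the downloaded files in place, the same helper could be pointed at the real data.tar.gz or code.tar.gz instead of the stand-in.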
Created:
2022-12-14



