Artifact for Taxonomist: Application Detection through Rich Monitoring Data

Source: Mendeley Data. Updated 2024-06-25; indexed 2024-06-30.

Download link:

https://springernature.figshare.com/articles/Artifact_for_Taxonomist_Application_Detection_through_Rich_Monitoring_Data/6384248/1


Description:

Code, documentation, data and Jupyter Notebook associated with the publication "Taxonomist: Application Detection Through Rich Monitoring Data" for the European Conference on Parallel Processing 2018 (Euro-Par 2018).

The related study develops a technique named 'Taxonomist' to identify applications running on supercomputers, using machine learning to classify known applications and detect unknown applications. The technique uses monitoring data such as CPU and memory usage metrics and hardware counters collected from supercomputers. Its aims include providing an alternative to 'naive' application detection methods based on the names of processes and scripts, and helping prevent fraud, waste and abuse in supercomputers.

Taxonomist uses supervised learning techniques to automatically select the most relevant features that lead to reliable application identification. The process involves the following steps:

1. Monitoring data is collected from every compute node in a time series format.
2. 11 statistical features are extracted over the time series (e.g. percentiles, minimum, maximum, mean), thus reducing storage and computation overhead.
3. A classifier is trained on a set of labeled applications using a 'one-versus-rest' scheme: for each application in the training set, a separate classifier is trained to differentiate that application.

The dataset consists of:

- README.pdf: user guide for the 'Taxonomist' artifact, covering installation and instructions for using the Jupyter notebook, and noting which code is omitted from the notebook compared with the process described in the Euro-Par 2018 paper.
- taxonomist.py: Python file including a basic version of the Taxonomist framework. The module contents can be imported for other projects.
- notebook.html: static HTML version of the notebook that can be viewed in a browser.
- notebook.ipynb: interactive Jupyter Notebook file; for operation see README.pdf.
- data.zip: compressed .zip file holding monitoring data collected from different applications executed on Volta:
  - metadata.csv: a CSV file listing each run, the IDs of the nodes on which each run executed, which application was executed with which inputs, the start and end times, and the duration of the applications.
  - timeseries.tar.bz2: a bzip2-compressed file containing the data collected. The uncompressed size is 16 GB; uncompressing it is not necessary for most of the notebook.
  - features.hdf: an HDF5 file containing the pre-calculated features. The calculation process is included in the notebook.
- requirements.txt: list of Python packages required.
- LICENSE: the licence under which this software is released.

Files are in openly accessible formats: Python (.py and .ipynb), .html, .pdf, .csv, .txt, .zip and Hierarchical Data Format (.hdf).

The experimental set-up for the experiments reported in the related publication uses Volta, a Cray XC30m supercomputer located at Sandia National Laboratories, and the open source monitoring tool Lightweight Distributed Metric System (LDMS).
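The feature-extraction step and the one-versus-rest decision described above can be sketched roughly as follows. This is an illustrative sketch using only the Python standard library, not the code shipped in taxonomist.py. The description names percentiles, minimum, maximum and mean only as examples, so the full list of 11 features used here (adding standard deviation, skew, kurtosis and five specific percentiles) is an assumption. The function names `percentile`, `extract_features` and `label_from_scores`, the feature keys, and the `threshold` parameter are all hypothetical. The rule of reporting "unknown" when no per-application classifier is confident enough is likewise an assumption about how unknown applications might be flagged.

```python
import math
import statistics

# Assumed feature set: the description only says "11 statistical features
# (e.g. percentiles, minimum, maximum, mean)"; the other names are guesses.
PERCENTILES = (5, 25, 50, 75, 95)


def percentile(sorted_values, q):
    """Percentile q (0-100) of an ascending list, using linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def extract_features(series):
    """Reduce one metric's time series to 11 summary statistics."""
    values = sorted(float(v) for v in series)
    if not values:
        raise ValueError("time series is empty")
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values, mu=mean)  # population standard deviation
    if std > 0:
        skew = sum(((v - mean) / std) ** 3 for v in values) / n
        kurtosis = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    else:
        # A constant series has no spread; define both moments as zero.
        skew = kurtosis = 0.0
    features = {
        "min": values[0],
        "max": values[-1],
        "mean": mean,
        "std": std,
        "skew": skew,
        "kurtosis": kurtosis,
    }
    for q in PERCENTILES:
        features[f"perc{q:02d}"] = percentile(values, q)
    return features


def label_from_scores(scores, threshold=0.5):
    """Pick the most confident per-application classifier, else 'unknown'.

    `scores` maps application name -> confidence of that application's
    one-versus-rest classifier. `threshold` is a hypothetical cut-off.
    """
    best_app = max(scores, key=scores.get)
    return best_app if scores[best_app] >= threshold else "unknown"


if __name__ == "__main__":
    cpu_usage = [12.0, 15.5, 14.0, 90.0, 88.5, 91.0, 13.0]  # made-up samples
    for name, value in extract_features(cpu_usage).items():
        print(f"{name}: {value:.3f}")
    print(label_from_scores({"app_a": 0.91, "app_b": 0.07}))
    print(label_from_scores({"app_a": 0.31, "app_b": 0.22}))
```

Summarising each metric this way replaces an arbitrarily long time series with a fixed-length vector, which is what makes the storage and computation savings mentioned in step 2 possible. The per-application scores would in practice come from trained classifiers; here they are supplied by hand purely to show the decision rule.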


Created:

2023-06-28
