Ransomware and user samples for training and validating ML models

Mendeley Data, updated 2024-03-27, indexed 2024-06-26

Download link:

https://data.mendeley.com/datasets/yhg5wk39kf

Description:

Ransomware has been considered a significant threat to most enterprises for the past few years. In scenarios where users can access all files on a shared server, one infected host is capable of locking access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of 'not infected' traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). The paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separate folder. The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.

Each folder (for example N10S10/) contains:

- tree.py -> Python script with the Tree model.
- ensemble.json -> JSON file with the information about the Ensemble model.
- NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3).
- N10S10.csv -> All samples used for training each model in this folder. It is in csv format for use in the bigML application.
- zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for use in the bigML application.
- userSamples_test -> All samples used for validating each model in this folder. It is in csv format for use in the bigML application.
- userSamples_train -> User samples used for training the models.
- ransomware_train -> Ransomware samples used for training the models.
- scaler.scaler -> Standard Scaler from a Python library, used to scale the samples.
- zeroDays_notFiltered -> Folder with the zeroDay samples.

In the case of the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article because some of them are not "unseen" binaries (their families are present in the training set).

The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows:

- Each line is one sample.
- Each sample has 3*T features and the label (1 if it is an 'infected' sample and 0 if it is not).
- The features are separated by ',' because it is a csv file.
- The last column is the label of the sample.

Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.
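The folder naming scheme and the sample-file layout described above can be illustrated with a minimal Python sketch. It is not part of the dataset: the helper names `parse_config_name` and `read_samples` are hypothetical, the demo rows are synthetic, and the reader assumes the files have no header row (the description only says that each line is one sample).

```python
import csv
import io
import re


def parse_config_name(folder):
    """Split a folder name such as 'N10S10' into (intervals, step_seconds)."""
    match = re.fullmatch(r"N(\d+)S(\d+)", folder.strip("/"))
    if match is None:
        raise ValueError(f"not an NxSy folder name: {folder!r}")
    return int(match.group(1)), int(match.group(2))


def read_samples(handle):
    """Read comma-separated rows whose last column is the 0/1 label.

    Assumption: no header row. Skip the first row yourself if one exists.
    """
    features, labels = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # ignore blank lines
        features.append([float(value) for value in row[:-1]])
        labels.append(int(float(row[-1])))
    return features, labels


# Synthetic stand-in for a samples file (NOT real data from the dataset):
# two samples with 6 feature columns each, followed by the label.
demo = io.StringIO("0.1,0.2,0.3,0.4,0.5,0.6,1\n0,0,0,0,0,0,0\n")
X, y = read_samples(demo)
n, s = parse_config_name("N10S10/")
```

For a real file, the same reader could be called as `read_samples(open("N10S10/N10S10.csv", newline=""))`, assuming the folder has been downloaded locally. The serialization format of scaler.scaler is not stated in the description, so loading it is left out of this sketch.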

Created:

2024-01-23
