RBD24 - Risk Activities Dataset 2024
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13787590
Resource description:
Introduction
This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to an inventory of entities and users, and labels them using the IDS and phishing simulation appliances.
This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data", published in the journal Computers & Security. For more information, see https://doi.org/10.1016/j.cose.2024.104290
Summary of the Datasets
The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.
| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
|---|---|---|---|---|
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |
Methodology
To collect the dataset, we deployed multiple agents and soluble agents within an infrastructure of more than 3,000 entities, comprising laptops, workstations, and smartphones. The ground truth is built with two methods:
- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.
For each user exposed to the behaviors listed in the summary table, multiple time windows (TWs) are computed, each aggregating the user's behavior within a fixed time interval. These TWs serve as the basis for building various supervised and unsupervised methods.
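
The aggregation step can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline: the one-hour window length, the `(user, timestamp, protocol)` event layout, and the `*_count` feature names are all assumptions made for illustration, since the source states neither the window length nor the feature definitions.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical window length; the source does not state the interval used.
WINDOW_SECONDS = 3600


def window_start(ts, window_seconds=WINDOW_SECONDS):
    """Floor a timestamp to the start of its fixed-length window (UTC)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % window_seconds, tz=timezone.utc)


def aggregate_time_windows(events, window_seconds=WINDOW_SECONDS):
    """Count events per protocol for every (user, window start) pair.

    `events` is an iterable of (user, timestamp, protocol) tuples;
    this layout is an assumption, not the dataset's log schema.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for user, ts, protocol in events:
        key = (user, window_start(ts, window_seconds))
        windows[key][f"{protocol}_count"] += 1
    return {key: dict(counts) for key, counts in windows.items()}


# Toy events standing in for HTTP, DNS, and SSL log entries.
events = [
    ("u1", datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc), "http"),
    ("u1", datetime(2024, 1, 1, 10, 40, tzinfo=timezone.utc), "dns"),
    ("u1", datetime(2024, 1, 1, 11, 2, tzinfo=timezone.utc), "http"),
    ("u2", datetime(2024, 1, 1, 10, 59, tzinfo=timezone.utc), "ssl"),
]
tw = aggregate_time_windows(events)
for (user, start), features in sorted(tw.items()):
    print(user, start.isoformat(), features)
```

Each key of `tw` corresponds to one time window for one user, and the value holds that window's aggregated counts. A real pipeline would compute richer indicators than simple counts.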
Sample Representation
The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are used, allowing the construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated, interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.
- **User:** A unique hash value that identifies a user.
- **Timestamp:** The timestamp of the window.
- **Features**
- **Label:** 1 if the user exhibits compromised behavior, 0 otherwise. -1 indicates a TW with an unknown label.
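
The label convention above suggests separating windows with a known label (0 or 1) from unlabeled ones (-1) before any supervised training. The sketch below shows one way to do that on rows held as plain dictionaries, for example obtained from a pandas DataFrame with `df.to_dict("records")`. The field names `user` and `label` are assumptions; check the actual column names in the Parquet files.

```python
from collections import defaultdict


def split_by_label(rows):
    """Separate labeled windows (0 or 1) from unlabeled ones (-1)."""
    labeled = [r for r in rows if r["label"] in (0, 1)]
    unlabeled = [r for r in rows if r["label"] == -1]
    return labeled, unlabeled


def shape_per_label(rows):
    """Return {label: (distinct users, number of windows)}."""
    users = defaultdict(set)
    counts = defaultdict(int)
    for r in rows:
        users[r["label"]].add(r["user"])
        counts[r["label"]] += 1
    return {label: (len(users[label]), counts[label]) for label in counts}


# Toy rows; "user" and "label" are hypothetical field names.
rows = [
    {"user": "a", "label": 0},
    {"user": "a", "label": 0},
    {"user": "b", "label": 1},
    {"user": "c", "label": -1},
    {"user": "b", "label": 0},
]
labeled, unlabeled = split_by_label(rows)
summary = shape_per_label(labeled)
print(summary)
```

The unlabeled windows remain available for unsupervised methods, while the per-label summary gives a quick view of class imbalance.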
Dataset Format
The provided datasets use the Parquet format. Parquet is a columnar storage format, which improves compression and read efficiency and makes it suitable for large datasets and complex analytical tasks. Compared to row-based formats such as CSV, columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. It is supported across many tools and languages, including Python, where pandas can read and write Parquet files through the `pyarrow` or `fastparquet` libraries. The example below reads a file with pandas; make sure the `fastparquet` library is installed first:
```python
import pandas as pd

# Read a Parquet file into a DataFrame
df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
```
Created:
2025-03-04



