RBD24 - Risk Activities Dataset 2024

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13787590
Description:
Introduction

This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to the entity and user inventory, and labels them based on the IDS and phishing simulation appliances.

This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data", published in the journal Computers & Security. For more information, see https://doi.org/10.1016/j.cose.2024.104290

Summary of the Datasets

The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.

| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
|---|---|---|---|---|
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |

Methodology

To collect the dataset, we deployed multiple agents and soluble agents within an infrastructure with more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build ground truth are as follows:

- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.

For each user exposed to the behaviors stated in the summary table, different TWs are computed, each aggregating user behavior within a fixed time interval. These TWs serve as the basis for building various supervised and unsupervised methods.

Sample Representation

The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.

- **User:** A unique hash value that identifies a user.
- **Timestamp:** The timestamp of the window.
- **Features**
- **Label:** 1 if the user exhibits compromised behavior, 0 otherwise. -1 indicates a TW with an unknown label.

Dataset Format

Parquet is a columnar storage format, which improves efficiency and compression and makes it suitable for large datasets and complex analytical tasks. It is supported across various tools and languages, including Python, where pandas can read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization.

The provided datasets use the Parquet format. Here's an example of how to retrieve data using pandas. Make sure the `fastparquet` library is installed:

```python
import pandas as pd

# Read a Parquet file into a DataFrame
df = pd.read_parquet(
    'path_to_your_file.parquet',
    engine='fastparquet'
)
```
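
To make the time window (TW) idea from the Sample Representation section more concrete, here is a minimal sketch of how raw log events could be bucketed into fixed-width windows per user. It is only an illustration, not the pipeline used to build RBD24: the one-hour window width, the `(user, timestamp, source)` event layout, the `aggregate_time_windows` helper, and the per-source event counts are all assumptions. The actual dataset features are manually curated indicators that are not listed in this description.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def aggregate_time_windows(events, window=timedelta(hours=1)):
    """Group (user, timestamp, source) events into fixed-width windows.

    Returns {(user, window_start): {source: event_count}}.
    The window width and the counting feature are illustrative assumptions.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for user, ts, source in events:
        # Floor the timestamp to the start of the window that contains it.
        index = (ts - EPOCH) // window
        window_start = EPOCH + index * window
        windows[(user, window_start)][source] += 1
    return {key: dict(counts) for key, counts in windows.items()}

# Hypothetical events from the four log sources named in the description.
events = [
    ("u1", datetime(2024, 1, 1, 10, 5), "HTTP"),
    ("u1", datetime(2024, 1, 1, 10, 40), "DNS"),
    ("u1", datetime(2024, 1, 1, 10, 55), "HTTP"),
    ("u1", datetime(2024, 1, 1, 11, 2), "SSL"),
    ("u2", datetime(2024, 1, 1, 10, 30), "SMTP"),
]
tw = aggregate_time_windows(events)
```

Each key of `tw` pairs a user with the start of a window, mirroring the User and Timestamp columns described above. A label column (1, 0, or -1) would then be attached per window from the IDS alerts or the phishing campaign results.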
Created:
2025-03-04