Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset
Source: NIAID Data Ecosystem. Indexed: 2026-05-01
Download link:
https://zenodo.org/record/7956303
Description:
This dataset was prepared and used as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now available to users and researchers who need a reliable and diverse dataset for training and testing their own custom models.
The validation dataset is a collection of labeled entries, each marking a packet as "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, covering the edge and fog layers that are in contact with the cloud layer, so that different models can be tested and evaluated thoroughly. Each sample is labeled with its ground truth, providing a reliable reference for evaluating model performance.
For convenient distribution and storage, the dataset has been split into three batches. Each batch is provided as an individual compressed file that can be downloaded and managed separately.
To extract the data, follow these steps:
1. Download and install bzip2 (if not already installed) from the official website or your package manager.
2. Place the compressed dataset file in a directory of your choice.
3. Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
4. Run the following command to decompress the dataset, replacing "filename.bz2" with the actual name of the compressed dataset file:
   bzip2 -d filename.bz2
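Because each batch expands to hundreds of gigabytes, it can help to look at the start of a file before decompressing all of it. The following is a minimal Python sketch, not part of the original instructions, that streams the first few lines straight from a .bz2 archive with the standard-library bz2 module. It assumes the decompressed data is line-oriented text (for example CSV), which the description does not confirm. The function name peek_bz2 is hypothetical.

```python
import bz2
from itertools import islice


def peek_bz2(path, n_lines=5, encoding="utf-8"):
    """Return the first n_lines lines of a bzip2-compressed text file.

    The archive is decompressed incrementally, so only a small part of
    it is read and nothing is written to disk.
    """
    # "rt" opens the archive in text mode; decoding happens on the fly.
    with bz2.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]
```

Calling peek_bz2("filename.bz2"), with the placeholder replaced by an actual batch file name, returns the first five lines as a list of strings, which is usually enough to check the delimiter and column order.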
Once decompressed, the dataset is available in its original format for further exploration, analysis, and model training. Extraction requires approximately 800 GB of storage in total: approximately 302 GB for the first batch, 203 GB for the second batch, and 297 GB for the third batch.
The first batch contains 1,049,527,992 entries, the second batch contains 711,043,331 entries, and the third and last batch contains 1,029,303,062 entries. The following table lists each feature name with its explanation and an example value from the extracted dataset.
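Given these sizes, it may be worth checking free disk space before starting an extraction. The sketch below is one possible way to do so with Python's standard library. The batch labels, the function names, and the 5% safety margin are hypothetical choices, and the code assumes decimal gigabytes (10^9 bytes), since the description does not say which unit is meant.

```python
import shutil

# Approximate extracted sizes in GB, taken from the description.
BATCH_SIZES_GB = {"batch 1": 302, "batch 2": 203, "batch 3": 297}


def required_bytes(batches, sizes_gb=BATCH_SIZES_GB):
    """Sum the approximate extracted size of the chosen batches in bytes."""
    # Assumption: 1 GB = 10**9 bytes.
    return sum(sizes_gb[name] for name in batches) * 10**9


def has_room(path, batches, headroom=0.05):
    """Return True if `path` has enough free space plus a safety margin."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes(batches) * (1 + headroom)
```

The three per-batch figures add up to 802 GB, which matches the stated total of approximately 800 GB. Note that bzip2 -d removes the compressed file after a successful extraction unless the -k option is given, so keeping the archives requires additional space.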
| Feature | Description | Example Value |
| --- | --- | --- |
| ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17 |
| ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5 |
| frame.time_epoch | Epoch time of the frame | 1676165569.930869 |
| arp.hw.type | Hardware type | 1 |
| arp.hw.size | Hardware size | 6 |
| arp.proto.size | Protocol size | 4 |
| arp.opcode | Opcode | 2 |
| data.len | Length | 2713 |
| eth.dst.lg | Destination LG bit | 1 |
| eth.dst.ig | Destination IG bit | 1 |
| eth.src.lg | Source LG bit | 1 |
| eth.src.ig | Source IG bit | 1 |
| frame.offset_shift | Time shift for this packet | 0 |
| frame.len | Frame length on the wire | 1208 |
| frame.cap_len | Frame length stored into the capture file | 215 |
| frame.marked | Frame is marked | 0 |
| frame.ignored | Frame is ignored | 0 |
| frame.encap_type | Encapsulation type | 1 |
| gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)' |
| ip.version | Version | 6 |
| ip.hdr_len | Header length | 24 |
| ip.dsfield.dscp | Differentiated Services Codepoint | 56 |
| ip.dsfield.ecn | Explicit Congestion Notification | 2 |
| ip.len | Total length | 614 |
| ip.flags.rb | Reserved bit | 0 |
| ip.flags.df | Don't fragment | 1 |
| ip.flags.mf | More fragments | 0 |
| ip.frag_offset | Fragment offset | 0 |
| ip.ttl | Time to live | 31 |
| ip.proto | Protocol | 47 |
| ip.checksum.status | Header checksum status | 2 |
| tcp.srcport | TCP source port | 53425 |
| tcp.flags | Flags | 0x00000098 |
| tcp.flags.ns | Nonce | 0 |
| tcp.flags.cwr | Congestion Window Reduced (CWR) | 1 |
| udp.srcport | UDP source port | 64413 |
| udp.dstport | UDP destination port | 54087 |
| udp.stream | Stream index | 1345 |
| udp.length | Length | 225 |
| udp.checksum.status | Checksum status | 3 |
| packet_type | Type of the packet which is either "benign" or "malicious" | 0 |
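Two of the example values are easier to read once decoded. The sketch below, which is illustrative and not taken from the dataset's own tooling, converts a frame.time_epoch value to a UTC timestamp and expands a tcp.flags value into flag names using the standard TCP flag bit positions. It assumes the values are stored as text in the forms shown in the table; the function names are hypothetical.

```python
from datetime import datetime, timezone

# Standard TCP flag bit masks, lowest bit first.
TCP_FLAG_BITS = [
    ("FIN", 0x001), ("SYN", 0x002), ("RST", 0x004), ("PSH", 0x008),
    ("ACK", 0x010), ("URG", 0x020), ("ECE", 0x040), ("CWR", 0x080),
    ("NS", 0x100),
]


def decode_tcp_flags(value):
    """Return the names of the flags set in a tcp.flags value such as '0x00000098'."""
    bits = int(value, 16) if isinstance(value, str) else int(value)
    return [name for name, mask in TCP_FLAG_BITS if bits & mask]


def epoch_to_utc(value):
    """Convert a frame.time_epoch value (seconds since 1970-01-01) to a UTC datetime."""
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


print(decode_tcp_flags("0x00000098"))            # ['PSH', 'ACK', 'CWR']
print(epoch_to_utc("1676165569.930869").date())  # 2023-02-12
```

The tcp.flags example 0x00000098 therefore has the PSH, ACK, and CWR bits set, and the frame.time_epoch example falls on 12 February 2023 (UTC).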
Furthermore, in compliance with the GDPR and to protect the privacy of individuals, all IP addresses in the dataset have been anonymized through hashing. This protects the identity of individuals while preserving the integrity and utility of the dataset for research and model development.
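The description does not state which hash function was used or whether a salt was applied. The example ip.src and ip.dst values are 64 hexadecimal characters long, which is consistent with a 256-bit digest such as SHA-256, but that is an inference rather than a documented fact. Under that assumption, a minimal sketch of hash-based pseudonymization could look as follows; the function name and the salt parameter are hypothetical.

```python
import hashlib
import ipaddress


def pseudonymize_ip(address, salt=b""):
    """Map an IP address to a 64-character hexadecimal SHA-256 digest.

    Assumption: SHA-256 with an optional secret salt. The scheme that
    was actually used for the dataset is not documented.
    """
    # Normalizing first makes equivalent spellings hash identically.
    canonical = str(ipaddress.ip_address(address))
    return hashlib.sha256(salt + canonical.encode("ascii")).hexdigest()
```

Because the mapping is deterministic, the same address always yields the same digest, so packets can still be grouped by source or destination without exposing the address itself. The salt parameter is included in this sketch because the IPv4 address space is small enough to enumerate, which would allow unsalted digests to be reversed by brute force.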
Please note that while the dataset provides a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. It nonetheless offers researchers, practitioners, and enthusiasts in the machine learning community a compliant, anonymized dataset for developing and validating custom models in this problem domain.
By using the validation dataset for model evaluation and custom model training, users can accelerate their research and development, building on the knowledge gained from my thesis while contributing to the advancement of the field.
Created:
2023-05-23



