Dataset used for training IoT C&C classifier
Download link:
https://zenodo.org/record/6396922
Description:
This dataset was used for training the IoT C&C classifier. It is provided in the form of extended bidirectional flow data. The flow data were generated by the ipfixprobe flow exporter and converted into CSV files. In addition to traditional flow information (IP addresses, ports, amount of transferred data), each record carries per-packet information for the first 30 packets of the flow. ipfixprobe ran with its default timeouts (5 minutes active, 30 s inactive). The flow records were then aggregated into 5-minute intervals; when a flow had been split due to inactivity, the aggregator stitched the pieces back into a single flow.
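The stitching step described above can be illustrated with a minimal sketch. It is not the authors' aggregator: the record fields (`key`, `first`, `last`, `bytes`, `packets`) are hypothetical, and aligning windows to multiples of 300 seconds is an assumption made for illustration.

```python
WINDOW = 300  # aggregation interval in seconds (5 minutes)

def aggregate(records, window=WINDOW):
    """Merge flow records that share a flow key and fall into the same window.

    Each record is a dict with hypothetical fields: 'key' (e.g. the flow
    5-tuple), 'first' and 'last' (epoch seconds), 'bytes' and 'packets'.
    """
    merged = {}
    for rec in sorted(records, key=lambda r: r["first"]):
        slot = (rec["key"], int(rec["first"] // window))
        if slot not in merged:
            # First record of this flow in this window; copy it and start counting.
            merged[slot] = dict(rec, count=1)
            continue
        agg = merged[slot]
        agg["last"] = max(agg["last"], rec["last"])
        agg["bytes"] += rec["bytes"]
        agg["packets"] += rec["packets"]
        agg["count"] += 1  # plays the role of the COUNT column
    return list(merged.values())
```

A flow that the exporter split after 30 s of inactivity yields two records in the same window; the sketch folds them into one record with a count of 2.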
The column headers in the provided CSV files have the following meaning:
Column Name                Description
ipaddr DST_IP              Destination IP address
ipaddr SRC_IP              Source IP address
uint64 BYTES               Number of bytes transmitted SRC->DST
uint64 BYTES_REV           Number of bytes transmitted DST->SRC
time TIME_FIRST            Timestamp of the first packet in the flow, in the format YYYY-MM-DDTHH-MM-SS
time TIME_LAST             Timestamp of the last packet in the flow, in the format YYYY-MM-DDTHH-MM-SS
macaddr DST_MAC            Destination MAC address
macaddr SRC_MAC            Source MAC address
uint32 COUNT               Number of aggregated flow records
uint32 PACKETS             Number of packets transmitted SRC->DST
uint32 PACKETS_REV         Number of packets transmitted DST->SRC
uint16 DST_PORT            Destination port
uint16 SRC_PORT            Source port
uint8 DIR_BIT_FIELD        Flag distinguishing WAN (1) from LAN (0)
uint8 PROTOCOL             Transport protocol number
uint8 TCP_FLAGS            Bitwise OR of the TCP flags of all packets transmitted SRC->DST
uint8 TCP_FLAGS_REV        Bitwise OR of the TCP flags of all packets transmitted DST->SRC
int8* PPI_PKT_DIRECTIONS   Array of packet directions: 1 for SRC->DST, -1 for DST->SRC
uint8* PPI_PKT_FLAGS       Array of the packets' TCP flags
uint16* PPI_PKT_LENGTHS    Array of the packets' payload lengths
time* PPI_PKT_TIMES        Array of the packets' timestamps
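As a hedged illustration of working with these headers, the sketch below splits a "type NAME" header line into a name-to-type mapping and decodes a cumulative TCP flags byte. The comma separator and the assumption that TCP_FLAGS follows the standard TCP header bit layout (FIN as the lowest bit) are not stated in the dataset description; the helper names are hypothetical.

```python
# Standard TCP flag bits, lowest bit first (assumed layout of TCP_FLAGS).
TCP_FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]

def split_header(header_line, sep=","):
    """Map column name -> declared type for a header such as
    'ipaddr DST_IP,uint64 BYTES'. The separator is an assumption."""
    columns = {}
    for field in header_line.strip().split(sep):
        type_name, column_name = field.strip().split(" ", 1)
        columns[column_name] = type_name
    return columns

def decode_tcp_flags(value):
    """Return the names of the flags set in a cumulative TCP flags byte."""
    return [name for bit, name in enumerate(TCP_FLAG_NAMES) if value & (1 << bit)]

print(split_header("ipaddr DST_IP,uint64 BYTES"))
# → {'DST_IP': 'ipaddr', 'BYTES': 'uint64'}
print(decode_tcp_flags(0x12))
# → ['SYN', 'ACK']
```

Types ending in `*` (for example `int8*`) mark array columns; how the arrays are serialized inside a CSV cell is not described here, so the sketch leaves them unparsed.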
The dataset consists of two parts: a benign part captured on a real ISP network and a malicious part captured in a lab environment.
Benign part captured on the real ISP network
This part was created by capturing packets at the metering points located at the perimeter of the CESNET2 network. The metering points monitor 100 Gbps backbone peering lines used by approximately half a million users. We filtered the captured packets by port. The CESNET training capture was used as benign traffic in the C&C model training and testing pipeline to cover the nuances and variability of benign data seen in an ISP-level network. Since we deal with data from a production network, we cannot guarantee that all captured communication is benign. However, we verified every IP address against the internal blocklist of the CESNET association and against external blocklists, namely AbuseIPDB and URLhaus.
Because these are real captures, the IP addresses and MAC addresses were anonymized.
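A blocklist check of the kind described above might look like the following minimal sketch. The plain-text blocklist format (one IP address or CIDR range per line, `#` for comments) is an assumption; the real AbuseIPDB and URLhaus feeds have their own formats and APIs, which are not reproduced here. The function names are hypothetical.

```python
import ipaddress

def load_blocklist(lines):
    """Parse blocklist entries (single IPs or CIDR ranges),
    skipping blank lines and '#' comments."""
    networks = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        # A bare address becomes a /32 (or /128) network.
        networks.append(ipaddress.ip_network(entry, strict=False))
    return networks

def is_blocklisted(address, networks):
    """Return True if the address falls inside any blocklisted network."""
    ip = ipaddress.ip_address(address)
    return any(ip.version == net.version and ip in net for net in networks)

blocklist = load_blocklist(["# example entries", "192.0.2.0/24", "198.51.100.7"])
print(is_blocklisted("192.0.2.55", blocklist))   # → True
print(is_blocklisted("203.0.113.1", blocklist))  # → False
```

Such a check has to run on the original addresses, before anonymization, since anonymized addresses no longer match public blocklists.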
Malicious part created in a controlled lab environment
From leaked source code, we picked one variant from each of the most prevalent client-server IoT botnet families: (1) Tsunami, (2) Gafgyt, (3) Mirai. Each implements a distinct communication protocol: Tsunami is an IRC bot, Gafgyt uses a simple text-based protocol, and Mirai implements a custom binary protocol. We then prepared a virtualized testing environment.
We deployed the malware in a controlled manner, filtering out its scanning and exploitation activities. The dataset covers the most notable C&C behavior. As previously recognized, C&C communication consists of the C&C heartbeat and bot commands. Thus, for each of the three prepared malware variants, we first captured the malware running with no received commands. This capture includes the initiation of the TCP connection to the C&C server and continues for one hour. We then captured the malware receiving commands from its C&C server. The position of the command packets is chosen arbitrarily relative to the background heartbeat packets because, in a real-world scenario, the timing of the commands is tied to a random human action.
Directory tree of the provided dataset
.
├── README.md
├── benign
│   ├── AN_p20-21-25-143-3389.agg.head.csv
│   ├── AN_p22.agg.head.csv
│   ├── AN_p443.agg.head.csv
│   ├── AN_p80.agg.head.csv
│   └── AN_p8080.agg.head.csv
└── cnc
    ├── kaiten
    │   ├── cnc.csv
    │   ├── command-01.csv
    │   ├── command-02.csv
    │   ├── command-03.csv
    │   ├── command-04.csv
    │   ├── command-05.csv
    │   ├── command-06.csv
    │   ├── command-07.csv
    │   └── command-08.csv
    ├── mirai
    │   ├── cnc.csv
    │   ├── command-01.csv
    │   ├── command-02.csv
    │   ├── command-03.csv
    │   ├── command-04.csv
    │   ├── command-05.csv
    │   ├── command-06.csv
    │   ├── command-07.csv
    │   └── command-08.csv
    └── qbot
        ├── cnc.csv
        ├── command-01.csv
        ├── command-02.csv
        ├── command-03.csv
        └── command-04.csv
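Given this layout, a labeled file list for training could be collected as in the sketch below. The function name and the labeling scheme are hypothetical. The family label is simply the directory name; reading `kaiten` and `qbot` as the Tsunami and Gafgyt captures is an assumption based on Kaiten and QBot being widely used aliases of those families.

```python
from pathlib import Path

def label_files(root):
    """Return (path, label) pairs for every CSV under benign/ and cnc/<family>/."""
    root = Path(root)
    labeled = []
    for path in sorted((root / "benign").glob("*.csv")):
        labeled.append((path, "benign"))
    for path in sorted((root / "cnc").glob("*/*.csv")):
        # The label is the family directory name: kaiten, mirai, or qbot.
        labeled.append((path, path.parent.name))
    return labeled
```

Calling `label_files(".")` from the dataset root would pair each file with `benign` or its family directory name; whether `cnc.csv` (heartbeat only) and `command-NN.csv` should share a label depends on the classifier being trained.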
Acknowledgment
This research was funded by the Ministry of Interior of the Czech Republic,
grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis and also by the
Grant Agency of the CTU in Prague, grant No. SGS20/210/OHK3/3T/18 funded by
the MEYS of the Czech Republic.
Created: 2022-03-31



