Dataset used for training IoT C&C classifier

Source: NIAID Data Ecosystem, indexed 2026-03-13
Download link:
https://zenodo.org/record/6396922
Description:
This dataset was used for training the IoT C&C classifier. It is provided in the form of extended bidirectional flow data. The flow data were generated by the ipfixprobe flow exporter and converted into CSV files. In addition to traditional flow information (IP addresses, ports, amount of transferred data), ipfixprobe, run with default timeouts (5 minutes active, 30 s inactive), generated per-packet information for the first 30 packets. The flow records were then aggregated into 5-minute intervals; when a flow was split due to inactivity, the aggregator stitched it back into a single flow.

The column headers in the provided CSV files stand for:

| Column Name | Description |
|---|---|
| ipaddr DST_IP | Destination IP address |
| ipaddr SRC_IP | Source IP address |
| uint64 BYTES | Number of bytes transmitted SRC->DST |
| uint64 BYTES_REV | Number of bytes transmitted DST->SRC |
| time TIME_FIRST | Timestamp of the first packet in the flow, in format YYYY-MM-DDTHH-MM-SS |
| time TIME_LAST | Timestamp of the last packet in the flow, in format YYYY-MM-DDTHH-MM-SS |
| macaddr DST_MAC | Destination MAC address |
| macaddr SRC_MAC | Source MAC address |
| uint32 COUNT | Number of aggregated flow records |
| uint32 PACKETS | Number of packets transmitted from source to destination |
| uint32 PACKETS_REV | Number of packets transmitted from destination to source |
| uint16 DST_PORT | Destination port |
| uint16 SRC_PORT | Source port |
| uint8 DIR_BIT_FIELD | Flag distinguishing WAN (1) from LAN (0) |
| uint8 PROTOCOL | Transport protocol number |
| uint8 TCP_FLAGS | Logical OR across all TCP flags in the packets transmitted SRC->DST |
| uint8 TCP_FLAGS_REV | Logical OR across all TCP flags in the packets transmitted DST->SRC |
| int8* PPI_PKT_DIRECTIONS | Array of packet directions: (1) SRC->DST, (-1) DST->SRC |
| uint8* PPI_PKT_FLAGS | Array of packets' TCP flags |
| uint16* PPI_PKT_LENGTHS | Array of packets' payload lengths |
| time* PPI_PKT_TIMES | Array of packets' timestamps |

The dataset consists of two parts: a benign part captured on the real ISP network and a malicious
part captured in a lab environment.

Benign part captured on the real ISP network

This part was created by packet capturing at the metering points located at the perimeter of the CESNET2 network. The metering points monitor 100 Gbps backbone peering lines used by approximately half a million users. We performed packet filtering based on ports for the capture. The CESNET training capture was used as benign traffic in the C&C model training and testing pipeline to cover the potential nuances and variability of benign data seen in an ISP-level network. Since we deal with data from a production network, we cannot guarantee the benign nature of all captured communication. However, we verified every IP address against the internal blocklist of the CESNET association and against external ones, namely the AbuseIPDB and URLhaus blocklists. Since we are dealing with real captures, the IP addresses and MAC addresses were anonymized.

Malicious part created in the controlled lab environment

From leaked source codes, we picked one variant from each of the most prevalent client-server IoT botnet families: (1) Tsunami (also known as Kaiten), (2) Gafgyt (also known as Qbot), (3) Mirai. Each implements a distinct communication protocol: Tsunami is an example of an IRC bot, Gafgyt uses a simple text-based protocol, and Mirai implements a custom binary protocol. Afterward, we prepared a virtualized testing environment. We deployed the malware in a controlled manner, filtering out its scanning and exploiting activities.

The dataset covers the most notable C&C behavior. As previously recognized, C&C communication consists of the C&C heartbeat and bot commands. Thus, for each of the three prepared malware variants, we first imagine the malware running with no received commands. That includes the initiation of the TCP connection to the C&C server, which continues for one hour. And then, we imagine the malware receiving commands from its C&C server.
The position of the command packets is chosen arbitrarily relative to the background heartbeat packets because, in the real-world scenario, the timing of the commands is tied to a random human action.

Directory tree of provided dataset

    .
    ├── README.md
    ├── benign
    │   ├── AN_p20-21-25-143-3389.agg.head.csv
    │   ├── AN_p22.agg.head.csv
    │   ├── AN_p443.agg.head.csv
    │   ├── AN_p80.agg.head.csv
    │   └── AN_p8080.agg.head.csv
    └── cnc
        ├── kaiten
        │   ├── cnc.csv
        │   ├── command-01.csv
        │   ├── command-02.csv
        │   ├── command-03.csv
        │   ├── command-04.csv
        │   ├── command-05.csv
        │   ├── command-06.csv
        │   ├── command-07.csv
        │   └── command-08.csv
        ├── mirai
        │   ├── cnc.csv
        │   ├── command-01.csv
        │   ├── command-02.csv
        │   ├── command-03.csv
        │   ├── command-04.csv
        │   ├── command-05.csv
        │   ├── command-06.csv
        │   ├── command-07.csv
        │   └── command-08.csv
        └── qbot
            ├── cnc.csv
            ├── command-01.csv
            ├── command-02.csv
            ├── command-03.csv
            └── command-04.csv

Acknowledgment

This research was funded by the Ministry of Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis and also by the Grant Agency of the CTU in Prague, grant No. SGS20/210/OHK3/3T/18 funded by the MEYS of the Czech Republic.
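
Usage sketch

For readers who want to load the CSV files described above, the following is a minimal Python sketch, not the authors' tooling. It rests on assumptions that should be checked against the actual files: that each header cell carries a type prefix (for example `uint64 BYTES`), and that the per-packet arrays are serialized as `[1|-1|1]`. The helper names (`field_name`, `parse_array`, `decode_tcp_flags`, `read_flows`) and the sample row are hypothetical and exist only for illustration. The TCP flag bit values are the standard ones from the TCP header.

```python
import csv
import io
import re

# Standard TCP header flag bits, lowest bit first.
TCP_FLAG_BITS = [(0x01, "FIN"), (0x02, "SYN"), (0x04, "RST"),
                 (0x08, "PSH"), (0x10, "ACK"), (0x20, "URG")]


def field_name(header):
    """Drop an optional type prefix: 'uint64 BYTES' -> 'BYTES'."""
    return header.strip().split()[-1]


def parse_array(cell):
    """Split a PPI_* cell into a list of strings.

    ASSUMPTION: arrays look like '[1|-1|1]'; commas are accepted as well.
    """
    inner = cell.strip().strip("[]")
    return [item for item in re.split(r"[|,]", inner) if item]


def decode_tcp_flags(value):
    """Return the names of the flags set in an OR-ed TCP flag byte."""
    return [name for bit, name in TCP_FLAG_BITS if value & bit]


def read_flows(text_stream):
    """Yield one dict per flow record, keyed by bare field name."""
    reader = csv.reader(text_stream)
    names = [field_name(h) for h in next(reader)]
    for row in reader:
        yield dict(zip(names, row))


# Invented sample data (NOT taken from the dataset), reduced to a few columns.
SAMPLE = (
    "ipaddr DST_IP,ipaddr SRC_IP,uint8 TCP_FLAGS,"
    "int8* PPI_PKT_DIRECTIONS,uint16* PPI_PKT_LENGTHS\n"
    "10.0.0.2,10.0.0.1,27,[1|-1|1],[0|0|42]\n"
)

flow = next(read_flows(io.StringIO(SAMPLE)))
directions = [int(d) for d in parse_array(flow["PPI_PKT_DIRECTIONS"])]
lengths = [int(n) for n in parse_array(flow["PPI_PKT_LENGTHS"])]

# Payload bytes sent SRC->DST among the recorded packets.
forward_payload = sum(n for d, n in zip(directions, lengths) if d == 1)
flags = decode_tcp_flags(int(flow["TCP_FLAGS"]))

print(forward_payload)  # 42
print(flags)            # ['FIN', 'SYN', 'PSH', 'ACK']
```

To read a real file, pass an open file handle (for example one opened on `cnc/mirai/cnc.csv`) to `read_flows` instead of the in-memory sample. If the files use a different array or header layout, only `parse_array` and `field_name` need adjusting.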
Created:
2022-03-31