CESNET-TLS-Year22: A year-spanning TLS network traffic dataset from backbone lines
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/10608606
Resource description:
We recommend using the CESNET DataZoo Python library, which simplifies working with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.
The modern approach for network traffic classification (TC), which is an important part of operating and securing networks, is to use machine learning (ML) models that are able to learn intricate relationships between traffic characteristics and communicating applications. A crucial prerequisite is having representative datasets. However, datasets collected from real production networks are not being published in sufficient numbers. Thus, this paper presents a novel dataset, CESNET-TLS-Year22, that captures the evolution of TLS traffic in an ISP network over a year. The dataset contains 180 web service labels and standard TC features, such as packet sequences. The unique year-long time span enables comprehensive evaluation of TC models and assessment of their robustness in the face of the ever-changing environment of production networks.
Data description
The dataset consists of network flows describing encrypted TLS communications. Flows are extended with packet sequences, histograms, and fields extracted from the TLS ClientHello message, which is transmitted in the first packet of the TLS connection handshake. The most important extracted handshake field is the SNI domain, which is used for ground-truth labeling.
Packet sequences
Sequences of packet sizes, directions, and inter-packet times are standard data input for traffic analysis. For packet sizes, we consider the payload size after transport headers (TCP headers in the TLS case). We omit packets with no TCP payload, for example ACKs, because zero-payload packets relate to transport-layer internals rather than to the services' behavior. Packet directions are encoded as ±1, where +1 means a packet sent from client to server, and -1 is a packet from server to client. Inter-packet times depend on the location of the communicating hosts, their distance, and the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate a response. Packet sequences have a maximum length of 30, which is the default setting of the flow exporter used. We also derive three fields from each packet sequence: its length, its time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction; in other words, each client request and server response pair counts as one roundtrip.
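The three derived fields can be illustrated with a minimal sketch. It is not the exporter's implementation. The function name `derive_ppi_fields` is hypothetical, and two assumptions are made: a roundtrip is read as a client packet followed directly by a server packet (a +1 to -1 change of direction), and the duration is taken as the sum of the inter-packet times, with the first entry being 0.

```python
from typing import Sequence


def derive_ppi_fields(ipt: Sequence[float], directions: Sequence[int]) -> dict:
    """Derive length, duration, and roundtrips from one packet sequence.

    ipt        -- inter-packet times, kept in whatever unit the input uses
    directions -- +1 for client-to-server packets, -1 for server-to-client
    """
    if len(ipt) != len(directions):
        raise ValueError("sequences must have the same length")

    # Assumption: one roundtrip = a client packet immediately followed by a
    # server packet, i.e. a +1 -> -1 change of direction.
    roundtrips = sum(
        1
        for prev, cur in zip(directions, directions[1:])
        if prev == 1 and cur == -1
    )
    return {
        "length": len(directions),
        "duration": sum(ipt),  # assumption: first inter-packet time is 0
        "roundtrips": roundtrips,
    }


# Made-up sequence: request, two response packets, request, response.
fields = derive_ppi_fields([0, 12, 3, 40, 5], [1, -1, -1, 1, -1])
```

For this made-up sequence the sketch reports a length of 5, a duration of 60 (in the input's time unit), and 2 roundtrips. Counting every change of direction instead would give 3, so the values should be checked against the PPI_ROUNDTRIPS column before relying on either reading.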
Flow statistics
Each data record also includes standard flow statistics, representing aggregated information about the entire bidirectional connection. The fields are the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms. The packet histograms include binned counts (not limited to the first 30 packets) of packet sizes and inter-packet times in both directions. There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes (more information is available in the PHISTS plugin documentation). Moreover, each flow has an end reason: it either ended with the TCP connection termination (FIN packets), was idle, reached the active timeout, or ended due to other reasons. These reasons correspond to the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field covers the forced-end and lack-of-resources reasons.
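The eight-bin layout can be sketched as follows. This is an illustration only, not the PHISTS plugin code. The names `UPPER_EDGES`, `bin_index`, and `histogram` are hypothetical, and the bin boundaries are copied from the intervals as listed above, so a value of exactly 1024 lands in the seventh bin.

```python
from bisect import bisect_left
from typing import Iterable, List

# Upper edges of the first seven bins, as listed in the text. The eighth bin
# collects everything above 1024.
UPPER_EDGES = [15, 31, 63, 127, 255, 511, 1024]


def bin_index(value: float) -> int:
    """Return the 0-based bin (0..7) for a packet size in bytes or an
    inter-packet time in milliseconds."""
    if value < 0:
        raise ValueError("value must be non-negative")
    # bisect_left returns the number of edges strictly below the value, which
    # is the index of the first bin whose upper edge is >= value.
    return bisect_left(UPPER_EDGES, value)


def histogram(values: Iterable[float]) -> List[int]:
    """Count values into the eight logarithmically scaled bins."""
    counts = [0] * (len(UPPER_EDGES) + 1)
    for value in values:
        counts[bin_index(value)] += 1
    return counts


# Made-up payload sizes in bytes.
example = histogram([0, 15, 16, 100, 517, 1024, 1460, 1460])
```

Because the same edges apply to both quantities, one helper serves the size and the inter-packet-time histograms; only the unit of the input differs.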
Dataset structure
The dataset is organized by week and by day. The flows are delivered in compressed CSV files. CSV files contain one flow per row; the data columns are summarized in the list below. For each flow data file, there is a JSON file with the total number of saved flows and the number of flows per service. There are also files aggregating flow counts for each week (stats-week.json) and for the entire dataset (stats-dataset.json). The following list describes the flow data fields in the CSV files:
ID: Unique identifier
SRC_IP: Source IP address
DST_IP: Destination IP address
DST_ASN: Destination Autonomous System number
SRC_PORT: Source port
DST_PORT: Destination port
PROTOCOL: Transport protocol
FLAG_CWR: Presence of the CWR flag
FLAG_CWR_REV: Presence of the CWR flag in the reverse direction
FLAG_ECE: Presence of the ECE flag
FLAG_ECE_REV: Presence of the ECE flag in the reverse direction
FLAG_URG: Presence of the URG flag
FLAG_URG_REV: Presence of the URG flag in the reverse direction
FLAG_ACK: Presence of the ACK flag
FLAG_ACK_REV: Presence of the ACK flag in the reverse direction
FLAG_PSH: Presence of the PSH flag
FLAG_PSH_REV: Presence of the PSH flag in the reverse direction
FLAG_RST: Presence of the RST flag
FLAG_RST_REV: Presence of the RST flag in the reverse direction
FLAG_SYN: Presence of the SYN flag
FLAG_SYN_REV: Presence of the SYN flag in the reverse direction
FLAG_FIN: Presence of the FIN flag
FLAG_FIN_REV: Presence of the FIN flag in the reverse direction
TLS_SNI: Server Name Indication domain
TLS_JA3: JA3 fingerprint of TLS client
TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
DURATION: Duration of the flow in seconds
BYTES: Number of transmitted bytes from client to server
BYTES_REV: Number of transmitted bytes from server to client
PACKETS: Number of packets transmitted from client to server
PACKETS_REV: Number of packets transmitted from server to client
PPI: Packet sequence in the format: [[inter-packet times], [packet directions], [packet sizes], [push flags]]
PPI_LEN: Number of packets in the PPI sequence
PPI_DURATION: Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
PHIST_SRC_SIZES: Histogram of packet sizes from client to server
PHIST_DST_SIZES: Histogram of packet sizes from server to client
PHIST_SRC_IPT: Histogram of inter-packet times from client to server
PHIST_DST_IPT: Histogram of inter-packet times from server to client
APP: Web service label
CATEGORY: Service category
FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
FLOW_ENDREASON_END: Flow ended with the TCP connection termination
FLOW_ENDREASON_OTHER: Flow was terminated for other reasons
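As a rough illustration of how the fields above fit together, the sketch below parses one row with Python's standard `csv` module. Several things are assumptions rather than facts about the published files: that the PPI column is stored as a JSON-compatible nested list, that the FLOW_ENDREASON_* columns hold values such as `True`/`False` or `1`/`0`, and the sample row itself, which is made up and contains only a subset of the columns. The helper name `parse_flow_row` is hypothetical.

```python
import csv
import io
import json

END_REASON_COLUMNS = [
    "FLOW_ENDREASON_IDLE",
    "FLOW_ENDREASON_ACTIVE",
    "FLOW_ENDREASON_END",
    "FLOW_ENDREASON_OTHER",
]

# Made-up sample with a subset of the columns; not taken from the dataset.
SAMPLE = (
    "APP,BYTES,BYTES_REV,PPI,FLOW_ENDREASON_IDLE,FLOW_ENDREASON_ACTIVE,"
    "FLOW_ENDREASON_END,FLOW_ENDREASON_OTHER\n"
    'example-service,1200,5400,"[[0, 10, 2], [1, -1, -1], [517, 1400, 300],'
    ' [1, 0, 1]]",False,False,True,False\n'
)


def parse_flow_row(row: dict) -> dict:
    """Decode the PPI field and collapse the end-reason flags of one row."""
    # Assumption: PPI is a JSON-compatible list in the documented order.
    ipt, directions, sizes, push_flags = json.loads(row["PPI"])
    # Assumption: the flag that is set is written as "True" or "1".
    end_reason = next(
        (col for col in END_REASON_COLUMNS
         if row[col].strip().lower() in {"1", "true"}),
        None,
    )
    return {
        "app": row["APP"],
        "bytes_total": int(row["BYTES"]) + int(row["BYTES_REV"]),
        "ipt": ipt,
        "directions": directions,
        "sizes": sizes,
        "push_flags": push_flags,
        "end_reason": end_reason,
    }


flows = [parse_flow_row(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
```

For the real files, the in-memory sample would be replaced by a text stream opened over the decompressed CSV, and the parsing of PPI adjusted to whatever encoding the files actually use.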
Created:
2025-03-24



