CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines
Mendeley Data · Updated 2024-06-29 · Indexed 2024-06-29
Download link:
https://zenodo.org/record/7409924
Description:
Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888.

The QUIC (Quick UDP Internet Connections) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Because its design makes the inspection of QUIC handshakes challenging, and because it is used in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network that connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, the included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.

Data capture

The data were captured in the flow monitoring infrastructure of the CESNET2 network. The capture ran for four weeks, from 31.10.2022 to 27.11.2022. The following table provides the per-week flow count, capture period, and uncompressed size:

Name            Uncompressed Size   Capture Period             Flows
W-2022-44       19 GB               31.10.2022 - 6.11.2022     32.6M
W-2022-45       25 GB               7.11.2022 - 13.11.2022     42.6M
W-2022-46       20 GB               14.11.2022 - 20.11.2022    33.7M
W-2022-47       25 GB               21.11.2022 - 27.11.2022    44.1M
CESNET-QUIC22   89 GB               31.10.2022 - 27.11.2022    153M

Data description

The dataset consists of network flows describing encrypted QUIC communications.
Flows were created using the ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the QUIC protocol version in use, and the user agent string, which is available in a subset of QUIC communications.

Packet Sequences

Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider the payload size after transport headers (UDP headers in the QUIC case). Packet directions are encoded as ±1: +1 means a packet sent from client to server, and -1 a packet sent from server to client. Inter-packet times depend on the location of the communicating hosts, their distance, and the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the flow exporter used. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from the packet directions data); in other words, each client request and server response pair counts as one roundtrip.

Flow statistics

Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms.
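The three fields derived from a packet sequence (length, duration, number of roundtrips) can be sketched as follows. This is a minimal illustration, not the code used to build the dataset. The function name `ppi_derived_fields` is hypothetical. The sketch assumes the first inter-packet time is 0, so that the sum of inter-packet times equals the sequence duration, and it returns the duration in whatever unit the input times use. For roundtrips it takes one reading of the description above: each transition from a client packet (+1) to a server packet (-1) counts as one request/response pair.

```python
from typing import List, Tuple


def ppi_derived_fields(ppi: List[List[float]]) -> Tuple[int, float, int]:
    """Derive (length, duration, roundtrips) from one packet sequence.

    `ppi` follows the layout described in the text:
    [[inter-packet times], [packet directions], [packet sizes]].
    """
    ipts, directions, sizes = ppi
    if not (len(ipts) == len(directions) == len(sizes)):
        raise ValueError("the three PPI sub-sequences must have equal length")

    # Number of packets in the sequence.
    length = len(directions)

    # Assumption: the first inter-packet time is 0, so the sum is the duration
    # (in the same unit as the input times).
    duration = sum(ipts)

    # Assumption: a roundtrip is a client packet (+1) directly followed by a
    # server packet (-1), i.e. one request/response pair.
    roundtrips = sum(
        1
        for prev, cur in zip(directions, directions[1:])
        if prev == 1 and cur == -1
    )
    return length, duration, roundtrips


# Hypothetical six-packet sequence with two request/response exchanges.
example = [
    [0, 20, 5, 1, 30, 2],             # inter-packet times
    [1, -1, 1, 1, -1, -1],            # directions
    [1200, 1200, 80, 300, 1200, 900], # packet sizes
]
print(ppi_derived_fields(example))
```

Counting every sign change instead (both +1 to -1 and -1 to +1) would be an equally literal reading of "changes in the communication direction"; the values stored in the dataset should be treated as authoritative.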
Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information is available in the PHISTS plugin documentation). There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has an end reason: it was idle, it reached the active timeout, or it ended for other reasons. These reasons correspond to the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field covers the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.

Dataset structure

The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the provided table. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service, and with the total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file, which aggregates flow counts for a whole week, and the stats-dataset.json file, which aggregates flow counts for the entire dataset. Flow counts before sampling can be used to compute the sampling ratios of individual services and to resample the dataset back to the original service distribution. Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder.
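The use of pre-sampling flow counts to recover the original service distribution can be illustrated with a short sketch. The layout of the JSON statistics files is not described here, so the example works on two plain dictionaries of per-service counts. The dictionary contents, the service names, and the function names `sampling_ratios` and `resampling_weights` are all hypothetical.

```python
from typing import Dict


def sampling_ratios(seen: Dict[str, int], saved: Dict[str, int]) -> Dict[str, float]:
    """Fraction of each service's observed flows that were kept in the dataset."""
    return {
        service: saved[service] / seen[service]
        for service in saved
        if seen.get(service, 0) > 0
    }


def resampling_weights(seen: Dict[str, int], saved: Dict[str, int]) -> Dict[str, float]:
    """Per-flow weight that restores the original service distribution.

    Each saved flow of a service stands for seen/saved observed flows,
    which is the inverse of the sampling ratio.
    """
    return {
        service: seen[service] / saved[service]
        for service in saved
        if saved[service] > 0 and service in seen
    }


# Hypothetical counts: service-a was downsampled 10:1, service-b was kept whole.
seen_counts = {"service-a": 1000, "service-b": 200}
saved_counts = {"service-a": 100, "service-b": 200}

print(sampling_ratios(seen_counts, saved_counts))
print(resampling_weights(seen_counts, saved_counts))
```

The weights can be used either as sample weights during training and evaluation or as relative probabilities when drawing a resampled subset; which option fits depends on the task.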
The following table describes flow data fields in CSV files:

Column Name             Column Description
ID                      Unique identifier
SRC_IP                  Source IP address
DST_IP                  Destination IP address
DST_ASN                 Destination Autonomous System number
SRC_PORT                Source port
DST_PORT                Destination port
PROTOCOL                Transport protocol
QUIC_VERSION            QUIC protocol version
QUIC_SNI                Server Name Indication domain
QUIC_USER_AGENT         User agent string, if available in the QUIC Initial Packet
TIME_FIRST              Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
TIME_LAST               Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
DURATION                Duration of the flow in seconds
BYTES                   Number of transmitted bytes from client to server
BYTES_REV               Number of transmitted bytes from server to client
PACKETS                 Number of packets transmitted from client to server
PACKETS_REV             Number of packets transmitted from server to client
PPI                     Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
PPI_LEN                 Number of packets in the PPI sequence
PPI_DURATION            Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS          Number of roundtrips in the PPI sequence
PHIST_SRC_SIZES         Histogram of packet sizes from client to server
PHIST_DST_SIZES         Histogram of packet sizes from server to client
PHIST_SRC_IPT           Histogram of inter-packet times from client to server
PHIST_DST_IPT           Histogram of inter-packet times from server to client
APP                     Web service label
CATEGORY                Service category
FLOW_ENDREASON_IDLE     Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE   Flow was terminated because it reached the active timeout
FLOW_ENDREASON_OTHER    Flow was terminated for other reasons

Link to other CESNET datasets

https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/

Please cite the original data article:

@article{CESNETQUIC22,
author = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška},
title = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines},
journal = {Data in Brief},
pages = {108888},
year = {2023},
issn = {2352-3409},
doi = {https://doi.org/10.1016/j.dib.2023.108888},
url = {https://www.sciencedirect.com/science/article/pii/S2352340923000069}
}
Created:
2023-06-28
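Collected summary
Dataset introduction

Background and challenges
Background overview
CESNET-QUIC22 is a large dataset containing one month of QUIC network traffic, collected from an ISP backbone network that connects 500 large institutions. The dataset provides enriched flow information, including server names, packet-level information, QUIC versions, and user agents, and is suited to encrypted traffic classification and QUIC deployment research.
The above content was collected and summarized by 遇见数据集.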



