CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines
Source: Mendeley Data. Updated 2024-05-10; indexed 2024-06-28.
Download link:
https://zenodo.org/records/10728760
Resource description:
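For a full description of the data, please refer to the original data article: Jan Luxemburk et al., "CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines", *Data in Brief*, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888.
We recommend using the CESNET DataZoo Python library, which simplifies working with large network traffic datasets. More information about the DataZoo project is available in its GitHub repository: https://github.com/CESNET/cesnet-datazoo.
The QUIC (Quick UDP Internet Connections) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Because its design makes the inspection of QUIC handshakes challenging, and because it is used in HTTP/3, there is an increasing demand for research in QUIC traffic analysis.
This dataset contains one month of QUIC traffic collected in an ISP backbone network that connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information support research in encrypted traffic classification. Moreover, the included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.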
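### Data capture
The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capture ran for four weeks, from 2022-10-31 to 2022-11-27. The following list gives the uncompressed size, capture period, and flow count per week:
- W-2022-44: uncompressed size 19 GB, capture period 2022-10-31 to 2022-11-06, 32.6M flows
- W-2022-45: uncompressed size 25 GB, capture period 2022-11-07 to 2022-11-13, 42.6M flows
- W-2022-46: uncompressed size 20 GB, capture period 2022-11-14 to 2022-11-20, 33.7M flows
- W-2022-47: uncompressed size 25 GB, capture period 2022-11-21 to 2022-11-27, 44.1M flows
- CESNET-QUIC22 (total): uncompressed size 89 GB, capture period 2022-10-31 to 2022-11-27, 153M flows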
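As a quick sanity check, the per-week figures can be summed and compared with the dataset totals. The sketch below only restates the numbers from the list above; the variable names are illustrative.

```python
# Per-week figures copied from the list above:
# (uncompressed size in GB, number of flows in millions).
weeks = {
    "W-2022-44": (19, 32.6),
    "W-2022-45": (25, 42.6),
    "W-2022-46": (20, 33.7),
    "W-2022-47": (25, 44.1),
}

total_gb = sum(size for size, _ in weeks.values())
# Round to one decimal place to hide floating-point noise in the sum.
total_flows_m = round(sum(flows for _, flows in weeks.values()), 1)

print(f"total size: {total_gb} GB")
print(f"total flows: {total_flows_m} M")
```

Both sums agree with the totals reported for the whole dataset (89 GB and 153M flows).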
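### Data description
The dataset consists of network flows describing encrypted QUIC communications. Flows were created using the ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the version of the QUIC protocol in use, and the user agent string, which is available in a subset of QUIC communications.
#### Packet sequences
Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For packet sizes, we consider the payload size after transport headers (UDP headers in the QUIC case). Packet directions are encoded as ±1: +1 means a packet sent from client to server, and -1 a packet sent from server to client. Inter-packet times depend on the location of the communicating hosts, their distance, and the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time an API, server, or database needs to process the received data and generate the response sent in the next packet.
Packet metadata sequences have a length of 30, which is the default setting of the flow exporter used. We also derive three fields from each packet sequence: its length, its time duration, and the number of roundtrips. Roundtrips are counted as the number of changes in the communication direction (taken from the packet directions data); in other words, each client request and server response pair counts as one roundtrip.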
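The three derived fields can be illustrated with a minimal sketch. It assumes that a PPI value is stored as JSON-compatible text in the `[[inter-packet times], [packet directions], [packet sizes]]` layout, that inter-packet times are in milliseconds, and that a roundtrip is a client packet (+1) directly followed by a server packet (-1). None of these assumptions are confirmed by the description above, and the function name is hypothetical; the `PPI_LEN`, `PPI_DURATION`, and `PPI_ROUNDTRIPS` columns shipped with the dataset remain the reference values.

```python
import json


def summarize_ppi(ppi_text, ipt_units_per_second=1000):
    """Return (length, duration in seconds, roundtrips) for one PPI value.

    Assumption: inter-packet times are in milliseconds, hence the default
    of 1000 units per second.
    """
    ipt, directions, sizes = json.loads(ppi_text)
    length = len(sizes)
    duration_s = sum(ipt) / ipt_units_per_second
    # Assumption: one roundtrip is a client-to-server packet (+1) that is
    # immediately followed by a server-to-client packet (-1).
    roundtrips = sum(
        1
        for prev, cur in zip(directions, directions[1:])
        if prev == 1 and cur == -1
    )
    return length, duration_s, roundtrips


# Made-up example sequence with four packets.
example = "[[0, 20, 5, 30], [1, -1, 1, -1], [1200, 1200, 80, 500]]"
length, duration_s, roundtrips = summarize_ppi(example)
print(length, duration_s, roundtrips)
```

Counting every direction change instead of only request-to-response transitions would be an equally plausible reading of the definition and would give a larger value for the same sequence.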
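#### Flow statistics
Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms. Packet histograms contain binned counts of packet sizes and inter-packet times of the entire flow in both directions (see the PHISTS plugin documentation for more information). There are eight bins on a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes.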
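The binning can be sketched as follows. The bin edges are taken from the intervals listed above; how the exporter itself treats boundary values is not stated here, so the inclusive upper edges and the helper names are assumptions.

```python
from bisect import bisect_left

# Inclusive upper edges of the first seven bins; the eighth bin is ">1024".
UPPER_EDGES = [15, 31, 63, 127, 255, 511, 1024]


def bin_index(value):
    """Return the index (0-7) of the bin that contains value."""
    return bisect_left(UPPER_EDGES, value)


def histogram(values):
    """Count how many values fall into each of the eight bins."""
    counts = [0] * (len(UPPER_EDGES) + 1)
    for value in values:
        counts[bin_index(value)] += 1
    return counts


# Made-up packet sizes in bytes.
sizes_hist = histogram([40, 1200, 1200, 80])
print(sizes_hist)
```

The same function applies to inter-packet times when the values are given in milliseconds.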
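Moreover, each flow has an end reason: it was idle, it reached the active timeout, or it ended for other reasons. These reasons correspond to the flow end reason values specified by IANA for IPFIX. The `FLOW_ENDREASON_OTHER` field covers the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.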
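The mapping from IPFIX end reasons to the three dataset fields can be sketched as below. The numeric codes are the IANA-registered `flowEndReason` values; representing the dataset fields as booleans, and the function name itself, are assumptions made for illustration.

```python
# IANA-registered IPFIX flowEndReason values.
IDLE_TIMEOUT = 1
ACTIVE_TIMEOUT = 2
END_OF_FLOW_DETECTED = 3  # not used for UDP flows in this dataset
FORCED_END = 4
LACK_OF_RESOURCES = 5


def end_reason_flags(flow_end_reason):
    """Map one flowEndReason value to the three end-reason fields."""
    return {
        "FLOW_ENDREASON_IDLE": flow_end_reason == IDLE_TIMEOUT,
        "FLOW_ENDREASON_ACTIVE": flow_end_reason == ACTIVE_TIMEOUT,
        "FLOW_ENDREASON_OTHER": flow_end_reason
        in (FORCED_END, LACK_OF_RESOURCES),
    }


print(end_reason_flags(FORCED_END))
```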
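### Dataset structure
The dataset flows are delivered in compressed CSV files. Each CSV file contains one flow per row; the data columns are summarized in the list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service, together with the total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows.
There is also a `stats-week.json` file aggregating the flow counts of a whole week and a `stats-dataset.json` file aggregating the flow counts for the entire dataset. Flow counts before sampling can be used to compute the sampling ratios of individual services and to resample the dataset back to the original service distribution. Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the `dataset-statistics` folder.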
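Computing sampling ratios and resampling weights from the seen and saved counts might look like the sketch below. The service names, key names, and numbers are all made up for illustration; the actual layout of the JSON statistics files is not described here and should be checked against the files themselves.

```python
# Hypothetical per-service counts. The real JSON files may use different
# key names and nesting.
service_counts = {
    "service-a": {"seen": 1000, "saved": 250},
    "service-b": {"seen": 400, "saved": 400},
}

# Sampling ratio: the fraction of observed flows kept in the dataset.
ratios = {
    name: counts["saved"] / counts["seen"]
    for name, counts in service_counts.items()
}

# Weight per saved flow that restores the original service distribution.
weights = {name: 1 / ratio for name, ratio in ratios.items()}

for name in service_counts:
    print(name, ratios[name], weights[name])
```

Multiplying the saved count of a service by its weight gives back the number of flows seen before sampling.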
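The mapping between services and service providers is provided in the `servicemap.csv` file, which also includes the SNI domains used for ground-truth labeling. The following list describes the flow data fields in the CSV files:
- `ID`: Unique identifier
- `SRC_IP`: Source IP address
- `DST_IP`: Destination IP address
- `DST_ASN`: Destination Autonomous System number
- `SRC_PORT`: Source port
- `DST_PORT`: Destination port
- `PROTOCOL`: Transport protocol
- `QUIC_VERSION`: QUIC protocol version
- `QUIC_SNI`: Server Name Indication (SNI) domain
- `QUIC_USER_AGENT`: User agent string, if available in the QUIC Initial Packet
- `TIME_FIRST`: Timestamp of the first packet in the format YYYY-MM-DDTHH-MM-SS.ffffff
- `TIME_LAST`: Timestamp of the last packet in the format YYYY-MM-DDTHH-MM-SS.ffffff
- `DURATION`: Duration of the flow in seconds
- `BYTES`: Number of bytes transmitted from client to server
- `BYTES_REV`: Number of bytes transmitted from server to client
- `PACKETS`: Number of packets transmitted from client to server
- `PACKETS_REV`: Number of packets transmitted from server to client
- `PPI`: Packet metadata sequence in the format `[[inter-packet times], [packet directions], [packet sizes]]`
- `PPI_LEN`: Number of packets in the PPI sequence
- `PPI_DURATION`: Duration of the PPI sequence in seconds
- `PPI_ROUNDTRIPS`: Number of roundtrips in the PPI sequence
- `PHIST_SRC_SIZES`: Histogram of packet sizes from client to server
- `PHIST_DST_SIZES`: Histogram of packet sizes from server to client
- `PHIST_SRC_IPT`: Histogram of inter-packet times from client to server
- `PHIST_DST_IPT`: Histogram of inter-packet times from server to client
- `APP`: Web service label
- `CATEGORY`: Service category
- `FLOW_ENDREASON_IDLE`: Flow was terminated because it was idle
- `FLOW_ENDREASON_ACTIVE`: Flow was terminated because it reached the active timeout
- `FLOW_ENDREASON_OTHER`: Flow was terminated for other reasons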
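Reading these fields with the Python standard library could look like the following sketch. The sample row is made up and contains only a few of the columns. The timestamp parser tries the documented `YYYY-MM-DDTHH-MM-SS.ffffff` format first and falls back to a colon-separated variant, which is an assumption in case the files use the more common ISO notation.

```python
import csv
import io
from datetime import datetime

# Made-up sample with a subset of the documented columns.
SAMPLE_CSV = """ID,QUIC_SNI,TIME_FIRST,TIME_LAST,BYTES,BYTES_REV,APP
1,example.org,2022-10-31T10-15-00.000000,2022-10-31T10-15-02.500000,1500,42000,example-app
"""

# Documented format first; the colon variant is an assumed fallback.
TIME_FORMATS = ("%Y-%m-%dT%H-%M-%S.%f", "%Y-%m-%dT%H:%M:%S.%f")


def parse_time(text):
    """Parse a flow timestamp, trying each known format in turn."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text!r}")


def read_flows(handle):
    """Yield one small dict per CSV row with a few derived values."""
    for row in csv.DictReader(handle):
        first = parse_time(row["TIME_FIRST"])
        last = parse_time(row["TIME_LAST"])
        yield {
            "sni": row["QUIC_SNI"],
            "app": row["APP"],
            "duration_s": (last - first).total_seconds(),
            "total_bytes": int(row["BYTES"]) + int(row["BYTES_REV"]),
        }


flows = list(read_flows(io.StringIO(SAMPLE_CSV)))
print(flows[0])
```

For the real files, the `io.StringIO` handle would be replaced by a text-mode handle on the decompressed CSV; the compression format should be checked against the downloaded archive.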
### Other resources and citation
Links to other CESNET datasets:
https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/
https://github.com/CESNET/cesnet-datazoo
Please cite the original data article:
```bibtex
@article{CESNETQUIC22,
  author = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška},
  title = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines},
  journal = {Data in Brief},
  pages = {108888},
  year = {2023},
  issn = {2352-3409},
  doi = {https://doi.org/10.1016/j.dib.2023.108888},
  url = {https://www.sciencedirect.com/science/article/pii/S2352340923000069}
}
```
Created: 2024-03-02



