Exposing TCP watermarks in the Tor network using deep learning

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/12793191


Description

The datasets consist of Tor network flows captured with the packet analyser Wireshark and converted to CSV format. The flows are file transfers of images ranging in size from a few kilobytes to several megabytes. The captured packets travel from the entry guard of the connection to the client.

The datasets contain both clean Tor traffic and "watermarked" Tor traffic. Watermarking is a method of leaving small prints on a network flow at the sender end and trying to detect them at the receiving end. A positive detection indicates a connection between the two parties, which breaks the anonymity that Tor is meant to provide.

The watermarking algorithms used are "Interval-based watermarking" (IBW), presented by Pyun et al. [1] in 2007, and "Scalable watermark that is invisible and resilient to packet losses" (SWIRL), presented by Houmansadr and Borisov [2] in 2011. The algorithms were implemented with a watermarking module, which is essentially a modified TCP/IP stack. The module is publicly available [3].

The IBW-watermarked data is produced in the following way. The watermarked training data is encoded with the bit string {0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1} by introducing the following repeating delays (milliseconds):

{0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0}.
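The training bit string and delay sequence above appear to follow a simple mapping: each 1 bit contributes one five-entry group that starts with a 1000 ms delay, and each 0 bit contributes an all-zero five-entry group followed by that same delay group. This mapping is inferred from the listed values and is not stated by the dataset authors. A minimal Python sketch under that assumption, with hypothetical names (`ibw_delays`, `DELAY_GROUP`, `EMPTY_GROUP`):

```python
# Assumption (inferred from the listed sequences, not documented by the authors):
#   bit 1 -> one group of five entries with the 1000 ms delay first
#   bit 0 -> an all-zero group of five entries, then the same delay group
DELAY_GROUP = [1000, 0, 0, 0, 0]
EMPTY_GROUP = [0, 0, 0, 0, 0]

# Training bit string as given in the dataset description.
TRAIN_BITS = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1]


def ibw_delays(bits):
    """Expand a bit string into a flat list of delays in milliseconds."""
    delays = []
    for bit in bits:
        if bit == 0:
            # A zero bit is padded with an empty group before the delay group.
            delays.extend(EMPTY_GROUP)
        delays.extend(DELAY_GROUP)
    return delays


train_delays = ibw_delays(TRAIN_BITS)
print(len(train_delays))          # number of entries in one repetition
print(train_delays.count(1000))   # one 1000 ms delay per encoded bit
```

Under this assumed mapping, the expansion of the training bit string matches the training delay sequence listed above entry for entry.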
The watermarked test data is encoded with the bit string {0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0} by introducing the following repeating delays (milliseconds):

{0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,0,1000,0,0,0,0, 1000,0,0,0,0, 1000,0,0,0,0, 0,0,0,0,1000,0,0,0,0, 0,0,0,0,1000,0,0,0,0}.

The SWIRL-watermarked data is produced in the following way. The watermarked training data is encoded with the repeating delay string (milliseconds) {100, 0, 250, 250, 100, 250, 250, 100, 0, 0, 250, 0, 0, 100, 250, 0, 0, 250, 0, 250}, which mimics a SWIRL permutation. The watermarked test data is encoded with the repeating delay string (milliseconds) {100, 100, 300, 300, 300, 100, 300, 100, 0, 0, 100, 100, 0, 0, 300, 0, 100, 100, 300, 0}, which also mimics a SWIRL permutation.

The datasets were collected as part of a master's thesis at Tampere University and are used primarily in neural network classification tests. The datasets intended for neural network training are longer and have names ending in "_train.csv". The datasets intended for testing are shorter and have names ending in "_test.csv".

References

[1] Y. J. Pyun, Y. H. Park, X. Wang, D. S. Reeves and P. Ning. "Tracing Traffic through Intermediate Hosts that Repacketize Flows," IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, Anchorage, AK, USA, 2007.
[2] A. Houmansadr and N. Borisov. "SWIRL: A Scalable Watermark to Detect Correlated Network Flows," Network and Distributed System Security Symposium, San Diego, United States, 2011.
[3] https://gitlab.com/nisec/tcp-watermark/
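Because both watermarking schemes described above work by delaying packets, the gap between consecutive packets is a natural first feature to extract from the CSV files. The sketch below is only an illustration. It assumes the files keep a Wireshark-style `Time` column holding seconds since the start of the capture, which this description does not confirm. The function name `inter_arrival_times` and the sample rows are hypothetical.

```python
import csv
import io


def inter_arrival_times(csv_file, time_column="Time"):
    """Return the gaps (in seconds) between consecutive packets.

    csv_file is any open text file or file-like object with a header row.
    The column name "Time" is an assumption about the export format.
    """
    reader = csv.DictReader(csv_file)
    times = [float(row[time_column]) for row in reader]
    # Pair each timestamp with the next one and take the difference.
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical three-packet capture standing in for a real "_train.csv" file.
sample = io.StringIO(
    '"No.","Time","Length"\n'
    '"1","0.000","590"\n'
    '"2","0.250","590"\n'
    '"3","1.250","1500"\n'
)
gaps = inter_arrival_times(sample)
print(gaps)
```

For a real file, the same function could be called as `inter_arrival_times(open(path, newline=""))`, with `time_column` adjusted to whatever header the files actually use.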


Created: 2024-08-07
