Tor-nonTor dataset
DataCite Commons · Updated 2020-09-20 · Indexed 2025-04-09
Download link:
https://www.impactcybertrust.org/dataset_view?idDataset=921
Description:
To ensure the quantity and diversity of this CIC dataset, we defined a set of tasks to generate a representative dataset of real-world traffic. We created three users for the browser traffic collection and two users for the communication tasks such as chat, mail, FTP, and P2P. For the non-Tor traffic we reused benign traffic from the earlier VPN project, and for the Tor traffic we captured the eight traffic categories listed below:
Browsing: Under this label we have HTTP and HTTPS traffic generated by users while browsing (Firefox and Chrome).
Email: Traffic samples generated using a Thunderbird client and two Gmail accounts (Alice and Bob). The clients were configured to deliver mail through SMTP/S, and to receive it using POP3/SSL in one client and IMAP/SSL in the other.
Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook and Hangouts via web browser, Skype, and AIM and ICQ using an application called Pidgin.
Audio-Streaming: The streaming label identifies audio applications that require a continuous and steady stream of data. We captured traffic from Spotify.
Video-Streaming: The streaming label identifies video applications that require a continuous and steady stream of data. We captured traffic from YouTube (HTML5 and Flash versions) and Vimeo services using Chrome and Firefox.
FTP: This label identifies applications whose main purpose is to send or receive files and documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over SSL (FTPS) traffic sessions.
VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we captured voice calls using Facebook, Hangouts and Skype.
P2P: This label identifies file-sharing protocols such as BitTorrent. To generate this traffic we downloaded different .torrent files from the Kali Linux distribution and captured traffic sessions using the Vuze application. We also used different combinations of upload and download speeds.
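The category list above can be written down as a simple lookup table. The following is a minimal sketch, not part of the dataset: the dictionary layout and the helper name `categories_for` are assumptions made for illustration. It shows that one application (Skype, for example) appears under several labels, so the label follows the capture task rather than the application alone.

```python
# Traffic categories and the applications or protocols named for each,
# transcribed from the list above. The dict layout and the helper name
# are illustrative assumptions, not part of the dataset.
TOR_CATEGORIES = {
    "Browsing": ["Firefox", "Chrome"],
    "Email": ["Thunderbird", "Gmail"],
    "Chat": ["Facebook", "Hangouts", "Skype", "AIM", "ICQ"],
    "Audio-Streaming": ["Spotify"],
    "Video-Streaming": ["YouTube", "Vimeo"],
    "FTP": ["Skype", "SFTP", "FTPS"],
    "VoIP": ["Facebook", "Hangouts", "Skype"],
    "P2P": ["Vuze"],
}


def categories_for(application: str) -> list[str]:
    """Return every category whose list names `application` (case-insensitive)."""
    wanted = application.lower()
    return [
        category
        for category, apps in TOR_CATEGORIES.items()
        if wanted in (app.lower() for app in apps)
    ]


if __name__ == "__main__":
    # Skype is listed under more than one category.
    print(categories_for("Skype"))
```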
The traffic was captured using Wireshark and tcpdump, generating a total of 22 GB of data. To facilitate the labelling process, as explained in the related published paper, we captured the outgoing traffic at the workstation and at the gateway simultaneously, collecting pairs of .pcap files: one regular traffic pcap (workstation) and one Tor traffic pcap (gateway).
Later, we labelled the captured traffic in two steps. First, we processed the .pcap files captured at the workstation: we extracted the flows, and we confirmed that the majority of traffic flows were generated by application X (Skype, ftps, etc.), the object of the traffic capture. Then, we labelled all flows from the Tor .pcap file as X.
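The two labelling steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' tooling: the `Flow` record, its fields, and both function names are hypothetical, and how the application behind each workstation flow is identified is left outside the sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flow:
    """Hypothetical flow record; the field names are assumptions."""
    src: str
    dst: str
    app: Optional[str] = None    # known only for workstation-side flows
    label: Optional[str] = None  # filled in for gateway-side (Tor) flows


def dominant_application(workstation_flows: list[Flow]) -> str:
    """Step 1: confirm that one application generated the majority of flows."""
    counts = Counter(f.app for f in workstation_flows if f.app is not None)
    if not counts:
        raise ValueError("no identified flows in the workstation capture")
    app, n = counts.most_common(1)[0]
    if n * 2 <= len(workstation_flows):
        raise ValueError("no application accounts for a majority of flows")
    return app


def label_tor_flows(workstation_flows: list[Flow], tor_flows: list[Flow]) -> str:
    """Step 2: label every flow of the paired Tor capture as that application."""
    app = dominant_application(workstation_flows)
    for flow in tor_flows:
        flow.label = app
    return app


if __name__ == "__main__":
    workstation = [Flow("ws", "peer", app="Skype") for _ in range(3)]
    workstation.append(Flow("ws", "resolver", app="DNS"))
    tor = [Flow("gw", "relay") for _ in range(2)]
    print(label_tor_flows(workstation, tor), [f.label for f in tor])
```

The majority check is the reason for the paired capture: the gateway-side flows are Tor traffic and carry no application information of their own, so the label has to come from the workstation side.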
cic@unb.ca; a.habibi.l@unb.ca
Provider:
IMPACT
Created:
2018-10-25



