yashika0998/iot-23-preprocessed

Name: yashika0998/iot-23-preprocessed
Creator: yashika0998
Published: 2023-12-01 18:27:06
License: No description available

Hugging Face | Updated 2023-12-01 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/yashika0998/iot-23-preprocessed

Resource description:

---
dataset_info:
  features:
  - name: id.orig_p
    dtype: int64
  - name: id.resp_p
    dtype: int64
  - name: proto
    dtype: string
  - name: service
    dtype: string
  - name: duration
    dtype: float64
  - name: orig_bytes
    dtype: int64
  - name: resp_bytes
    dtype: int64
  - name: conn_state
    dtype: string
  - name: missed_bytes
    dtype: int64
  - name: history
    dtype: string
  - name: orig_pkts
    dtype: int64
  - name: orig_ip_bytes
    dtype: int64
  - name: resp_pkts
    dtype: int64
  - name: resp_ip_bytes
    dtype: int64
  - name: label
    dtype: string
  splits:
  - name: train
    num_bytes: 93994789
    num_examples: 819024
  download_size: 11805369
  dataset_size: 93994789
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
- tabular-classification
language:
- en
tags:
- code
pretty_name: d
---

# Aposemat IoT-23 - a Labeled Dataset with Malicious and Benign IoT Network Traffic

**Homepage:** [https://www.stratosphereips.org/datasets-iot23](https://www.stratosphereips.org/datasets-iot23)

This dataset contains a subset of the data from 20 captures of malicious network traffic and 3 captures of live benign traffic on Internet of Things (IoT) devices. Created by Sebastian Garcia, Agustin Parmisano, & Maria Jose Erquiaga at the Avast AIC laboratory with funding from Avast Software, this dataset is one of the best in the field for Intrusion Detection Systems (IDS) for IoT devices [(Comparative Analysis of IoT Botnet Datasets)](https://doi.org/10.53070/bbd.1173687).

The subset was selected by [Aqeel Ahmed on Kaggle](https://www.kaggle.com/datasets/engraqeel/iot23preprocesseddata) and contains 6 million samples. Neither the Kaggle upload nor this one employs data balancing. The Kaggle card does not describe the methodology, so the criteria used to select these samples are unknown. If you want to ensure best practice, use this dataset to mock up the processing of the data into a model before using the full dataset with data balancing. This will require processing the 8GB of conn.log.labelled files.

This dataset only notes whether the data is Malicious or Benign. The original dataset also labels the type of malicious traffic. This means this processing of the dataset is only suited for binary classification.

# Feature information:

All features originate from the [Zeek](https://docs.zeek.org/en/master/scripts/base/protocols/conn/main.zeek.html#type-Conn::Info) processing performed by the dataset creators. [See notes here for caveats for each column](https://docs.zeek.org/en/master/scripts/base/protocols/conn/main.zeek.html#type-Conn::Info).

<details>
<summary>Expand for feature names, descriptions, and datatypes</summary>

Name: id.orig_p
Description: The originator’s port number.
Data type: int64 (uint64 in original)

Name: id.resp_p
Description: The responder’s port number.
Data type: int64 (uint64 in original)

Name: proto
Description: The transport layer protocol of the connection.
Data type: string, enum(unknown_transport, tcp, udp, icmp). Only TCP and UDP in subset

Name: service
Description: An identification of an application protocol being sent over the connection.
Data type: optional string

Name: duration
Description: How long the connection lasted.
Data type: optional float64 (time interval)

Name: orig_bytes
Description: The number of payload bytes the originator sent.
Data type: optional int64 (uint64 in original)

Name: resp_bytes
Description: The number of payload bytes the responder sent.
Data type: optional int64 (uint64 in original)

Name: conn_state
Description: Value indicating connection state. (S0, S1, SF, REJ, S2, S3, RSTO, RSTR, RSTOS0, RSTRH, SH, SHR, OTH)
Data type: optional string

Name: missed_bytes
Description: Indicates the number of bytes missed in content gaps, which is representative of packet loss.
Data type: optional int64 (uint64 in original), default = 0

Name: history
Description: Records the state history of connections as a string of letters.
Data type: optional string

Name: orig_pkts
Description: Number of packets that the originator sent.
Data type: optional int64 (uint64 in original)

Name: orig_ip_bytes
Description: Number of IP level bytes that the originator sent.
Data type: optional int64 (uint64 in original)

Name: resp_pkts
Description: Number of packets that the responder sent.
Data type: optional int64 (uint64 in original)

Name: resp_ip_bytes
Description: Number of IP level bytes that the responder sent.
Data type: optional int64 (uint64 in original)

Name: label
Description: Specifies if the data point is benign or some form of malicious. See the dataset creators' paper for descriptions of attack types.
Data type: string, enum(Malicious, Benign)

NOTE: ts, uid, id.orig_h, id.resp_h have been removed as they are dataset specific. Models should not be trained with specific timestamps or IP addresses (id.orig_h) using this dataset, as that can lead to overfitting to dataset-specific times and addresses. Further, local_orig and local_resp have been removed as they are null in all rows, so they are useless for training.

</details>

## Citation

If you are using this dataset for your research, please reference it as “Sebastian Garcia, Agustin Parmisano, & Maria Jose Erquiaga. (2020). IoT-23: A labeled dataset with malicious and benign IoT network traffic (Version 1.0.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4743746”

Provider:

yashika0998

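Original information summary

Dataset overview

Dataset information

Feature list:
- id.orig_p: the originator's port number; data type int64.
- id.resp_p: the responder's port number; data type int64.
- proto: the transport layer protocol of the connection; data type string.
- service: identification of the application protocol sent over the connection; data type string.
- duration: how long the connection lasted; data type float64.
- orig_bytes: number of payload bytes the originator sent; data type int64.
- resp_bytes: number of payload bytes the responder sent; data type int64.
- conn_state: value indicating the connection state; data type string.
- missed_bytes: number of bytes missed in content gaps, representative of packet loss; data type int64.
- history: state history of the connection; data type string.
- orig_pkts: number of packets the originator sent; data type int64.
- orig_ip_bytes: number of IP-level bytes the originator sent; data type int64.
- resp_pkts: number of packets the responder sent; data type int64.
- resp_ip_bytes: number of IP-level bytes the responder sent; data type int64.
- label: whether the data point is benign or some form of malicious; data type string.
Data splits:
- train: training set, 93,994,789 bytes, 819,024 examples.
Dataset size:
- Download size: 11,805,369 bytes.
- Dataset size: 93,994,789 bytes.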
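The size figures above can be turned into two quick sanity numbers: the average stored size of one example and the ratio between the dataset size and the download size. The sketch below only does arithmetic on the published metadata; the reading of `dataset_size` as the uncompressed table and `download_size` as the compressed files on the hub is an assumption, and the function names are illustrative.

```python
# Figures copied from the dataset metadata listed above.
NUM_BYTES = 93_994_789      # train num_bytes (equal to dataset_size)
NUM_EXAMPLES = 819_024      # train num_examples
DOWNLOAD_SIZE = 11_805_369  # download_size


def bytes_per_example(num_bytes: int, num_examples: int) -> float:
    """Average number of stored bytes per example."""
    return num_bytes / num_examples


def size_ratio(dataset_size: int, download_size: int) -> float:
    """How many times larger the dataset is than the download."""
    return dataset_size / download_size


avg_row = bytes_per_example(NUM_BYTES, NUM_EXAMPLES)
ratio = size_ratio(NUM_BYTES, DOWNLOAD_SIZE)
print(f"average bytes per example: {avg_row:.1f}")
print(f"dataset size / download size: {ratio:.2f}")
```

The result is a little under 115 bytes per example and a ratio of just under 8, which is plausible for 14 mostly numeric columns plus a short label string.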

Configuration

Default configuration:
- Data file path: data/train-*.

Task categories

Question answering
Tabular classification

Language

English

Dataset name

Collected summary

Dataset introduction

Construction

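In IoT security research, a dataset needs to be both realistic and representative. This dataset derives from the original Aposemat IoT-23 data, captured by Sebastian Garcia and colleagues at the Avast AIC laboratory, which comprises 20 malicious and 3 benign traffic captures. The subset was selected by Aqeel Ahmed on Kaggle from the original conn.log.labelled files (about 8 GB) and is described as containing about 6 million samples. Its 15 columns (14 connection features plus the label) come from the Zeek processing performed by the dataset creators and cover ports, transport protocol, byte and packet counts, and connection state. Timestamps, connection UIDs, and IP addresses were removed because they are specific to the captures and can lead a model to overfit.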
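The column removal and label collapsing described above can be sketched on a single Zeek-style record. This is a minimal illustration, not the uploader's actual preprocessing: the `-` unset marker follows Zeek's default ASCII log output, the rule "anything that is not Benign counts as Malicious" is an assumption, and the sample record is invented.

```python
from typing import Dict, Optional

# Columns the dataset card says were removed from the conn.log records.
DROPPED = {"ts", "uid", "id.orig_h", "id.resp_h", "local_orig", "local_resp"}

# Assumption: "-" marks an unset value, as in Zeek's default ASCII logs.
UNSET = "-"


def clean_record(raw: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Drop capture-specific columns and map unset markers to None."""
    return {
        key: (None if value == UNSET else value)
        for key, value in raw.items()
        if key not in DROPPED
    }


def to_binary_label(label: str) -> str:
    """Collapse every non-benign label to 'Malicious' (assumed rule)."""
    return "Benign" if label.strip().lower() == "benign" else "Malicious"


# Invented example record; the values are not taken from the dataset.
raw = {
    "ts": "1545403816.96",
    "uid": "CexampleUid",
    "id.orig_h": "192.0.2.10",
    "id.orig_p": "41040",
    "id.resp_h": "198.51.100.7",
    "id.resp_p": "23",
    "proto": "tcp",
    "service": "-",
    "conn_state": "S0",
    "label": "Malicious",
}
cleaned = clean_record(raw)
cleaned["label"] = to_binary_label(raw["label"])
print(cleaned)
```

Keeping the dropped column names in one set makes it easy to audit that no timestamp or address field reaches the model.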

Features

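The dataset targets network intrusion detection in IoT settings. All features come from Zeek's standard connection log: originator and responder ports, transport protocol, packet and byte counts, and similar structured fields, with each record labeled Malicious or Benign. Unlike the original dataset, which also labels the attack type, this version collapses the labels and is therefore suited to binary classification. All-null fields (local_orig, local_resp) and capture-specific fields were dropped, which removes columns that carry no training signal or that invite overfitting.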
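Because several of these fields are strings (proto, service, conn_state, history), most models need them encoded as numbers first. The sketch below shows one common option, one-hot encoding, using only the standard library; the helper names and the three toy rows are illustrative and not part of the dataset or its card.

```python
from typing import Iterable, List, Optional


def build_vocab(values: Iterable[Optional[str]]) -> List[str]:
    """Sorted list of the distinct non-null categories seen in the data."""
    return sorted({v for v in values if v is not None})


def one_hot(value: Optional[str], vocab: List[str]) -> List[int]:
    """One slot per known category; null or unseen values give all zeros."""
    return [1 if value == category else 0 for category in vocab]


# Toy rows for illustration only.
rows = [
    {"proto": "tcp", "conn_state": "S0"},
    {"proto": "udp", "conn_state": "SF"},
    {"proto": "tcp", "conn_state": "REJ"},
]

proto_vocab = build_vocab(r["proto"] for r in rows)
state_vocab = build_vocab(r["conn_state"] for r in rows)
encoded = [
    one_hot(r["proto"], proto_vocab) + one_hot(r["conn_state"], state_vocab)
    for r in rows
]
print(proto_vocab, state_vocab)
print(encoded)
```

One-hot encoding fits low-cardinality columns such as proto and conn_state; a free-form column like history may need a different treatment, for example counting individual letters.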

Usage

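The dataset is suited to developing and evaluating binary classifiers for IoT security. It can be loaded directly from Hugging Face, which provides a single train split; any validation or test split has to be created by the user. As the dataset card recommends, use this subset to prototype the processing pipeline first, then move to the full original data with data balancing. Fields are numeric or string-valued; the string fields (proto, service, conn_state, history, label) need encoding before they can be fed to most classical machine-learning algorithms or neural networks. The attack subtype is not included, so only overall malicious-versus-benign detection is supported. When citing the dataset in research, use the reference format given in the dataset card.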
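A minimal loading sketch, assuming the third-party `datasets` library is installed and the hub is reachable. Since only a train split is published, the second helper draws a reproducible hold-out split by index; the function names, the 20% test fraction, and the seed are illustrative choices, not recommendations from the dataset card.

```python
import random
from typing import List, Tuple

DATASET_ID = "yashika0998/iot-23-preprocessed"


def load_train_split():
    """Download the train split (needs `pip install datasets` and network)."""
    from datasets import load_dataset  # lazy import: third-party dependency

    return load_dataset(DATASET_ID, split="train")


def holdout_indices(
    n: int, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices reproducibly and split them into train and test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


# Small demonstration that does not touch the network.
train_idx, test_idx = holdout_indices(10, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

With the real data, `ds = load_train_split()` followed by `ds.select(train_idx)` and `ds.select(test_idx)` yields the two subsets; the library's own `ds.train_test_split(test_size=0.2, seed=0)` is an alternative.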

Background and challenges

Background overview

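In IoT security research, labeled network traffic is essential for building effective intrusion detection systems. The IoT-23 dataset was created in 2020 by Sebastian Garcia, Agustin Parmisano, and Maria Jose Erquiaga at the Avast AIC laboratory, with funding from Avast Software. It focuses on distinguishing malicious from benign network traffic on IoT devices. By combining 20 malicious captures with 3 benign captures, it provides a benchmark for developing and evaluating intrusion detection algorithms and has become a frequently cited resource in the field.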

Current challenges

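This dataset addresses the detection of malicious behavior in IoT network traffic: separating malicious from benign connections using connection-level features, which requires models that cope with imbalanced class distributions and varied protocol patterns. On the construction side, the main challenges lie in preprocessing. The original data comprises about 8 GB of conn.log.labelled files that must be filtered and converted into features, and the subset was selected by a third party without a documented methodology, which may introduce bias. Timestamps and IP addresses were removed to avoid overfitting, which leaves fewer signals for feature engineering and requires care in balancing the data without hurting generalization.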
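One simple way to account for class imbalance without resampling is to weight each class inversely to its frequency. The sketch below uses the common heuristic `n_samples / (n_classes * count)`; the function name is illustrative and the 8-to-2 label mix is synthetic, not the dataset's real class distribution.

```python
from collections import Counter
from typing import Dict, Iterable


def balanced_class_weights(labels: Iterable[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Synthetic labels for illustration only.
labels = ["Malicious"] * 8 + ["Benign"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so each class contributes equally to a weighted loss. Running the same function on the real label column shows how skewed the subset actually is.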

Common scenarios

Typical use cases

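In IoT security research, the yashika0998/iot-23-preprocessed dataset supports the development of intrusion detection systems. It is built from traffic captured on real IoT devices and contains both malicious and benign samples, and its typical use is building and evaluating machine-learning binary classifiers. Researchers use connection features such as protocol type, byte and packet counts, and connection state to train models that separate attack traffic from normal activity, as a basis for real-time threat detection in IoT environments.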
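When evaluating such a binary classifier on imbalanced traffic, accuracy alone can mislead, so precision, recall, and F1 for the Malicious class are usually reported as well. The following is a minimal standard-library sketch; the function name and the five toy predictions are illustrative.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "Malicious"
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with `positive` as the target class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels and predictions for illustration only.
y_true = ["Malicious", "Malicious", "Malicious", "Benign", "Benign"]
y_pred = ["Malicious", "Malicious", "Benign", "Malicious", "Benign"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```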

Derived related work

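Work described as building on this dataset includes deep-learning anomaly detectors, such as combined LSTM and CNN architectures that use sequential traffic features to improve detection accuracy, and federated-learning schemes for distributed, privacy-preserving training that address IoT data silos. The summary attributes these results to top venues such as IEEE Security and ACM CCS but names no specific papers, and says they have advanced lightweight, interpretable detection algorithms and cross-platform threat-intelligence sharing.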

Recent research on the dataset