
lacg030175/UNSW-NB15

Hugging Face | Updated 2026-04-13 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/lacg030175/UNSW-NB15
Description:
---
language:
- en
license: cc-by-4.0
size_categories:
- 1M<n<10M
task_categories:
- tabular-classification
tags:
- network-intrusion-detection
- cybersecurity
- UNSW-NB15
- IDS
- binary-classification
- multi-class-classification
pretty_name: UNSW-NB15 Network Intrusion Detection
configs:
- config_name: temporal_3way
  data_files:
  - split: train
    path: temporal_3way/train-*
  - split: test
    path: temporal_3way/test-*
  - split: validation
    path: temporal_3way/validation-*
  default: true
- config_name: random_3way
  data_files:
  - split: train
    path: random_3way/train-*
  - split: test
    path: random_3way/test-*
  - split: validation
    path: random_3way/validation-*
- config_name: temporal
  data_files:
  - split: train
    path: temporal/train-*
  - split: test
    path: temporal/test-*
- config_name: standard
  data_files:
  - split: train
    path: standard/train-*
  - split: test
    path: standard/test-*
- config_name: random
  data_files:
  - split: train
    path: random/train-*
  - split: test
    path: random/test-*
---

# UNSW-NB15 Network Intrusion Detection Dataset

The [UNSW-NB15](https://research.unsw.edu.au/projects/unsw-nb15-dataset) dataset for network intrusion detection, provided with **two evaluation protocols** to enable fair comparison across the literature.

## Why This Dataset Exists

Published results on UNSW-NB15 range from **85% to 99% accuracy**, but the gap is almost entirely due to **evaluation protocol differences**, not model quality:

| Evaluation Protocol | Typical Accuracy | Example Papers |
|---|---|---|
| **Standard split** (temporal, 175K/82K) | 85–93% | Most published results |
| **Random split** (from full 2.28M records) | 97–99.6% | FWIW, some deep learning papers |

The random split achieves higher accuracy because:

1. **30% of records are duplicates**: random splitting leaks near-identical flows into both train and test.
2. **No temporal shift**: the standard split separates the training and testing periods in time, whereas a random split mixes them.

Both protocols are valid for different purposes:

- **Standard split**: Realistic deployment scenario (train on past, test on future)
- **Random split**: Model-to-model comparison with temporal shift controlled for

## Configurations

### `temporal` (default): Original Temporal Split

> **Note:** `standard` is an alias for `temporal`; both load the same data.

The official train/test CSV files from UNSW-NB15, containing **44 features** (all features except `id`).

```python
from datasets import load_dataset

ds = load_dataset("lacg030175/UNSW-NB15", "temporal")  # or "standard" (alias)
# ds["train"]: 175,341 rows
# ds["test"]:  82,332 rows
```

- Source: `UNSW_NB15_training-set.csv` and `UNSW_NB15_testing-set.csv`
- Used by most published papers for benchmarking
- Temporal separation between train and test periods

### `random`: Deduplicated Random Split

Full dataset (2.28M records) with duplicates removed, split 90/10 with `random_state=0` and stratified by label. Contains **49 features** including IP addresses and ports.
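As a rough illustration of the deduplicate-then-stratify procedure described for the `random` config, the sketch below applies pandas `drop_duplicates()` and scikit-learn `train_test_split` to a toy DataFrame. The helper name `dedup_and_split`, the toy column values, and the overlap check are assumptions made for illustration; this is not the script that produced the published split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def dedup_and_split(df, label_col="label", test_size=0.1, seed=0):
    """Drop exact duplicate rows, then make a stratified train/test split.

    Hypothetical helper: parameter defaults mirror the values quoted in the
    card (test_size=0.1, random_state=0), nothing else is taken from it.
    """
    deduped = df.drop_duplicates().reset_index(drop=True)
    train, test = train_test_split(
        deduped,
        test_size=test_size,
        random_state=seed,
        stratify=deduped[label_col],
    )
    return deduped, train, test


# Toy stand-in for the raw records: 40 distinct flows plus 10 exact repeats.
rows = [
    {"dur": i * 0.1, "sbytes": 100 + i, "label": int(i % 4 == 0)}
    for i in range(40)
]
raw = pd.DataFrame(rows + rows[:10])

deduped, train, test = dedup_and_split(raw)

# With exact duplicates removed first, no row can land in both splits.
overlap = pd.merge(train, test, how="inner")
print(len(raw), len(deduped), len(train), len(test), len(overlap))
```

Deduplicating before splitting matters here: if the split came first, a repeated flow could end up on both sides, which is the leakage the card attributes to naive random splits.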
```python
from datasets import load_dataset

ds = load_dataset("lacg030175/UNSW-NB15", "random")
# ds["train"]: 1,425,833 rows
# ds["test"]:  158,426 rows
```

- Source: All four raw UNSW-NB15 CSV files via [Mouwiya/UNSW-NB15](https://huggingface.co/datasets/Mouwiya/UNSW-NB15)
- Preprocessing: `drop_duplicates()` reduces 2,280,090 → 1,584,259 records
- Split: `train_test_split(test_size=0.1, random_state=0, stratify=label)`
- Comparable to the FWIW evaluation protocol (Susskind et al., 2023)

## Baseline Results

| Model | Standard Split | Random Split |
|---|---|---|
| Random Forest | 87.2% | 99.6% |
| XGBoost | 87.3% | 99.4% |
| FWIW WNN (paper) | n/a | 98.5% |

## Labels

- **Binary** (`label`): 0 = Normal, 1 = Attack
- **Multi-class** (`attack_cat`): Normal, Analysis, Backdoor, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, Worms

### Class Distribution

**Standard split (train):**

- Normal: 56,000 (32%) | Attack: 119,341 (68%)

**Random split (after dedup):**

- Normal: 1,523,904 (96%) | Attack: 60,355 (4%)

## Features

Both configs include flow-level network features:

| Category | Features | Examples |
|---|---|---|
| Flow | 6 | dur, proto, state, service, sbytes, dbytes |
| Content | 8 | sttl, dttl, sloss, dloss, sload, dload, Spkts, Dpkts |
| Time | 6 | Sjit, djit, Sintpkt, Dintpkt, tcprtt, synack |
| Additional | 13 | ct_srv_src, ct_dst_ltm, is_sm_ips_ports, ... |
| Generated | 5+ | trans_depth, res_bdy_len, swin, dwin, ... |

The `random` config additionally includes: `srcip`, `dstip`, `sport`, `dsport`, `Stime`, `Ltime`.
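The record counts quoted in this card can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the constant names and the `share` helper are hypothetical and exist only for this check.

```python
# Counts quoted in this card.
RAW_RECORDS = 2_280_090
DEDUPED_RECORDS = 1_584_259
RANDOM_TRAIN, RANDOM_TEST = 1_425_833, 158_426
STANDARD_NORMAL, STANDARD_ATTACK = 56_000, 119_341
RANDOM_NORMAL, RANDOM_ATTACK = 1_523_904, 60_355


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Fraction of raw records removed by deduplication (about 30%).
duplicate_share = share(RAW_RECORDS - DEDUPED_RECORDS, RAW_RECORDS)

# Fraction of the deduplicated data held out for testing (about 10%).
test_share = share(RANDOM_TEST, RANDOM_TRAIN + RANDOM_TEST)

# Attack prevalence in each protocol (about 68% vs. about 4%).
standard_attack_share = share(STANDARD_ATTACK, STANDARD_NORMAL + STANDARD_ATTACK)
random_attack_share = share(RANDOM_ATTACK, RANDOM_NORMAL + RANDOM_ATTACK)

print(duplicate_share, test_share, standard_attack_share, random_attack_share)
```

The large difference in attack prevalence between the two protocols is worth keeping in mind when comparing accuracy figures, since a classifier that always predicts Normal would already score about 96% on the deduplicated random split.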
## Citation

If you use this dataset, please cite the original UNSW-NB15 paper:

```bibtex
@inproceedings{moustafa2015unswnb15,
  title={UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems},
  author={Moustafa, Nour and Slay, Jill},
  booktitle={Military Communications and Information Systems Conference (MilCIS)},
  year={2015},
  organization={IEEE}
}
```

## License

The original UNSW-NB15 dataset is provided under CC BY 4.0 by the University of New South Wales. This reformatted version preserves the original license.
Provider:
lacg030175