lacg030175/UNSW-NB15
Source: Hugging Face. Updated 2026-04-13; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/lacg030175/UNSW-NB15
Resource description:
---
language:
- en
license: cc-by-4.0
size_categories:
- 1M<n<10M
task_categories:
- tabular-classification
tags:
- network-intrusion-detection
- cybersecurity
- UNSW-NB15
- IDS
- binary-classification
- multi-class-classification
pretty_name: UNSW-NB15 Network Intrusion Detection
configs:
- config_name: temporal_3way
  data_files:
  - split: train
    path: temporal_3way/train-*
  - split: test
    path: temporal_3way/test-*
  - split: validation
    path: temporal_3way/validation-*
  default: true
- config_name: random_3way
  data_files:
  - split: train
    path: random_3way/train-*
  - split: test
    path: random_3way/test-*
  - split: validation
    path: random_3way/validation-*
- config_name: temporal
  data_files:
  - split: train
    path: temporal/train-*
  - split: test
    path: temporal/test-*
- config_name: standard
  data_files:
  - split: train
    path: standard/train-*
  - split: test
    path: standard/test-*
- config_name: random
  data_files:
  - split: train
    path: random/train-*
  - split: test
    path: random/test-*
---
# UNSW-NB15 Network Intrusion Detection Dataset
The [UNSW-NB15](https://research.unsw.edu.au/projects/unsw-nb15-dataset) dataset for network intrusion detection, provided with **two evaluation protocols** to enable fair comparison across the literature.
## Why This Dataset Exists
Published results on UNSW-NB15 range from **85% to 99% accuracy** — but the gap is almost entirely due to **evaluation protocol differences**, not model quality:
| Evaluation Protocol | Typical Accuracy | Example Papers |
|---|---|---|
| **Standard split** (temporal, 175K/82K) | 85–93% | Most published results |
| **Random split** (from full 2.28M records) | 97–99.6% | FWIW, some deep learning papers |
The random split achieves higher accuracy because:
1. **30% of records are duplicates** — random splitting leaks near-identical flows into both train and test
2. **No temporal shift** — the standard split has temporal separation between training and testing periods
Both protocols are valid for different purposes:
- **Standard split**: Realistic deployment scenario (train on past, test on future)
- **Random split**: Model-to-model comparison with temporal shift removed as a factor (train and test are drawn from the same period)
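The duplicate-leakage effect described above can be checked directly. The following is a minimal sketch, not the procedure used to build this dataset: `leaked_fraction` is a hypothetical helper, the toy flows are made up, and it only detects exact duplicates (near-identical flows would need a fuzzier comparison).

```python
from typing import Hashable, Iterable, Sequence


def leaked_fraction(train_rows: Iterable[Sequence[Hashable]],
                    test_rows: Iterable[Sequence[Hashable]]) -> float:
    """Fraction of test rows whose exact feature tuple also occurs in train."""
    seen = {tuple(row) for row in train_rows}
    test = [tuple(row) for row in test_rows]
    if not test:
        return 0.0
    return sum(row in seen for row in test) / len(test)


# Toy flows as (proto, service, sbytes, dbytes); the values are invented.
train = [
    ("tcp", "http", 100, 200),
    ("udp", "dns", 40, 60),
    ("tcp", "http", 100, 200),  # duplicate inside train
]
test = [
    ("tcp", "http", 100, 200),  # also present in train, so it counts as leaked
    ("tcp", "ftp", 10, 20),
]

print(leaked_fraction(train, test))  # 0.5
```

A value near zero on feature columns suggests the split is free of exact-duplicate leakage; it says nothing about temporal shift.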
## Configurations
### `temporal` (default) — Original Temporal Split
> **Note:** `standard` is an alias for `temporal` — both load the same data.
The official train/test CSV files from UNSW-NB15, containing **44 features** (all features except `id`).
```python
from datasets import load_dataset
ds = load_dataset("lacg030175/UNSW-NB15", "temporal") # or "standard" (alias)
# ds["train"]: 175,341 rows
# ds["test"]: 82,332 rows
```
- Source: `UNSW_NB15_training-set.csv` and `UNSW_NB15_testing-set.csv`
- Used by most published papers for benchmarking
- Temporal separation between train and test periods
### `random` — Deduplicated Random Split
Full dataset (2.28M records) with duplicates removed, split 90/10 with `random_state=0` and stratified by label. Contains **49 features** including IP addresses and ports.
```python
from datasets import load_dataset
ds = load_dataset("lacg030175/UNSW-NB15", "random")
# ds["train"]: 1,425,833 rows
# ds["test"]: 158,426 rows
```
- Source: All four raw UNSW-NB15 CSV files via [Mouwiya/UNSW-NB15](https://huggingface.co/datasets/Mouwiya/UNSW-NB15)
- Preprocessing: `drop_duplicates()` reduces 2,280,090 → 1,584,259 records
- Split: `train_test_split(test_size=0.1, random_state=0, stratify=label)`
- Comparable to the FWIW evaluation protocol (Susskind et al., 2023)
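The record counts listed above are consistent with each other, which the short arithmetic sketch below illustrates. It assumes the usual scikit-learn behaviour of rounding a fractional `test_size` up and assigning the remainder to train; the constant names are chosen here for illustration.

```python
import math

RAW_ROWS = 2_280_090    # rows before drop_duplicates(), as stated above
DEDUP_ROWS = 1_584_259  # rows after drop_duplicates(), as stated above
TEST_SIZE = 0.1         # test_size passed to train_test_split

# Share of the raw records removed as duplicates.
removed = RAW_ROWS - DEDUP_ROWS
removed_share = removed / RAW_ROWS

# Assumed rounding rule: test count is rounded up, train gets the rest.
n_test = math.ceil(TEST_SIZE * DEDUP_ROWS)
n_train = DEDUP_ROWS - n_test

print(removed, round(removed_share, 3))  # 695831 0.305
print(n_train, n_test)                   # 1425833 158426
```

The roughly 30% duplicate share and the 1,425,833 / 158,426 row counts match the figures quoted elsewhere in this card.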
## Baseline Results
| Model | Standard Split | Random Split |
|---|---|---|
| Random Forest | 87.2% | 99.6% |
| XGBoost | 87.3% | 99.4% |
| FWIW WNN (paper) | — | 98.5% |
## Labels
- **Binary** (`label`): 0 = Normal, 1 = Attack
- **Multi-class** (`attack_cat`): Normal, Analysis, Backdoor, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, Worms
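For multi-class training the category names usually need to be mapped to integers. The sketch below is one possible encoding, not something shipped with the dataset: the id order simply follows the list above, the helper names are hypothetical, and the `strip()` call is a defensive assumption in case category strings carry stray whitespace.

```python
# Category names in the order listed above; the integer ids are an
# arbitrary choice made for this example.
ATTACK_CATEGORIES = [
    "Normal", "Analysis", "Backdoor", "DoS", "Exploits",
    "Fuzzers", "Generic", "Reconnaissance", "Shellcode", "Worms",
]
CATEGORY_TO_ID = {name: i for i, name in enumerate(ATTACK_CATEGORIES)}


def encode_category(attack_cat: str) -> int:
    """Map an attack_cat string to an integer class id (raises KeyError if unknown)."""
    return CATEGORY_TO_ID[attack_cat.strip()]


def to_binary(attack_cat: str) -> int:
    """Derive the binary label: 0 for Normal, 1 for any attack category."""
    return 0 if attack_cat.strip() == "Normal" else 1


print(encode_category("Normal"), encode_category("Worms"))  # 0 9
print(to_binary("Normal"), to_binary("DoS"))                # 0 1
```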
### Class Distribution
**Standard split (train):**
- Normal: 56,000 (32%) | Attack: 119,341 (68%)
**Random split (after dedup):**
- Normal: 1,523,904 (96%) | Attack: 60,355 (4%)
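The two class mixes above have very different consequences for accuracy as a metric. As a rough sketch using only the counts quoted here, the code below computes the accuracy of always predicting the majority class and a set of inverse-frequency class weights. The helper names are hypothetical, and the weight formula `n / (k * count)` is the common "balanced" heuristic, offered as one option rather than a recommendation from the dataset authors.

```python
from typing import Dict

STANDARD_TRAIN = {"Normal": 56_000, "Attack": 119_341}
RANDOM_DEDUP = {"Normal": 1_523_904, "Attack": 60_355}


def majority_baseline_accuracy(counts: Dict[str, int]) -> float:
    """Accuracy of always predicting the most frequent class on this class mix."""
    return max(counts.values()) / sum(counts.values())


def balanced_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}


print(round(majority_baseline_accuracy(STANDARD_TRAIN), 4))  # 0.6806
print(round(majority_baseline_accuracy(RANDOM_DEDUP), 4))    # 0.9619
print(balanced_weights(RANDOM_DEDUP))
```

On data with the random split's class mix, a trivial "always Normal" predictor would already score about 96% accuracy, so per-class metrics such as recall on the attack class are worth reporting alongside accuracy.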
## Features
Both configs include flow-level network features:
| Category | Count | Examples |
|---|---|---|
| Flow | 6 | dur, proto, state, service, sbytes, dbytes |
| Content | 8 | sttl, dttl, sloss, dloss, sload, dload, Spkts, Dpkts |
| Time | 6 | Sjit, djit, Sintpkt, Dintpkt, tcprtt, synack |
| Additional | 13 | ct_srv_src, ct_dst_ltm, is_sm_ips_ports, ... |
| Generated | 5+ | trans_depth, res_bdy_len, swin, dwin, ... |
The `random` config additionally includes: `srcip`, `dstip`, `sport`, `dsport`, `Stime`, `Ltime`.
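Identifier columns such as IP addresses and timestamps are often dropped before training so a model cannot memorise specific hosts or capture times. The sketch below shows one way to separate features from targets for a single record. `split_row` is a hypothetical helper, the example row is invented, and the column names are taken from this card; check them against the actual data, since capitalisation may differ between configs.

```python
from typing import Any, Dict, Iterable, Tuple

# Extra columns the random config carries, per the sentence above.
IDENTIFIER_COLUMNS = ("srcip", "dstip", "sport", "dsport", "Stime", "Ltime")
# Target columns described in the Labels section.
TARGET_COLUMNS = ("label", "attack_cat")


def split_row(row: Dict[str, Any],
              drop: Iterable[str] = IDENTIFIER_COLUMNS,
              targets: Iterable[str] = TARGET_COLUMNS,
              ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (features, labels) for one record, omitting identifier columns."""
    drop_set, target_set = set(drop), set(targets)
    features = {k: v for k, v in row.items()
                if k not in drop_set and k not in target_set}
    labels = {k: v for k, v in row.items() if k in target_set}
    return features, labels


# Invented example record with a handful of columns.
row = {"srcip": "10.0.0.1", "dstip": "10.0.0.2", "sport": 1234, "dsport": 80,
       "dur": 0.5, "proto": "tcp", "label": 1, "attack_cat": "DoS"}
features, labels = split_row(row)
print(features)  # {'dur': 0.5, 'proto': 'tcp'}
print(labels)    # {'label': 1, 'attack_cat': 'DoS'}
```

When working with a loaded Hugging Face `Dataset` rather than plain dictionaries, `Dataset.remove_columns` can drop the same columns in one call.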
## Citation
If you use this dataset, please cite the original UNSW-NB15 paper:
```bibtex
@inproceedings{moustafa2015unswnb15,
title={UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems},
author={Moustafa, Nour and Slay, Jill},
booktitle={Military Communications and Information Systems Conference (MilCIS)},
year={2015},
organization={IEEE}
}
```
## License
The original UNSW-NB15 dataset is provided under CC BY 4.0 by the University of New South Wales. This reformatted version preserves the original license.
Provider:
lacg030175



