Ammar-ss/BRIDGE

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Ammar-ss/BRIDGE
Description:
---
license: apache-2.0
tags:
- network-security
- iot-security
- intrusion-detection
- botnet-detection
- cybersecurity
- network-traffic
- anomaly-detection
- tabular
- heterogeneous-benchmark
language:
- en
task_categories:
- tabular-classification
size_categories:
- 100K<n<1M
library_name: pytorch
pipeline_tag: text-classification
---

# BRIDGE: Benchmark Reference for IoT Domain Generalisation Evaluation

**Paper:** [BRIDGE and TCH-Net: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection](https://arxiv.org/abs/2604.11324)

**Authors:** Ammar Bhilwarawala, Likhamba Rongmei, Harsh Sharma, Arya Jena, Kaushal Singh, Jayashree Piri, Raghunath Dey (KIIT University)

**Model:** [Ammar-ss/BRIDGE_and_TCH-Net](https://huggingface.co/Ammar-ss/BRIDGE_and_TCH-Net)

**Code:** [github.com/Ammar-ss/TCH-Net](https://github.com/Ammar-ss/TCH-Net)

---

## What is BRIDGE?

The network intrusion detection field has been quietly building on shaky ground for years. The overwhelming majority of published systems get evaluated on a single dataset, produce numbers that look great, and then don't transfer when you actually deploy them somewhere else. This isn't just an academic concern: Sommer and Paxson documented the closed-world fragility empirically over a decade ago, and the field has largely kept ignoring it.

Part of the reason multi-dataset evaluation hasn't happened is that the datasets are structurally incompatible. CICFlowMeter datasets export bidirectional flow statistics. Argus produces session-level records. Wireshark captures packet-level attributes. Kitsune generates statistical fingerprint vectors with no flow-level correspondence at all. You can't just concatenate these things. Existing approaches either throw everything into PCA and lose all semantic meaning, or do ad-hoc column renaming and introduce silent data integrity violations.

BRIDGE is a formally specified attempt to fix both problems at once.
It takes five publicly available IoT network security datasets and maps them into a single shared feature space through a 46-feature semantic canonical vocabulary. The mapping uses genuine equivalence only: a feature maps to a canonical slot only if it actually measures the same network-theoretic quantity. Features that don't exist in a given dataset get zero-filled explicitly, and coverage is disclosed fully for every dataset. Nothing is fabricated.

The result is a unified benchmark that actually stresses what matters: cross-capture-tool generalisation, cross-device-population generalisation, and cross-time generalisation. The leave-one-dataset-out (LODO) evaluation protocol, run on TCH-Net and five baseline architectures, reveals that all of them achieve LODO F1 between 0.39 and 0.56. The mean LODO F1 for TCH-Net is 0.5577, the first formally quantified community generalisation baseline in heterogeneous IoT intrusion detection. The field now has a ruler.

---

## Dataset Composition

Five datasets, selected deliberately to cover the widest possible range of capture modalities, network environments, device populations, and attack categories.

| Dataset | Capture Tool | Year | Coverage | Tier |
|---------|-------------|------|----------|------|
| CICIDS-2017 | CICFlowMeter | 2017 | 93% (43/46) | Primary |
| CIC-IoT-2023 | CICFlowMeter | 2023 | 87% (40/46) | Primary |
| Bot-IoT | Argus | 2019 | 39% (18/46) | Primary |
| Edge-IIoTset | Wireshark | 2022 | 22% (10/46) | Supplementary |
| N-BaIoT | Kitsune | 2018 | 15% (7/46) | Supplementary |

The lower coverage in supplementary datasets is expected and intentional. They were captured at the packet level with tools that don't produce flow statistics. Their value is structural: they represent genuinely different network environments and stress the feature alignment approach in ways that CICFlowMeter datasets cannot replicate on their own.

### Why these five?
**CICIDS-2017** covers 14 attack types over a five-day testbed. It's the most feature-complete source (93% coverage) and serves as a calibration anchor. Known labelling artefacts exist (Engelen et al. documented them), and we retain it because the multi-dataset evaluation prevents over-reliance on any single source.

**CIC-IoT-2023** was built around 105 physical IoT devices under 18 MITRE ATT&CK scenarios. It's the most recent dataset in the suite and reflects constrained, bursty IoT firmware behaviour.

**Bot-IoT** was captured with Argus rather than CICFlowMeter. Argus is a session-level tool that exports byte counts, duration, and TCP flags but not per-direction flow rates or subflow statistics. This is precisely the cross-capture-tool heterogeneity the vocabulary is designed to bridge. It's the only source imposing a 61% zero-fill regime on the canonical vocabulary.

**Edge-IIoTset** records packet-level traffic via Wireshark on Raspberry Pi IIoT nodes running MQTT, Modbus, CoAP, DNP3, and AMQP. Wireshark operates below the flow-aggregation layer, so canonical coverage falls to 22%. IIoT protocols impose strict timing regularity that attacks disrupt in ways that differ sharply from IT-network intrusions.

**N-BaIoT** contains pre-computed Kitsune statistical fingerprints: 115-dimensional vectors with no direct CICFlowMeter correspondence, giving 15% canonical coverage. Despite this, N-BaIoT achieves the highest per-dataset F1 (0.9854) under TCH-Net, because Mirai and BASHLITE botnet infections produce stereotyped high-volume traffic that's separable from benign behaviour even in just seven features.

---

## The 46-Feature Canonical Vocabulary

All inputs map to this fixed vocabulary, grounded in CICFlowMeter nomenclature.
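
The mapping rules described above (genuine equivalence, explicit zero-filling, disclosed coverage) can be illustrated with a minimal sketch. This is not the authors' implementation. It uses only a five-name subset of the canonical vocabulary, and the Argus-style source column names (`dur`, `spkts`, `dpkts`) and the helper names `to_canonical` and `coverage` are assumptions made for the example.

```python
import math

# Five canonical names chosen for illustration; the full vocabulary has 46.
CANONICAL_SUBSET = [
    "flow_duration", "pkt_count_fwd", "pkt_count_bwd", "flag_syn", "init_win_fwd",
]

# Hypothetical source-column -> canonical-slot mapping for an Argus-style
# record. The source column names are assumptions made for this example.
EXAMPLE_MAPPING = {
    "dur": "flow_duration",
    "spkts": "pkt_count_fwd",
    "dpkts": "pkt_count_bwd",
}


def to_canonical(record, mapping, vocab=CANONICAL_SUBSET):
    """Place mapped source values into canonical slots; other slots stay 0.0."""
    slot = {name: i for i, name in enumerate(vocab)}
    vector = [0.0] * len(vocab)  # explicit zero-fill for absent features
    for source_col, canonical_name in mapping.items():
        try:
            value = float(record.get(source_col, 0.0))
        except (TypeError, ValueError):
            value = 0.0  # non-numeric input
        if not math.isfinite(value):
            value = 0.0  # NaN or infinite input
        vector[slot[canonical_name]] = value
    return vector


def coverage(mapping, vocab=CANONICAL_SUBSET):
    """Fraction of canonical slots that the source can actually populate."""
    return len(set(mapping.values()) & set(vocab)) / len(vocab)


record = {"dur": "1.5", "spkts": 3, "dpkts": "nan"}
print(to_canonical(record, EXAMPLE_MAPPING))
print(coverage(EXAMPLE_MAPPING))
```

Slots with no genuine source equivalent are never guessed; they stay at zero, and the coverage figure states how much of the vector is real signal.
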

| Group | Semantic Category | Indices | Count |
|-------|------------------|---------|-------|
| 1 | Flow rates, durations, pkt/byte counts | 0–16 | 17 |
| 2 | Packet size & IAT statistics | 17–37 | 21 |
| 3 | TCP flag indicators | 38–43 | 6 |
| 4 | Header length & window size | 44–45 | 2 |

**Group 1 (temporal, T-branch primary):** `flow_duration`, `pkt_count_fwd`, `pkt_count_bwd`, `byte_count_fwd`, `byte_count_bwd`, `pkt_rate`, `byte_rate`, `fwd_pkt_rate`, `bwd_pkt_rate`, `fwd_byte_rate`, `bwd_byte_rate`, `pkt_count_total`, `byte_count_total`, `fwd_pkt_len_total`, `bwd_pkt_len_total`, `subflow_fwd_pkts`, `subflow_bwd_pkts`

**Group 2 (statistical, H-branch primary):** `pkt_len_min`, `pkt_len_max`, `pkt_len_mean`, `pkt_len_std`, `pkt_len_var`, `fwd_pkt_len_min`, `fwd_pkt_len_max`, `fwd_pkt_len_mean`, `fwd_pkt_len_std`, `bwd_pkt_len_min`, `bwd_pkt_len_max`, `bwd_pkt_len_mean`, `bwd_pkt_len_std`, `iat_mean`, `iat_std`, `iat_max`, `iat_min`, `fwd_iat_mean`, `fwd_iat_std`, `bwd_iat_mean`, `bwd_iat_std`

**Group 3:** `flag_syn`, `flag_ack`, `flag_fin`, `flag_rst`, `flag_psh`, `flag_urg`

**Group 4:** `fwd_header_len`, `init_win_fwd`

Three explicit constraints govern the mapping. Genuine equivalence only: superficially similar but semantically distinct quantities are never mapped. Explicit zero-filling: absent features are set to zero and coverage is disclosed. No dimensionality reduction: PCA is excluded because it destroys semantic interpretability.

---

## Preprocessing Pipeline

**1. Class Balancing**

Each dataset is balanced independently to a strict 1:1 benign-to-attack ratio by downsampling the majority class. A minimum of 5,000 samples per class is preserved. The 1:1 ratio was selected after pilot experiments with 3:1 and 1:3 ratios revealed class collapse in datasets with low initial attack proportions (CICIDS-2017 starts at 14.5% attack, which produces window attack incidence below 10% without strict balancing).

**2. Semantic Vector Construction**

Each record is mapped to a 46-dimensional canonical vector. All values are parsed as float32. Non-numeric, NaN, and infinite values are replaced with zero.

**3. Train/Test Split**

80/20 stratified random split on the combined data, performed before scaling to prevent leakage.

**4. Normalisation**

A single `RobustScaler(quantile_range=(5, 95))` is fitted exclusively on the training split and applied to both splits without refitting. Values are clipped to [−10, 10]. A shared scaler is used deliberately: per-dataset scaling would normalise away inter-dataset distributional differences that carry discriminative information, and would constitute leakage in the LODO protocol.

**5. Leakage Verification**

Three checks were applied and all passed: the scaler was fitted before any test-set access; hash-based overlap detection confirmed zero identical feature vectors between train and test; and the attack-to-benign ratio was consistent between splits (train 0.758, test 0.750).

**6. Sequence Construction**

A sliding window of length W=32 with stride S=4 is applied within each dataset's records, producing sequences of shape (N_seq, 32, 46). Window labels are assigned by majority vote. Training sequences are capped at 800,000 and test sequences at 200,000.

---

## Post-Balancing Record Counts

| Dataset | Benign | Attack | Total | Attack % |
|---------|--------|--------|-------|----------|
| CICIDS-2017 | 19,321 | 14,350 | 33,671 | 42.6% |
| CIC-IoT-2023 | 3,964 | 3,001 | 6,965 | 43.1% |
| Bot-IoT | 22 | 16 | 38 | 42.1% |
| Edge-IIoTset | 30,951 | 23,435 | 54,386 | 43.1% |
| N-BaIoT | 13,557 | 10,055 | 23,612 | 42.6% |
| **Combined** | **67,815** | **50,857** | **118,672** | **42.9%** |

**Note on Bot-IoT:** Post-balancing yields only 38 records (22 benign, 16 attack). Bot-IoT contributes minimal window sequences to the training set. Its value in BRIDGE is structural. It represents Argus-captured session-level traffic, which imposes a 61% zero-fill regime and tests cross-capture-tool generalisation.
It is not a statistically dominant training source.

---

## Attack Types Covered

Across the five datasets, BRIDGE covers a broad range of attack categories including DDoS (flooding, UDP, ICMP), DoS (slowloris, Hulk), botnet C&C (Mirai, BASHLITE, Satori variants), port scanning (SYN, FIN), brute force (SSH, FTP, HTTP), web attacks (SQL injection, XSS, command injection), infiltration and backdoor activity, data exfiltration, reconnaissance, and industrial IIoT protocol attacks (MQTT, Modbus probing).

---

## Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("Ammar-ss/BRIDGE")
```

Or directly as a CSV:

```python
import pandas as pd

df = pd.read_csv("hf://datasets/Ammar-ss/BRIDGE/BRIDGE.csv")
```

The CSV contains the 46 canonical features, a `label` column (0 = benign, 1 = attack), and a `dataset_source` column indicating which of the five original datasets each row came from.

---

## Intended Use

BRIDGE is intended for benchmarking IDS and botnet detection models across heterogeneous network data, studying cross-domain generalisation in network security, evaluating feature alignment strategies for multi-source traffic data, and reproducing experiments from the accompanying paper.

The BRIDGE LODO mean F1 of **0.5577** (TCH-Net) is proposed as a formally quantified community baseline. It's not a ceiling; it's a starting point. Domain adversarial training and dataset-conditional normalisation are the most directly motivated directions for improving on it, and BRIDGE and the canonical vocabulary provide the infrastructure to measure that progress reproducibly.

It is **not** intended for deployment as a production IDS without additional evaluation on your specific network environment. All five source datasets were collected in controlled testbed conditions.

---

## Source Datasets and Acknowledgements

BRIDGE is derived from five public datasets. Each has its own license; please review them before any commercial or derivative use.

| Dataset | Official Page | Kaggle |
|---------|--------------|--------|
| CICIDS-2017 | [UNB CIC](https://www.unb.ca/cic/datasets/ids-2017.html) | [dhoogla/cicids2017](https://www.kaggle.com/datasets/dhoogla/cicids2017) |
| CIC-IoT-2023 | [UNB CIC](https://www.unb.ca/cic/datasets/iotdataset-2023.html) | [raqeeb24/ciciot-2023-stratified-dataset](https://www.kaggle.com/datasets/raqeeb24/ciciot-2023-stratified-dataset) |
| Bot-IoT | [UNSW Canberra](https://research.unsw.edu.au/projects/bot-iot-dataset) | [vigneshvenkateswaran/bot-iot-5-data](https://www.kaggle.com/datasets/vigneshvenkateswaran/bot-iot-5-data) |
| Edge-IIoTset | [IEEE DataPort](https://ieee-dataport.org/documents/edge-iiotset-new-comprehensive-realistic-cyber-security-dataset-iot-and-iiot-applications) | [mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot](https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot) |
| N-BaIoT | [UCI ML Repo](https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT) | [mkashifn/nbaiot-dataset](https://www.kaggle.com/datasets/mkashifn/nbaiot-dataset) |

This unified benchmark is released under Apache 2.0.

---

## Citation

```bibtex
@article{bhilwarawala2026bridge,
  title   = {{BRIDGE} and {TCH-Net}: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain {IoT} Botnet Detection},
  author  = {Bhilwarawala, Ammar and Rongmei, Likhamba and Sharma, Harsh and Jena, Arya and Singh, Kaushal and Piri, Jayashree and Dey, Raghunath},
  journal = {arXiv preprint arXiv:2604.11324},
  year    = {2026}
}
```
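
As a usage sketch for the leave-one-dataset-out protocol, the `dataset_source` and `label` columns are enough to build the folds and average the per-fold F1. The code below is an illustration under stated assumptions, not the paper's evaluation code: the source names `A`, `B`, `C`, the single feature `x`, and the threshold rule standing in for a model are all toy placeholders, and the helper names are hypothetical.

```python
def f1_binary(y_true, y_pred):
    """F1 for the attack class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def lodo_folds(rows, source_key="dataset_source"):
    """Yield (held_out, train_rows, test_rows), holding out one source at a time."""
    for held_out in sorted({row[source_key] for row in rows}):
        train = [r for r in rows if r[source_key] != held_out]
        test = [r for r in rows if r[source_key] == held_out]
        yield held_out, train, test


def lodo_mean_f1(rows, fit, predict):
    """Train on all sources but one, test on the held-out one, average the F1."""
    scores = {}
    for held_out, train, test in lodo_folds(rows):
        model = fit(train)
        y_pred = [predict(model, r) for r in test]
        y_true = [r["label"] for r in test]
        scores[held_out] = f1_binary(y_true, y_pred)
    return scores, sum(scores.values()) / len(scores)


# Toy rows: source C has the opposite feature/label relationship to A and B,
# mimicking a domain the model has never seen.
rows = [
    {"dataset_source": "A", "x": 0.1, "label": 0},
    {"dataset_source": "A", "x": 0.9, "label": 1},
    {"dataset_source": "B", "x": 0.2, "label": 0},
    {"dataset_source": "B", "x": 0.8, "label": 1},
    {"dataset_source": "C", "x": 0.7, "label": 0},
    {"dataset_source": "C", "x": 0.3, "label": 1},
]


def fit(train):
    # Placeholder "model": the mean of x over the training rows, used as a threshold.
    return sum(r["x"] for r in train) / len(train)


def predict(threshold, row):
    return 1 if row["x"] > threshold else 0


scores, mean_f1 = lodo_mean_f1(rows, fit, predict)
print(scores, mean_f1)
```

With the real CSV loaded through pandas, `df.to_dict("records")` produces rows in the shape this sketch expects. Any scaler or other fitted preprocessing belongs inside `fit`, so that the held-out dataset never influences it.
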
Provider:
Ammar-ss