Streaming Pipeline Anomaly Detection Benchmark: Kafka and Flink Telemetry Dataset
收藏DataCite Commons2026-05-02 更新2026-05-07 收录
下载链接:
https://zenodo.org/doi/10.5281/zenodo.19968620
下载链接
链接失效反馈官方服务:
资源简介:
This deposit accompanies the manuscript submitted to Elsevier Array: "ML-based anomaly detection for streaming pipelines: an empirical benchmark with cross-workload generalization on Apache Kafka and Flink". It contains two complete fault-injection campaigns (e-commerce and industrial IoT) on Apache Kafka 3.9 (KRaft) and Apache Flink 1.19, plus engineered features, analysis outputs, and the code that produced them.
This is v6 of the deposit. v6 is the version of record for the Array submission and adds the empirical material for the new Section 4.10 "Generalisation stress tests": classical baselines (EWMA, CUSUM, Matrix Profile, POT, COPOD, HBOS, LODA), threshold-LOFO across all 8 held-out faults, full-HPO LOFO sanity check on three representative faults, and three-state evaluation (active / recovery / hybrid / binary regimes). v6 also gap-fills Deep SVDD and DAGMM at 15s/60s/120s on e-commerce. The datasets are byte-identical to v5; only code, analysis outputs, and the manuscript PDF differ. See CHANGELOG.md for the full v5 to v6 diff.
Layout
ecommerce_v6.tar.gz # e-commerce data (identical to v5)
baseline_24h.parquet
campaign_v1_metrics.parquet
campaign_v2_metrics.parquet
campaign_manifest_{v1,v2}.json
features/
features_{15,30,60,120}s.parquet
selection_report_*s.json
iot_v6.tar.gz # IoT data (identical to v5)
baseline_24h_iot.parquet
baseline_stats_iot.json
campaign_iot.parquet
campaign_manifest_iot.json
orchestrator_stats_iot.json
features/
features_{15,30,60,120,300}s.parquet
selection_report_*s.json
analysis_outputs_v6.tar.gz # all CSVs cited in the manuscript
panel_overall.csv
panel_per_fault.csv
subexp{1,2,3,4}_*.csv
rq2_mixedeffects.csv
curated_rules_v2_*.csv
curated_rules_per_rule_tuned_*.csv
latency_fixed_fpr_*.csv
prevalence_sweep.csv, prevalence_sweep_bootstrap.csv
classical_baselines_summary.csv, classical_baselines_per_fold.csv # NEW in v6
threshold_lofo_summary.csv, threshold_lofo_per_fold.csv # NEW in v6
three_state_eval_summary.csv, three_state_eval_per_fold.csv # NEW in v6
full_lofo_summary.csv, lofo_full_vs_threshold.csv # NEW in v6
lofo_full_vs_threshold.md # NEW in v6
FINDINGS.md
code_v6.tar.gz # all scripts that produced the analysis outputs
sagemaker/
training.py # now with --lofo-held-out flag (v6)
bootstrap_v3_lofo.sh, setup_v3_lofo.sh # NEW in v6
bootstrap_v3_ecom_gapfill_multiwindow.sh, setup_*.sh # NEW in v6
plus the original v5 sagemaker scripts
analysis/cross_workload/
classical_baselines.py # NEW in v6
lofo_analysis.py # NEW in v6
batch_inference.py # NEW in v6
three_state_eval.py # NEW in v6
plus the original v5 analysis scripts (curated_rules_*, latency_fixed_fpr,
prevalence_sweep, rq2_mixedeffects, subexp{1-4}, merge_panels, regen_*)
manuscript_array_v6.pdf # 25-page Array submission (post-Path-B)
Quick start
tar xzf ecommerce_v6.tar.gz
tar xzf iot_v6.tar.gz
tar xzf analysis_outputs_v6.tar.gz
tar xzf code_v6.tar.gz
python -c "
import pandas as pd
df = pd.read_parquet('ecommerce/features/features_30s.parquet')
print(df.shape, df.label.value_counts().to_dict())
"
Verifying the manuscript's headline numbers
Manuscript artefact
File
Table 7 (RQ1, e-commerce 30s)
panel_overall.csv (filter workload=ecommerce, window_s=30)
Table 8 (RQ2 mixed effects)
rq2_mixedeffects.csv
Table 11 (cooldown / recovery exclusion)
panel_overall.csv (RECOVERY-excl rows)
Table 12 (computational cost)
panel_overall.csv (train_time_s)
Table 13 (classical baselines)
classical_baselines_summary.csv (NEW in v6)
Table 14 (threshold-LOFO)
threshold_lofo_summary.csv (NEW in v6)
Table 15 (full-HPO LOFO 3-fault)
full_lofo_summary.csv (NEW in v6)
Table 16 (three-state evaluation)
three_state_eval_summary.csv (NEW in v6)
Table 17 (IoT 30s RQ1)
panel_overall.csv (filter workload=iot, window_s=30)
Table 18 (cross-train transfer)
subexp2_cross_train_auc_table.csv
Section 4.11 Spearman rho
subexp3_per_fault_spearman.csv
Section 4.11 optimal windows
subexp4_optimal_windows.csv
Curated-rule baseline (raw)
curated_rules_v2_*_per_fold.csv
Curated-rule baseline (per-rule tuned)
curated_rules_per_rule_tuned_summary.csv
Section 4.4 fixed-FPR latency
latency_fixed_fpr_summary.csv
Reproducing the analyses
cd code/analysis/cross_workload
python rq2_mixedeffects.py
python curated_rules_v2.py
python curated_rules_per_rule_tuned.py
python subexp1_bucket_consistency.py
python subexp2_cross_train.py
python subexp34_rank_correlation.py
python classical_baselines.py
python lofo_analysis.py
Requires Python 3.10+ with pandas, numpy, scikit-learn, statsmodels, scipy, torch, pyod, stumpy.
Models in the panel
Eight semi-supervised: Isolation Forest, One-Class SVM, LSTM-AE, Transformer-AE, 1D-CNN-AE, LSTM-VAE, Deep SVDD, DAGMM. Plus a supervised Random Forest reference, a curated production-style rule baseline (Burrow / Xinfra Monitor / Ververica-style: consumer lag, UnderReplicatedPartitions, failed-checkpoint delta, backpressure, broker offline, rebalance burst, JVM heap pressure), and a statistical max-z diagnostic.
In v6 the full eight-model panel is evaluated at 15s, 30s, 60s, and 120s on e-commerce and at 15s, 30s, 60s, 120s, and 300s on IoT. The 300s window remains absent on e-commerce because that grid point was added after the e-commerce campaign concluded. The asymmetry is documented in §6.3 (Threats to Validity).
Companion code repository
https://github.com/mateenali66/streaming-anomaly-detection
The GitHub repository tracks the same code as code_v6.tar.gz and is the recommended starting point for development. The Zenodo deposit is the citable archive snapshot.
提供机构:
Zenodo
创建时间:
2026-05-02



