five

Streaming Pipeline Anomaly Detection Benchmark: Kafka and Flink Telemetry Dataset

收藏
DataCite Commons2026-05-02 更新2026-05-07 收录
下载链接:
https://zenodo.org/doi/10.5281/zenodo.19968620
下载链接
链接失效反馈
官方服务:
资源简介:
This deposit accompanies the manuscript submitted to Elsevier Array: "ML-based anomaly detection for streaming pipelines: an empirical benchmark with cross-workload generalization on Apache Kafka and Flink". It contains two complete fault-injection campaigns (e-commerce and industrial IoT) on Apache Kafka 3.9 (KRaft) and Apache Flink 1.19, plus engineered features, analysis outputs, and the code that produced them. This is v6 of the deposit. v6 is the version of record for the Array submission and adds the empirical material for the new Section 4.10 "Generalisation stress tests": classical baselines (EWMA, CUSUM, Matrix Profile, POT, COPOD, HBOS, LODA), threshold-LOFO across all 8 held-out faults, full-HPO LOFO sanity check on three representative faults, and three-state evaluation (active / recovery / hybrid / binary regimes). v6 also gap-fills Deep SVDD and DAGMM at 15s/60s/120s on e-commerce. The datasets are byte-identical to v5; only code, analysis outputs, and the manuscript PDF differ. See CHANGELOG.md for the full v5 to v6 diff. Layout ecommerce_v6.tar.gz # e-commerce data (identical to v5) baseline_24h.parquet campaign_v1_metrics.parquet campaign_v2_metrics.parquet campaign_manifest_{v1,v2}.json features/ features_{15,30,60,120}s.parquet selection_report_*s.json iot_v6.tar.gz # IoT data (identical to v5) baseline_24h_iot.parquet baseline_stats_iot.json campaign_iot.parquet campaign_manifest_iot.json orchestrator_stats_iot.json features/ features_{15,30,60,120,300}s.parquet selection_report_*s.json analysis_outputs_v6.tar.gz # all CSVs cited in the manuscript panel_overall.csv panel_per_fault.csv subexp{1,2,3,4}_*.csv rq2_mixedeffects.csv curated_rules_v2_*.csv curated_rules_per_rule_tuned_*.csv latency_fixed_fpr_*.csv prevalence_sweep.csv, prevalence_sweep_bootstrap.csv classical_baselines_summary.csv, classical_baselines_per_fold.csv # NEW in v6 threshold_lofo_summary.csv, threshold_lofo_per_fold.csv # NEW in v6 three_state_eval_summary.csv, three_state_eval_per_fold.csv # NEW in v6 full_lofo_summary.csv, lofo_full_vs_threshold.csv # NEW in v6 lofo_full_vs_threshold.md # NEW in v6 FINDINGS.md code_v6.tar.gz # all scripts that produced the analysis outputs sagemaker/ training.py # now with --lofo-held-out flag (v6) bootstrap_v3_lofo.sh, setup_v3_lofo.sh # NEW in v6 bootstrap_v3_ecom_gapfill_multiwindow.sh, setup_*.sh # NEW in v6 plus the original v5 sagemaker scripts analysis/cross_workload/ classical_baselines.py # NEW in v6 lofo_analysis.py # NEW in v6 batch_inference.py # NEW in v6 three_state_eval.py # NEW in v6 plus the original v5 analysis scripts (curated_rules_*, latency_fixed_fpr, prevalence_sweep, rq2_mixedeffects, subexp{1-4}, merge_panels, regen_*) manuscript_array_v6.pdf # 25-page Array submission (post-Path-B) Quick start tar xzf ecommerce_v6.tar.gz tar xzf iot_v6.tar.gz tar xzf analysis_outputs_v6.tar.gz tar xzf code_v6.tar.gz python -c " import pandas as pd df = pd.read_parquet('ecommerce/features/features_30s.parquet') print(df.shape, df.label.value_counts().to_dict()) " Verifying the manuscript's headline numbers Manuscript artefact File Table 7 (RQ1, e-commerce 30s) panel_overall.csv (filter workload=ecommerce, window_s=30) Table 8 (RQ2 mixed effects) rq2_mixedeffects.csv Table 11 (cooldown / recovery exclusion) panel_overall.csv (RECOVERY-excl rows) Table 12 (computational cost) panel_overall.csv (train_time_s) Table 13 (classical baselines) classical_baselines_summary.csv (NEW in v6) Table 14 (threshold-LOFO) threshold_lofo_summary.csv (NEW in v6) Table 15 (full-HPO LOFO 3-fault) full_lofo_summary.csv (NEW in v6) Table 16 (three-state evaluation) three_state_eval_summary.csv (NEW in v6) Table 17 (IoT 30s RQ1) panel_overall.csv (filter workload=iot, window_s=30) Table 18 (cross-train transfer) subexp2_cross_train_auc_table.csv Section 4.11 Spearman rho subexp3_per_fault_spearman.csv Section 4.11 optimal windows subexp4_optimal_windows.csv Curated-rule baseline (raw) curated_rules_v2_*_per_fold.csv Curated-rule baseline (per-rule tuned) curated_rules_per_rule_tuned_summary.csv Section 4.4 fixed-FPR latency latency_fixed_fpr_summary.csv Reproducing the analyses cd code/analysis/cross_workload python rq2_mixedeffects.py python curated_rules_v2.py python curated_rules_per_rule_tuned.py python subexp1_bucket_consistency.py python subexp2_cross_train.py python subexp34_rank_correlation.py python classical_baselines.py python lofo_analysis.py Requires Python 3.10+ with pandas, numpy, scikit-learn, statsmodels, scipy, torch, pyod, stumpy. Models in the panel Eight semi-supervised: Isolation Forest, One-Class SVM, LSTM-AE, Transformer-AE, 1D-CNN-AE, LSTM-VAE, Deep SVDD, DAGMM. Plus a supervised Random Forest reference, a curated production-style rule baseline (Burrow / Xinfra Monitor / Ververica-style: consumer lag, UnderReplicatedPartitions, failed-checkpoint delta, backpressure, broker offline, rebalance burst, JVM heap pressure), and a statistical max-z diagnostic. In v6 the full eight-model panel is evaluated at 15s, 30s, 60s, and 120s on e-commerce and at 15s, 30s, 60s, 120s, and 300s on IoT. The 300s window remains absent on e-commerce because that grid point was added after the e-commerce campaign concluded. The asymmetry is documented in §6.3 (Threats to Validity). Companion code repository https://github.com/mateenali66/streaming-anomaly-detection The GitHub repository tracks the same code as code_v6.tar.gz and is the recommended starting point for development. The Zenodo deposit is the citable archive snapshot.
提供机构:
Zenodo
创建时间:
2026-05-02
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作