it4lia/BODMAS_cleaned

Hugging Face | Updated 2026-03-30 | Indexed 2026-04-05
Download link: https://hf-mirror.com/datasets/it4lia/BODMAS_cleaned
Resource description:
---
language:
- en
pretty_name: BODMAS Cleaned
task_categories:
- tabular-classification
tags:
- cybersecurity
- malware
- static-analysis
- pe-files
- malware-detection
- malware-family-classification
- concept-drift
- temporal-analysis
- tabular
- ai-ready
- tabular-classification
size_categories:
- 100K<n<1M
---

# BODMAS Cleaned

BODMAS Cleaned is a cleaned and analysis-ready version of the original **BODMAS (Blue Hexagon Open Dataset for Malware Analysis)** dataset.

The original BODMAS dataset was introduced for machine-learning-based static malware analysis on Windows Portable Executable (PE) files, with support not only for binary malware detection but also for **temporal analysis** and **malware family studies**. This cleaned release preserves that analytical value while making the data easier to load, validate, and reuse.

Compared with the original source asset, this release removes duplicate samples, drops constant features, standardizes metadata, and preserves both timestamp and family information. Each sample is represented as a fixed-length numerical vector extracted statically from the original PE file, without executing the binary.

## Original dataset

This dataset is a cleaned derivative of the original BODMAS dataset:

- **Original name:** BODMAS (Blue Hexagon Open Dataset for Malware Analysis)
- **Original providers:** University of Illinois at Urbana-Champaign (UIUC) / Blue Hexagon
- **Original paper:** Yang et al. (2021). *BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware*
- **Original paper link:** https://liminyang.web.illinois.edu/data/DLS21_BODMAS.pdf
- **Original website/repository:** https://whyisyoung.github.io/BODMAS/

Please cite the original BODMAS paper when using this cleaned release in research.

## Files

| File | Description |
|---|---|
| `bodmas_clean.npz` | Index file with row/feature counts and file references |
| `bodmas_clean_X.npy` | Feature matrix (`float32`), raw memmap, shape `(134428, 2326)` |
| `bodmas_clean_y.npy` | Label vector (`int32`), `0 = benign`, `1 = malware` |
| `bodmas_clean_metadata.parquet` | Per-sample metadata: SHA-256, timestamps, family fields, and quality flags |
| `manifest.json` | Versioned manifest with checksums and artifact references |
| `bodmas_cleaned_dataset.ipynb` | Exploration and usage notebook |

## What’s in the dataset?

This cleaned release contains a fully labeled static malware dataset derived from the original BODMAS collection.

### Core contents

- **134,428 labeled samples**
- **2,326 numerical features**
- feature dtype: `float32`
- label dtype: `int32`
- labels:
  - `0` = benign
  - `1` = malware

### Label distribution

- **77,138 benign samples**
- **57,290 malware samples**

### Metadata retained

The cleaned metadata preserves the fields that make BODMAS especially useful for temporal and family-aware analysis, including:

- `sha256`
- `timestamp`
- `family`
- `record_id`
- `sample_id`
- `label_int`
- `label_str`
- `timestamp_raw`
- `hash_type`
- `bodmas_family`
- `flag_sha256`
- `flag_timestamp_missing`
- `flag_family_missing`
- `flag_is_benign`
- `feature_hash`

### Family information

- **580 unique malware families** are present in the cleaned dataset
- the family distribution is skewed:
  - top 5 families cover **35.5%** of malware samples
  - top 20 families cover **69.4%** of malware samples

### Feature representation

Samples are not raw executables. Each file is represented as a **fixed-length static feature vector** extracted from the original PE file.

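Because every sample is a fixed-length `float32` vector and the Files table describes the feature matrix as a raw memmap, individual rows can in principle be read without loading the whole array. The following is a minimal sketch under that assumption; `open_feature_matrix` is a hypothetical helper, not part of the dataset's tooling, and it assumes the file is a header-less, row-major buffer.

```python
import os

import numpy as np


def open_feature_matrix(path, n_rows, n_features, dtype=np.float32):
    """Memory-map a raw, row-major feature matrix (hypothetical helper).

    Assumes the file holds exactly n_rows * n_features values of `dtype`
    with no header; a size mismatch raises ValueError instead of
    silently misreading the data.
    """
    expected = n_rows * n_features * np.dtype(dtype).itemsize
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"{path}: expected {expected} bytes, found {actual}")
    # Read-only mapping: pages are loaded lazily as rows are indexed.
    return np.memmap(path, dtype=dtype, mode="r", shape=(n_rows, n_features))


# Example usage with the shape listed in the Files table:
#   X = open_feature_matrix("bodmas_clean_X.npy", 134428, 2326)
#   batch = np.asarray(X[:1024])  # copies only the first 1024 rows into memory
```

The byte-size check is a design choice for this sketch: if the file turned out to carry a header or a different dtype, the check would fail early rather than return shifted values.
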
These features describe structural and statistical properties of the binary, such as:

- PE headers
- sections
- imports
- entropy-related information
- histogram-based characteristics

## Cleaning summary

This release is the output of a quality-control and harmonization pipeline applied to the original BODMAS artifacts.

Main processing steps:

1. **Duplicate removal**
2. **Constant-feature filtering**, reducing the feature space from 2,381 to **2,326**
3. **Metadata standardization**
4. **Missing-value normalization and quality flagging**
5. **Family/label consistency checks**
6. **Manifest generation** for reproducibility and integrity checks

Summary of the cleaned release:

- original source size: **134,435 labeled samples**
- final cleaned size: **134,428 samples**
- final feature space: **2,326 features**
- **no separate unlabeled split**
- timestamps and malware family metadata preserved

In the cleaned metadata, family-label consistency is also checked using the BODMAS convention that empty family values correspond to benign samples, while non-empty family values correspond to malware samples.

## File structure

```text
BODMAS_cleaned/
├── bodmas_clean.npz
├── bodmas_clean_X.npy
├── bodmas_clean_y.npy
├── bodmas_clean_metadata.parquet
└── manifest.json
```

The `.npz` index stores `_rows` and `_features` for reliable loading. The feature matrix is a raw memmap-backed array and should be loaded with explicit dtype and shape.

Unlike EMBER Cleaned, there is **no separate unlabeled split** in BODMAS Cleaned.

## Requirements

To run the quickstart examples, install the minimum required dependencies:

```bash
pip install numpy pandas pyarrow
```

For notebook-based exploration and basic visualization, you may also install:

```bash
pip install jupyter matplotlib seaborn scikit-learn
```

## Quickstart

This example loads the labeled BODMAS Cleaned dataset and checks that features, labels, and metadata are consistent and ready for supervised use.

```python
import numpy as np
import pandas as pd

# Row and feature counts come from the index file.
idx = np.load("bodmas_clean.npz", allow_pickle=True)
n_rows = int(idx["_rows"])
n_features = int(idx["_features"])

# The feature matrix is a raw float32 buffer: read it flat, check its size, then reshape.
X = np.fromfile("bodmas_clean_X.npy", dtype=np.float32)
assert X.size == n_rows * n_features, (
    f"Unexpected X size: got {X.size}, expected {n_rows * n_features}"
)
X = X.reshape(n_rows, n_features)

# Labels are taken from the metadata table.
meta = pd.read_parquet("bodmas_clean_metadata.parquet")
y = meta["label_int"].to_numpy(dtype=np.int32, copy=False)

print(
    f"Dataset: {X.shape[0]} samples, {X.shape[1]} features | "
    f"labels: {len(y)} | metadata columns: {meta.shape[1]}"
)

# Features, labels, and metadata must have one row per sample; labels must be binary.
assert X.shape[0] == len(y) == len(meta)
assert set(np.unique(y)) == {0, 1}

print("Unique labels:", np.unique(y))
print("Metadata columns:", meta.columns.tolist())
print("All checks passed.")
```

## Notebook

The repository also includes an exploration notebook in `.ipynb` format, designed to provide additional context on the cleaned dataset, its structure, and its main analytical use cases.

The notebook can be used to:

- inspect the labeled dataset
- explore metadata fields, timestamps, and family distributions
- validate dataset consistency
- review temporal and family-aware analyses
- explore example downstream use cases

To open it locally, run:

```bash
jupyter notebook bodmas_cleaned_dataset.ipynb
```

or, if you use JupyterLab:

```bash
jupyter lab bodmas_cleaned_dataset.ipynb
```

Make sure to open the notebook from the dataset root directory so that relative file paths resolve correctly.

## Typical use cases

BODMAS Cleaned supports:

- binary malware detection
- malware family classification
- temporal analysis
- concept drift studies
- time-aware validation
- family distribution analysis
- feature importance analysis

The accompanying notebook includes loading, exploratory data analysis, temporal and family analysis, and example use cases focused on discriminative signal and model evaluation.

## Notes and limitations

- This is a **static-analysis** dataset only.
- The cleaned release contains **derived features**, not raw PE binaries.
- Family names are dataset-specific and should not be treated as a universal malware ontology.
- Temporal metadata should be respected during evaluation to avoid leakage.
- Family distribution is concentrated, so downstream family-level evaluation should account for skew.
- The dataset is intended for defensive research, benchmarking, and education.

## License

This cleaned release is derived from BODMAS. The original BODMAS data files are associated with the BSD-2 License. Please verify that your downstream redistribution and reuse remain aligned with the original BODMAS terms.

## References

If you use this dataset, please cite the original BODMAS paper:

```bibtex
@inproceedings{yang2021bodmas,
  title={BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},
  author={Yang, Limin and others},
  booktitle={Proceedings of the 2021 ACM Workshop on Data-Limited Security Research},
  year={2021}
}
```

APA: Yang, L., et al. (2021). *BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware*. In *Proceedings of the 2021 ACM Workshop on Data-Limited Security Research*.

## Contacts

- **Shared by:** ACN
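
## Example: time-aware split

The note above that temporal metadata should be respected during evaluation can be illustrated with a chronological train/test split. This is a minimal sketch, not the notebook's method: `temporal_split` and the cutoff date are hypothetical, and it assumes the `timestamp` column of the metadata parses with `pandas.to_datetime` and that metadata rows are aligned with the rows of the feature matrix.

```python
import numpy as np
import pandas as pd


def temporal_split(meta, cutoff, timestamp_col="timestamp"):
    """Return (train_idx, test_idx) row positions split at `cutoff`.

    Hypothetical helper: rows strictly before the cutoff go to train,
    rows at or after it go to test, and rows whose timestamp is missing
    or unparseable are left out of both sets.
    """
    ts = pd.to_datetime(meta[timestamp_col], errors="coerce", utc=True)
    cutoff = pd.Timestamp(cutoff)
    # Compare in UTC so naive and timezone-aware cutoffs behave the same way.
    if cutoff.tzinfo is None:
        cutoff = cutoff.tz_localize("UTC")
    else:
        cutoff = cutoff.tz_convert("UTC")
    valid = ts.notna().to_numpy()
    before = (ts < cutoff).to_numpy()
    train_idx = np.flatnonzero(valid & before)
    test_idx = np.flatnonzero(valid & ~before)
    return train_idx, test_idx


# Example usage (the cutoff is an arbitrary illustration):
#   train_idx, test_idx = temporal_split(meta, "2020-01-01")
#   X_train, y_train = X[train_idx], y[train_idx]
#   X_test, y_test = X[test_idx], y[test_idx]
```

Training only on samples observed before the cutoff keeps later samples out of the training set, which a random shuffle would not guarantee.
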
Provider: it4lia