
HandsomeWin/binary-30k

Hugging Face | Updated 2026-03-19 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/HandsomeWin/binary-30k
Resource description:
---
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
task_categories:
- text-classification
tags:
- binary-analysis
- malware-detection
- cybersecurity
- cross-platform
- tokenized
- stratified-splits
---

# Binary-30K: Cross-Platform Binary Dataset with Stratified Splits

[Paper](https://huggingface.co/papers/2511.22095) | [Code](https://github.com/mjbommar/binary-dataset-paper)

**🔗 Original Dataset (no splits):** [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)

This is the **stratified train/validation/test split version** of the Binary-30K dataset, containing **29,793 unique cross-platform binaries** with pre-computed tokenization. This version provides standardized splits for reproducible machine learning research.

## 🎯 Key Features

- ✅ **Stratified 70/15/15 splits** maintaining class balance across all sets
- ✅ **4-dimensional stratification** across malware/platform/format/architecture
- ✅ **26.9% malware balance** preserved in all splits (±0.1%)
- ✅ **Deterministic splits** (seed=42) for reproducible research
- ✅ **Ready for ML** - no manual splitting required
- ✅ **Pre-computed BPE tokenization** for transformer models

## 📊 Dataset Splits

| Split | Samples | Malware | Benign | Malware % |
|-------|---------|---------|--------|-----------|
| **Train** | 20,849 | 5,613 | 15,236 | 26.92% |
| **Validation** | 4,463 | 1,200 | 3,263 | 26.89% |
| **Test** | 4,481 | 1,210 | 3,271 | 27.00% |
| **Total** | 29,793 | 8,023 | 21,770 | 26.93% |

### Stratification Strategy

Splits maintain proportional representation across:

- ✅ **Malware vs. Benign** (26.9% malware in each split)
- ✅ **Platform** (Windows, Linux, macOS, Android, Other)
- ✅ **File Format** (PE, ELF, Mach-O, APK)
- ✅ **Architecture Groups** (common: x86/ARM vs. exotic: MIPS/RISC-V/PowerPC)

**19 unique strata identified** with proportional representation maintained across all splits.
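The idea of stratifying on several dimensions at once can be sketched as follows. This is a minimal illustration, not the script used to build the published splits: the function `stratified_split`, the field name `arch_group`, and the toy records are assumptions made for the example. Each record gets a composite key built from the stratification fields, and every stratum is shuffled with a fixed seed and cut 70/15/15.

```python
import random
from collections import defaultdict


def stratified_split(records, key_fields, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split records into train/validation/test, one stratum at a time."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        # Composite key: one tuple per combination of the stratification fields.
        strata[tuple(record[field] for field in key_fields)].append(record)

    train, val, test = [], [], []
    for key in sorted(strata):  # sorted so the iteration order is deterministic
        group = strata[key]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy data (hypothetical): two strata with 100 benign and 40 malware records.
records = [
    {"id": i, "is_malware": False, "platform": "windows",
     "file_format": "PE", "arch_group": "common"}
    for i in range(100)
] + [
    {"id": 100 + i, "is_malware": True, "platform": "linux",
     "file_format": "ELF", "arch_group": "common"}
    for i in range(40)
]

keys = ["is_malware", "platform", "file_format", "arch_group"]
train, val, test = stratified_split(records, keys)
print(len(train), len(val), len(test))
```

Because each stratum is split separately, the malware share of every split matches the share in the full toy set, and rerunning with the same seed returns the same partition.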
## 🚀 Quick Start

```python
from datasets import load_dataset

# Load dataset with splits
dataset = load_dataset("mjbommar/binary-30k")
train_ds = dataset["train"]  # 20,849 samples
val_ds = dataset["validation"]  # 4,463 samples
test_ds = dataset["test"]  # 4,481 samples

# Access pre-computed tokens
sample = train_ds[0]
print(f"Platform: {sample['platform']}")
print(f"Malware: {sample['is_malware']}")
print(f"Tokens: {len(sample['tokens'])} tokens")
```

### Example: Malware Classification

```python
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

# Load data
dataset = load_dataset("mjbommar/binary-30k")

# Tokens are pre-computed - just truncate
def prepare_example(example):
    return {
        "input_ids": example["tokens"][:512],
        "labels": int(example["is_malware"])
    }

# Train on standard splits
train_ds = dataset["train"].map(prepare_example)
val_ds = dataset["validation"].map(prepare_example)

# Train your model...
```

### Example: Cross-Platform Transfer Learning

```python
from datasets import load_dataset

dataset = load_dataset("mjbommar/binary-30k")

# Train on Windows, test on Linux
train_windows = dataset["train"].filter(lambda x: x["platform"] == "windows")
test_linux = dataset["test"].filter(lambda x: x["platform"] == "linux")

print(f"Windows training samples: {len(train_windows)}")
print(f"Linux test samples: {len(test_linux)}")

# Evaluate cross-platform generalization...
```

## 📦 Dataset Composition

**Platform Distribution:**

- **Windows**: 57.3% (17,239 samples) - PE32/PE32+ executables and DLLs
- **Linux**: 28.4% (8,452 samples) - ELF32/ELF64 from 9 distributions
- **macOS**: 1.9% (568 samples) - x86-64, ARM64, Universal binaries
- **Android**: 0.6% (164 samples) - APKs with native ARM libraries
- **Other**: 11.8% (3,370 samples) - Diverse formats and installers

**Architecture Diversity:**

- **Common**: x86-64 (56.4%), x86 (11.1%), ARM64 (5.9%), ARM (9.4%)
- **Exotic**: MIPS (2.3%), PowerPC (1.3%), RISC-V (0.1%), m68k, SuperH, ARCompact, SPARC, S/390

**Malware Sources:**

- **SOREL-20M**: 365 Windows PE malware samples (2017-2019)
- **Malware Bazaar**: 7,658 cross-platform malware samples (2020-2024)
  - Platform-first stratified sampling
  - ALL available macOS malware (560 samples)
  - ALL available Android malware (164 samples)
  - Balanced Windows/Linux with size stratification

## 📋 Data Structure

Each record contains **31 fields** organized into eight categories:

**Identification** (6 fields):

- `file_id`, `file_path`, `file_name`, `sha256`, `md5`, `file_size`

**Platform/Source** (5 fields):

- `platform`, `os_family`, `os_version`, `distribution`, `is_malware`

**File Characteristics** (6 fields):

- `file_format`, `architecture`, `binary_type`, `is_stripped`, `is_packed`, `is_signed`

**Structural Analysis** (4 fields + sections):

- `num_sections`, `code_size`, `data_size`, `sections[]`

**Dependencies** (4 fields + imports/exports):

- `num_imports`, `num_exports`, `imports[]`, `exports[]`

**Complexity** (1 field):

- `entropy` (Shannon entropy 0-8 scale)

**Pre-computed Tokenization** (4 fields):

- `tokens[]`, `token_count`, `compression_ratio`, `unique_tokens`

**Parser Diagnostics** (2 fields):

- `parse_status`, `parse_warnings[]`

### Pre-computed Tokenization

All binaries are tokenized using **BPE tokenization** ([`mjbommar/binary-tokenizer-001-64k`](https://huggingface.co/mjbommar/binary-tokenizer-001-64k)):

- **Average tokens per binary**: ~15,000
- **Compression ratio**: ~4.2 bytes/token
- **Vocabulary**: 64K tokens
- **Ready for transformers**: BERT, GPT, T5, etc.

## 🎓 Supported Research Tasks

1. **Malware Detection**: Binary classification with a consistent class ratio (26.9% malware in every split)
2. **Cross-Platform Analysis**: Transfer learning across Windows/Linux/macOS/Android
3. **Architecture-Invariant Detection**: Generalization to exotic architectures (IoT/embedded)
4. **Mobile Malware Research**: Dedicated Android and macOS malware samples
5. **Binary Similarity**: Embedding learning for similar binary detection
6. **Format-Agnostic Analysis**: Multi-format models (PE/ELF/Mach-O/APK)

## 📊 Comparison with Other Datasets

| Dataset | Size | Platforms | Architectures | Malware | Pre-tokenized | Splits |
|---------|------|-----------|---------------|---------|---------------|--------|
| **Binary-30K** | 30K | Win+Linux+macOS+Android | 15+ (incl. exotic) | 26.9% | ✅ | ✅ |
| SOREL-20M | 20M | Windows only | x86/x64 | 100% | ❌ | ❌ |
| EMBER | 1.1M | Windows only | x86/x64 | 50% | ❌ (features) | ✅ |
| Assemblage | 1.1M | Windows+Linux | x86/x64 | 0% (benign) | ❌ | ❌ |

## 🔍 Stratification Verification

**Split Distribution Verification:**

**TRAIN (20,849 samples):**

- Malware: 5,613 (26.92%)
- Top platforms: Windows (12,065), Linux (5,915), Other (1,200)
- Top formats: PE (12,018), ELF (5,915), Unknown (1,195)

**VALIDATION (4,463 samples):**

- Malware: 1,200 (26.89%)
- Top platforms: Windows (2,584), Linux (1,266), Other (256)
- Top formats: PE (2,574), ELF (1,266), Unknown (255)

**TEST (4,481 samples):**

- Malware: 1,210 (27.00%)
- Top platforms: Windows (2,590), Linux (1,271), Other (259)
- Top formats: PE (2,577), ELF (1,271), Unknown (258)

**Statistical Tests:** Chi-square tests confirm no significant deviation from proportional representation (p > 0.05 for all dimensions).
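As a rough check of the malware dimension only, the chi-square statistic can be recomputed from the malware/benign counts in the Dataset Splits table. The sketch below is an illustration rather than the authors' verification code; the helper `chi_square_independence` is a name introduced here, and the closed-form p-value relies on the table having exactly 2 degrees of freedom.

```python
import math

# Malware and benign counts per split, copied from the Dataset Splits table.
observed = {
    "train": (5613, 15236),
    "validation": (1200, 3263),
    "test": (1210, 3271),
}


def chi_square_independence(table):
    """Pearson chi-square statistic for a contingency table given as rows."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    statistic = 0.0
    for row in rows:
        row_total = sum(row)
        for observed_count, col_total in zip(row, col_totals):
            # Expected count if split membership and label were independent.
            expected = row_total * col_total / grand_total
            statistic += (observed_count - expected) ** 2 / expected
    return statistic


chi2 = chi_square_independence(observed)

# 3 splits x 2 classes gives (3 - 1) * (2 - 1) = 2 degrees of freedom.
# For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.4f}, p = {p_value:.4f}")
```

For the other dimensions (platform, format, architecture group) the degrees of freedom differ, so a general routine such as SciPy's `chi2_contingency` would be the more practical choice.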
## 🔄 Reproducibility

**Split Generation:**

- **Seed**: 42 (for reproducibility)
- **Method**: Stratified sampling with composite keys
- **Date**: November 15, 2025
- **Tool**: [`binary-dataset-paper`](https://github.com/mjbommar/binary-dataset-paper)

All splits are **deterministic and reproducible**. Using the same seed will always produce identical splits.

## 📚 Data Sources

**Linux Binaries:** Alpine 3.18/3.19, Debian 11-12, Ubuntu 20.04/22.04/24.04, Fedora 39-40, CentOS Stream 9, Arch Linux, Kali Linux 2024.1, Parrot OS 6.0, BusyBox 1.37.0

**Windows Binaries:** Windows 8 Pro, Windows 10 21H2/22H2, Windows 11 23H2, Windows Update Catalog

**Malware Samples:**

- SOREL-20M dataset (Sophos-ReversingLabs, 2020)
- Malware Bazaar (abuse.ch, 2020-2024) with platform-first stratified sampling

## ⚠️ Important Considerations

**Limitations:**

- Static analysis only (no dynamic/runtime behavior)
- Some binaries cannot be parsed by LIEF
- Many binaries have stripped debug symbols
- Very large binaries produce extended token sequences
- iOS/iPadOS binaries not included
- Uneven representation of exotic architectures

**Usage Notes:**

- **Malware samples require secure, isolated research environments**
- Windows binaries subject to Microsoft licensing terms
- Fair use application depends on jurisdiction
- Splits are standardized but users may create custom splits for specific research needs

## 📄 License and Attribution

**Dataset Compilation:** CC-BY-4.0 license by Michael J. Bommarito II

**Component Licenses:**

- Linux binaries: Various open-source licenses (GPL, LGPL, MIT, BSD, Apache)
- Windows binaries: Subject to Microsoft software licenses
- SOREL-20M samples: Follow [SOREL-20M License Agreement](https://github.com/sophos-ai/SOREL-20M)
- Malware Bazaar samples: Research use only, attribution required to [abuse.ch](https://abuse.ch/)

**Malware samples** are included for research purposes only. Users must comply with applicable laws and regulations when working with malware samples.

## 📖 Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{bommarito2025binary30k,
  title={Binary-30K: A Cross-Platform, Multi-Architecture Binary Dataset with Stratified Splits},
  author={Bommarito, Michael J., II},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/mjbommar/binary-30k}
}
```

## 🔗 Related Resources

- **Original dataset (no splits)**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
- **Tokenizer**: [`mjbommar/binary-tokenizer-001-64k`](https://huggingface.co/mjbommar/binary-tokenizer-001-64k)
- **Paper**: [Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection](https://huggingface.co/papers/2511.22095)
- **Code & Documentation**: [github.com/mjbommar/binary-dataset-paper](https://github.com/mjbommar/binary-dataset-paper)
- **Technical Documentation**: See [DATASET_SPLITS.md](https://github.com/mjbommar/binary-dataset-paper/blob/master/DATASET_SPLITS.md) for detailed stratification methodology

## 📞 Contact

**Author:** Michael J. Bommarito II
**Email:** michael.bommarito@gmail.com

## 🔄 Updates

- **2025-11-15**: Initial release with stratified train/val/test splits (70/15/15)

---

*Last Updated: November 15, 2025*
Provided by:
HandsomeWin