LCC-LLM: Leveraging Code-Centric Dataset for Large Language Models Malware Family Attribution

Name: LCC-LLM: Leveraging Code-Centric Dataset for Large Language Models Malware Family Attribution
Creator: KAUST Research Repository
Published: 2026-04-26 12:30:17
License: No description available

Source: DataCite Commons. Updated 2026-04-26; indexed 2026-05-03.

Download link:

https://repository.kaust.edu.sa/handle/10754/709465


Description:

The **Large-scale Code-Centric Dataset (LCCD)** is a malware analysis dataset containing ~34,700 binary samples with deep static analysis, AI-generated analysis, decompiled code, control flow graphs, threat intelligence data, and pre-built training data for machine learning.

LCC-LLM is a comprehensive code-centric dataset designed to support Large Language Model (LLM)-based malware family attribution. The dataset includes decompiled C code, assembly instructions, function call graphs (FCGs), hex dumps, and rich metadata for both malware and benign executables, enabling advanced research in malware understanding, cyber threat intelligence, and AI-driven cybersecurity.

# Directory Structure

```
LCCD_DATASET_JSONL/
  README.md        # This file
  samples/         # 227 shards (~22 GB) - core binary analysis data
  training_data/   # 3 shards (~100 MB) - LLM training examples
  graph_nodes/     # 8 shards (~450 MB) - knowledge graph
  metadata/        # 1 shard (<1 KB) - dataset version info
```

## Data Format Notes

- **Compression**: Zstandard (level 19). Decompress with the `zstandard` Python library or the `zstd` CLI tool.
- **MongoDB `_id` fields** have been removed from all documents.
- **GridFS fields**: The `samples` collection originally stored large fields (assembly code, AI analysis, etc.) via MongoDB GridFS. These have been reassembled and inlined directly into each sample document for portability.
- **Binary data**: The `binary_sample` field in `samples` contains base64-encoded binary content. This field is **truncated** (not the full binary) to keep file sizes manageable.
- **Datetime fields**: Stored as ISO 8601 strings or MongoDB extended JSON datetime objects.

## Data Sources

The samples in this dataset were collected from two public sources:

- **[DikeDataset](https://github.com/iosifache/DikeDataset)** - A curated dataset of benign and malicious PE files for malware research.
- **[MalwareBazaar](https://bazaar.abuse.ch/)** - A public repository of malware samples maintained by abuse.ch.
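
The format notes above (Zstandard-compressed JSONL shards, datetimes stored either as ISO 8601 strings or as MongoDB extended JSON objects) can be handled with a small reader. The following is a minimal sketch, not the dataset authors' loader. It assumes each shard holds one JSON document per line, that extended JSON dates use the usual `{"$date": ...}` wrapper, and that timestamps without an offset are UTC. The function names are hypothetical.

```python
import io
import json
from datetime import datetime, timezone


def open_shard(path):
    """Open a Zstandard-compressed shard as a UTF-8 text stream.

    Requires the third-party `zstandard` package; it is imported lazily
    so the rest of this module works without it.
    """
    import zstandard

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


def iter_documents(lines):
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def parse_datetime(value):
    """Normalize either datetime encoding to a timezone-aware UTC datetime."""
    # Unwrap MongoDB extended JSON: {"$date": ...}
    if isinstance(value, dict) and "$date" in value:
        value = value["$date"]
        # Canonical form nests milliseconds as {"$numberLong": "<ms>"}
        if isinstance(value, dict) and "$numberLong" in value:
            value = int(value["$numberLong"])
    # Numeric values are milliseconds since the Unix epoch
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    # Otherwise treat the value as an ISO 8601 string
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

A typical loop would be `with open_shard(shard_path) as fh: for doc in iter_documents(fh): ...`, where `shard_path` points at a file under `samples/`. The shard file names are not given here, so the path is left to the reader.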

MalwareBazaar samples span **January 2022 to January 2026**. DikeDataset samples are partially sourced from an earlier dataset published in 2018 (exact collection period unspecified by the original authors). All analysis (disassembly, decompilation, static analysis, AI analysis, CTI enrichment, etc.) was performed as part of the LCCD pipeline after collection.

## Scope & Limitations

- **Platform**: Windows only. All samples are **PE (Portable Executable)** files.
- **No ELF, Mach-O, APK, or other formats** are included.
- **Benign and malicious**: The dataset includes both benign and malicious samples (see `is_malicious` and `malware_category` fields).
- **Analysis tooling bias**: Static analysis results depend on the tools used (Ghidra, Radare2, etc.). Different tools may produce different outputs for the same binary.
- **AI analysis is model-dependent**: AI-generated fields were produced by LLMs at a specific point in time and may contain inaccuracies or hallucinations.
- **Temporal bias**: MalwareBazaar samples cover 2022-2026; DikeDataset samples originate from a 2018 publication with an unspecified collection period. The dataset will not include threats emerging after January 2026.

## Intended Use

- Training and evaluating ML models for malware classification, detection, and analysis.
- Fine-tuning LLMs on cybersecurity and reverse engineering tasks (using the `training_data` collection).
- Research on malware behavior, family clustering, and threat intelligence correlation.
- Building and benchmarking static analysis pipelines.

**Out-of-scope uses**: This dataset is intended for defensive security research and education. It should not be used to develop offensive malware or to facilitate attacks.

## Ethical Considerations

- This dataset was built from **real-world malware samples** sourced from public repositories (DikeDataset and MalwareBazaar).
- **No full binaries are included**.
  The `binary_sample` field contains only a truncated, base64-encoded fragment of each file, which is not enough to reconstruct or execute the original binary.
- The dataset contains decompiled code, disassembly, and behavioral analysis of malicious software. While this information is valuable for research, it could theoretically inform malicious actors. This is consistent with existing public resources (VirusTotal, MalwareBazaar, etc.) that share similar analysis data openly to advance collective defense.
- Threat intelligence fields (`bazaar_data`, `cti`, `misp_data`) may reference real-world threat actors, campaigns, or infrastructure.

**Download**: All files are available through the Globus download directory linked from the repository record, which also provides instructions for retrieving large files via Globus.
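
As a quick sanity check on the fields described above, the sketch below tallies samples by label and decodes the truncated `binary_sample` fragment. It is illustrative only and rests on assumptions: `is_malicious` is taken to be a boolean, `malware_category` a string or missing, and the fragment is taken to start at the beginning of the file (so a PE file would begin with the `MZ` magic bytes). The helper names are hypothetical.

```python
import base64
from collections import Counter


def decode_fragment(sample):
    """Decode the truncated `binary_sample` field to bytes.

    If the base64 text itself was cut mid-quantum, the incomplete
    trailing characters are dropped so decoding does not fail.
    """
    encoded = sample.get("binary_sample") or ""
    usable = len(encoded) - len(encoded) % 4
    return base64.b64decode(encoded[:usable])


def looks_like_pe_start(fragment):
    """Return True if the fragment begins with the DOS/PE 'MZ' magic.

    Assumption: the stored fragment is the start of the file.
    """
    return fragment[:2] == b"MZ"


def summarize_labels(samples):
    """Count samples per (label, malware_category) pair."""
    counts = Counter()
    for sample in samples:
        # Assumption: `is_malicious` is a boolean flag
        label = "malicious" if sample.get("is_malicious") else "benign"
        counts[(label, sample.get("malware_category"))] += 1
    return counts
```

Passing an iterable of sample documents to `summarize_labels` gives a first look at class balance before training, while `decode_fragment` yields only a partial header, consistent with the note that full binaries are not distributed.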

Provider:

KAUST Research Repository

Created:

2026-04-26
