MalwareBench: Malware samples are not enough
Source: Mendeley Data. Updated 2024-05-10; indexed 2024-06-28.
Download link:
https://zenodo.org/records/10573494
Resource description:
The prevalent use of third-party components in modern software development, coupled with rapid modernization and digitization, has significantly amplified the risk of software supply chain attacks. Large, popular registries such as npm and PyPI are heavily targeted as malware distribution channels because of their rapid growth and their dependence on third-party components. Industry and academia are working towards building tools to detect malware in the software supply chain. However, the lack of benchmark datasets containing both malware and neutral packages hampers the evaluation of these malware detection tools.

The goal of our study is to aid researchers and tool developers in evaluating and improving malware detection tools by contributing MalwareBench, a labeled benchmark dataset built by systematically collecting malicious and neutral packages from the npm and PyPI ecosystems. It comprises 20,792 packages (of which 6,659 are malicious). The dataset is constructed by amalgamating pre-existing malware datasets with Socket's internal benchmark data and incorporating both popular and newly released packages.

### Description of the data and file structure
MalwareBench includes malicious and neutral packages, each recorded with its package name, version, release type, and ground-truth label. We annotated the ground-truth label of each package as one of two groups:
- Malware: Packages written intentionally to carry out harmful actions, that is, to perform an unauthorized process that will have an adverse impact on the confidentiality, integrity, or availability of a system.
- Neutral: Packages with no discovered malware.
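As a minimal sketch of how the ground-truth labels might be used to score a detection tool, the Python below compares a tool's flagged packages against labeled records and derives precision and recall. The `PackageRecord` type, its field names, the `release_type` values, the label spelling, and the package names are all hypothetical; the dataset description does not specify the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackageRecord:
    """One labeled package; field names are assumed, not the real schema."""
    name: str
    version: str
    release_type: str  # hypothetical values such as "popular" or "new"
    label: str         # assumed spelling: "malware" or "neutral"


def confusion_counts(records, flagged):
    """Return (tp, fp, fn, tn) for a set of (name, version) pairs a tool flagged."""
    tp = fp = fn = tn = 0
    for record in records:
        predicted = (record.name, record.version) in flagged
        actual = record.label == "malware"
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0.0 when the denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented records for illustration only; none come from MalwareBench.
RECORDS = [
    PackageRecord("example-pkg-a", "1.0.0", "popular", "neutral"),
    PackageRecord("example-pkg-b", "0.0.1", "new", "malware"),
    PackageRecord("example-pkg-c", "2.3.1", "popular", "neutral"),
    PackageRecord("example-pkg-d", "0.1.0", "new", "malware"),
]
# Packages a hypothetical detection tool reported as malicious.
FLAGGED = {("example-pkg-b", "0.0.1"), ("example-pkg-c", "2.3.1")}

if __name__ == "__main__":
    tp, fp, fn, tn = confusion_counts(RECORDS, FLAGGED)
    print(precision_recall(tp, fp, fn))
```

Precision and recall are used here rather than accuracy because the classes are imbalanced: with 6,659 malicious packages out of 20,792 (about 32%), a tool that flags nothing would still be correct on roughly 68% of packages.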
In addition, we included further metadata for packages: the file path, file size, total number of files, package size, file extension, and package group.

### Sharing/Access information
We added a sample of our CSV file here. The complete dataset is hosted on GitHub. Since our dataset contains real-world malware, some packages may contain sensitive information. We will distribute the dataset upon reasonable request, based on ethical considerations of the purpose for using the data: after evaluating the stated reason for using the dataset, we will provide GitHub access. Please send an email to the authors to request access.

The data was derived from the following sources:
1. Backstabber’s knife collection: A review of open source software supply chain attacks.
2. Towards measuring supply chain attacks on package managers for interpreted languages.
3. An Empirical Study of Malicious Code In PyPI Ecosystem.
4. Open-Source Dataset of Malicious Software Packages
5. Socket Internal benchmark Dataset
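The metadata fields listed in the data description (file path, file size, total number of files, package size, file extension, package group) could be read from the CSV sample roughly as sketched below. This is an illustration under stated assumptions: the column headers, the one-row-per-file layout, and every value in `SAMPLE_CSV` are invented, since the description names the fields but not the actual header spelling or row granularity.

```python
import csv
import io
from collections import Counter

# Invented sample; the header names and values are assumptions, not real data.
SAMPLE_CSV = """package_name,version,release_type,label,file_path,file_size,total_files,package_size,file_extension,package_group
example-pkg-a,1.0.0,popular,neutral,index.js,1200,3,5400,.js,group-a
example-pkg-b,0.0.1,new,malware,setup.py,800,2,1500,.py,group-b
example-pkg-b,0.0.1,new,malware,payload.py,700,2,1500,.py,group-b
"""

# Columns assumed to hold integers (sizes in bytes, counts of files).
INT_FIELDS = ("file_size", "total_files", "package_size")


def load_rows(csv_text):
    """Parse CSV text into dicts, converting the assumed numeric columns to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in INT_FIELDS:
            row[field] = int(row[field])
        rows.append(row)
    return rows


def label_counts(rows):
    """Count distinct (package_name, version) pairs per ground-truth label."""
    label_by_package = {}
    for row in rows:
        label_by_package[(row["package_name"], row["version"])] = row["label"]
    return Counter(label_by_package.values())


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    print(label_counts(rows))
```

Counting by `(package_name, version)` rather than by row avoids overcounting a package that appears once per file, which is only relevant if the real CSV is laid out that way.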
Created:
2024-01-29
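Dataset introduction
Background overview
MalwareBench is a benchmark dataset for evaluating and improving malware detection tools, focused on the npm and PyPI ecosystems. It contains 20,792 packages (6,659 of them malicious), was built by combining existing malware datasets with internal data, and provides metadata such as package name, version, and ground-truth label to support performance evaluation by researchers and tool developers.
The summary above was collected and generated by 遇见数据集.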



