MalwareBench: Malware samples are not enough
Source: NIAID Data Ecosystem. Indexed: 2026-05-01
Download link:
https://zenodo.org/record/10573493
Description:
The prevalent use of third-party components in modern software development, coupled with rapid modernization and digitization, has significantly amplified the risk of software supply chain attacks. Large, popular registries such as npm and PyPI are frequent targets for attackers distributing malware, owing to their rapid growth and the heavy reliance on third-party components. Industry and academia are working towards building tools to detect malware in the software supply chain. However, the lack of benchmark datasets containing both malware and neutral packages hampers the evaluation of these malware detection tools. The goal of our study is to aid researchers and tool developers in evaluating and improving malware detection tools by contributing a benchmark dataset built by systematically collecting malicious and neutral packages from the npm and PyPI ecosystems.
MalwareBench is a labeled dataset aimed at aiding researchers and tool developers in evaluating and improving malware detection tools. It comprises 20,792 packages (of which 6,659 are malicious) collected systematically from the npm and PyPI ecosystems. The dataset is constructed by amalgamating pre-existing malware datasets with Socket's internal benchmark data and incorporating both popular and newly released packages.
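To make the class balance concrete, the sketch below derives the neutral-package count from the two figures stated above and computes precision, recall, and F1 for a trivial detector that flags every package as malware. The `Confusion` class and the "flag everything" baseline are illustrative assumptions, not part of the dataset or of any tool evaluated by the authors.

```python
from dataclasses import dataclass

# Figures taken from the dataset description above.
TOTAL_PACKAGES = 20_792
MALICIOUS_PACKAGES = 6_659
NEUTRAL_PACKAGES = TOTAL_PACKAGES - MALICIOUS_PACKAGES


@dataclass
class Confusion:
    """Confusion-matrix counts, treating 'malware' as the positive class."""
    tp: int  # malware flagged as malware
    fp: int  # neutral flagged as malware
    fn: int  # malware missed
    tn: int  # neutral left unflagged

    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical baseline: a detector that flags every package as malware.
flag_all = Confusion(tp=MALICIOUS_PACKAGES, fp=NEUTRAL_PACKAGES, fn=0, tn=0)

print(f"neutral packages: {NEUTRAL_PACKAGES}")
print(f"flag-all precision: {flag_all.precision():.3f}")
print(f"flag-all recall:    {flag_all.recall():.3f}")
print(f"flag-all F1:        {flag_all.f1():.3f}")
```

Because roughly 32% of the packages are malicious, this baseline reaches perfect recall with a precision of about 0.32, which is why a benchmark containing neutral packages is needed to expose false positives.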
Description of the data and file structure
MalwareBench includes malicious and neutral packages. Each record contains the package name, version, release type, and ground-truth label. We annotated the ground-truth label of each package as one of two groups:
Malware: Packages that are written intentionally to carry out harmful actions and intended to perform an unauthorized process that will have an adverse impact on the confidentiality, integrity, or availability of a system.
Neutral: Packages with no discovered malware.
In addition, we included metadata for each package: file path, file size, total number of files, package size, file extension, and package group.
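As a rough illustration of how such a CSV might be consumed, the following sketch tallies packages per ecosystem and label. The column names (`package_name`, `version`, `ecosystem`, `label`) and the three sample rows are invented for this example; the actual header names and values in the MalwareBench CSV may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical rows: column names and values are assumptions for illustration,
# not taken from the real MalwareBench CSV.
SAMPLE_CSV = """package_name,version,ecosystem,label
example-pkg-a,1.0.0,npm,malware
example-pkg-b,2.3.1,npm,neutral
example-pkg-c,0.1.0,pypi,neutral
"""


def count_labels(csv_text: str) -> Counter:
    """Count packages per (ecosystem, label) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["ecosystem"], row["label"]) for row in reader)


counts = count_labels(SAMPLE_CSV)
for (ecosystem, label), n in sorted(counts.items()):
    print(f"{ecosystem:5s} {label:8s} {n}")
```

To run this against a real file, replace `io.StringIO(csv_text)` with an open file handle and adjust the column names to match the header row.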
Sharing/Access information
A sample of our CSV file is included here; the complete dataset is hosted on GitHub. Because the dataset contains real-world malware, some packages may contain sensitive information. We distribute the dataset upon reasonable request, subject to ethical consideration of the intended use. After evaluating the stated purpose, we will provide GitHub access. To request access, please email the authors.
The data was derived from the following sources:
Backstabber’s knife collection: A review of open source software supply chain attacks.
Towards measuring supply chain attacks on package managers for interpreted languages.
An Empirical Study of Malicious Code In PyPI Ecosystem.
Open-Source Dataset of Malicious Software Packages
Socket Internal benchmark Dataset
Created:
2024-03-31



