You Can't Touch This: Detecting Typosquatting Packages for Enhanced Malware Prevention in Software Supply Chains

Indexed from NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/14907785

Resource description:

Short Summary

This repository includes the following datasets:

Own Dataset: A collection of 394 typosquatting packages with source code, which we collected from SonaType, Phylum.io, and Snyk listings.

Backstabbers Knife Collection: A snapshot of the Backstabbers Knife Collection taken during our analysis, for reproduction purposes.

MalOSS: A snapshot of the MalOSS dataset taken during our analysis, for reproduction purposes.

Source code: The source code of our programs and algorithms, mainly the Random Forest models and the Extended Damerau-Levenshtein Metric.

However, the source code of the packages provided by MalOSS and Backstabbers Knife Collection must be obtained from the respective owner or maintainer.

Abstract

In recent years, typosquatting has become a significant threat to software supply chain systems, where malicious packages deceptively mimic legitimate ones. Attackers register these fraudulent packages with names strikingly similar to those of legitimate packages. As a result, developers can mistakenly download these malicious packages by mistyping the intended package name or by selecting a package based on its convincing yet deceptive name. In this paper, we assess the effectiveness of string-matching algorithms in identifying potential typosquatting candidates. We construct an open dataset comprising 394 typosquatting packages and evaluate the performance of these algorithms based on their ability to detect typosquatting packages. In addition, we introduce a novel string-matching algorithm, an extension of the Damerau-Levenshtein distance, which demonstrates a notably higher true-positive rate than existing methods. Since our dataset contains features not previously considered, we also investigate how these new features affect the assignment accuracy of ML-based classifiers. Our results show an overall accuracy of 98.4% on our own dataset, and 96.0% and 93.5% on two other open datasets. These results provide valuable insights for researchers, package manager vendors, and developers to improve their understanding of malicious typosquatting packages and improve mediation strategies and technologies.
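To illustrate the kind of string matching the abstract refers to, the following is a minimal sketch of the classic restricted Damerau-Levenshtein distance (also called optimal string alignment), used to flag names that lie within a small edit distance of known packages. It is not the extended metric introduced by the authors, whose details are not given in this description. The function names, the `max_distance` threshold, and the sample package list are assumptions made for illustration only.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters. This is the textbook variant, not the
    extended metric from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def typosquat_candidates(name, popular, max_distance=1):
    """Return popular package names that `name` may be imitating.

    `popular` is a hypothetical list of well-known package names and
    `max_distance` is an assumed threshold, not a value from the paper.
    """
    return sorted(
        p for p in popular
        if p != name and osa_distance(name, p) <= max_distance
    )


if __name__ == "__main__":
    known = ["requests", "numpy", "flask"]  # illustrative list only
    print(typosquat_candidates("reqeusts", known))
```

A distance threshold alone will also flag legitimate packages with similar names, which is presumably why the authors combine string matching with additional features and ML-based classifiers.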

Created:

2025-03-13