Malware Repositories and Their Authors on GitHub

Source: NIAID Data Ecosystem (indexed 2026-05-01)

Download link:

https://zenodo.org/record/10806592

Resource description:

This dataset comes from a study of the origins and motivations behind the creation of malware repositories on GitHub. The study examines the profiles and intentions of the GitHub users involved in creating such repositories. We identified roughly 14,000 GitHub users linked to malware repositories and used large language model (LLM) analysis to classify them by perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign.

The analysis reveals stark contrasts between these groups. Malicious authors typically have sparse profiles focused on nefarious activities, while Benign authors have well-rounded profiles and actively contribute to cybersecurity education and research. Authors labeled Likely Malicious show a range of engagement levels, underlining the complexity and diversity of this ecosystem.

We offer two datasets. The first is a list of malware repositories: we collected and extended the malware repositories on GitHub in 2022, following the original papers. The second is a CSV file with GitHub user information and a maliciousness classification label for each user.

malware_repos.txt

Purpose: This file contains a curated list of GitHub repositories identified as containing malware. The repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."

Contents: The file is a plain text file in which each line represents a unique repository in the format username/reponame. This format makes it easy to identify and access each repository on GitHub for further analysis or review.

Usage: The list is a resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.

obfuscated_github_user_dataset.csv

Purpose: Accompanying the list of malware repositories, this CSV file contains detailed but obfuscated profile information for the GitHub users who authored these repositories. The obfuscation protects user privacy and complies with ethical standards, given the sensitivity of associating individuals with potentially malicious activities.

Contents: The dataset includes columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Sensitive information has been masked or replaced with generic placeholders to prevent user identification.

Usage: This dataset supports researchers analyzing the behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models while upholding the necessary ethical considerations.
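As a rough illustration of how the two files described above might be read, the following Python sketch parses username/reponame lines and tallies classification labels from CSV text. It is only a sketch under stated assumptions: the description does not give the CSV header names, so the column name `label` is a guess that may need adjusting, and the sample rows and repository names are hypothetical placeholders, not entries from the dataset.

```python
import csv
import io
from collections import Counter


def parse_repo_list(text):
    """Return (username, reponame) pairs from lines shaped like username/reponame."""
    repos = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        owner, sep, name = line.partition("/")
        if not sep or not owner or not name:
            continue  # skip lines that do not match the expected format
        repos.append((owner, name))
    return repos


def repo_url(owner, name):
    """Build the GitHub URL for a repository."""
    return f"https://github.com/{owner}/{name}"


def count_labels(csv_text, label_column="label"):
    """Count rows per classification label.

    label_column is an ASSUMED header name; check the real CSV header first.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


# Hypothetical sample data, for illustration only.
SAMPLE_REPOS = "alice/example-one\n\nbob/example-two\nnot-a-repo-line\n"
SAMPLE_CSV = (
    "login,followers,label\n"
    "user_1,0,Malicious\n"
    "user_2,12,Benign\n"
    "user_3,3,Benign\n"
)

if __name__ == "__main__":
    for owner, name in parse_repo_list(SAMPLE_REPOS):
        print(repo_url(owner, name))
    print(count_labels(SAMPLE_CSV))
```

With the real files, the text would come from reading malware_repos.txt and obfuscated_github_user_dataset.csv from disk. Comparing the resulting label counts with the figures quoted in the description is a quick sanity check that the right column was chosen.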

Created:

2024-03-11
