Vulnerability prediction using pre-trained models: An empirical evaluation [Dataset]

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/15082635

Description:

This dataset extends a publicly available dataset originally published by Bagheri et al. in their paper: A. Bagheri and P. Hegedűs, "A comparison of different source code representation methods for vulnerability prediction in python", Quality of Information and Communications Technology, 2021. Bagheri et al. used a version control system as the data source for collecting source code components. Specifically, they used GitHub, since it hosts a large number of software projects.
To create a labeled dataset, i.e., a dataset of files each tagged with a label stating whether the file is vulnerable or not, they scanned the commit messages of Python projects on GitHub. In particular, they searched for commits whose messages contain vulnerability-fixing keywords, and gathered a large number of Python source files included in such commits. The version of each file before the vulnerability-fixing commit (i.e., the parent version) is considered vulnerable, since it contains the vulnerability that required a patch, whereas the version of the file in the vulnerability-fixing commit is considered non-vulnerable. However, in their study, Bagheri et al. utilized only the fragment of the diff file that contains the difference between the vulnerable and the fixed version, and they proposed models to separate the “bad” and the “good” parts of a file.
In the current study, we extend their dataset by collecting clean (i.e., non-vulnerable) versions of files from GitHub. For this purpose, we retrieved files from the latest version of the GitHub repositories in the dataset. The latest versions are the safest candidates to be considered non-vulnerable, because no vulnerabilities have yet been reported for them. With whole files available for both classes, we can construct models that perform vulnerability prediction at the file level of granularity.
Overall, the extended dataset contains 4,184 Python files, of which 3,186 are considered vulnerable and 998 are considered neutral (i.e., non-vulnerable).
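The labeling rule described above can be illustrated with a minimal Python sketch. It is not the authors' collection pipeline: the keyword list, the `Sample` type, and the `build_samples` function are hypothetical names introduced for illustration, and the description does not enumerate the keywords that were actually used. The sketch assumes that commit messages and file contents have already been fetched from GitHub.

```python
import re
from dataclasses import dataclass

# Hypothetical keywords: the dataset description does not list the real ones.
FIX_KEYWORDS = ("vulnerability", "security", "xss", "sql injection")
_FIX_PATTERN = re.compile(
    "|".join(re.escape(k) for k in FIX_KEYWORDS), re.IGNORECASE
)


@dataclass
class Sample:
    path: str
    source: str
    label: int  # 1 = vulnerable, 0 = neutral (non-vulnerable)


def is_fix_commit(message: str) -> bool:
    """Return True if the commit message mentions a vulnerability-fixing keyword."""
    return _FIX_PATTERN.search(message) is not None


def build_samples(commits, latest_files):
    """Assemble file-level samples.

    commits: iterable of (message, [(path, parent_source), ...]), where
        parent_source is the file content before the commit.
    latest_files: mapping of path -> content at the repository's latest version.
    """
    samples = []
    for message, changed in commits:
        if not is_fix_commit(message):
            continue
        # The parent version of a file touched by a fix commit is labeled vulnerable.
        for path, parent_source in changed:
            samples.append(Sample(path, parent_source, 1))
    # Files taken from the latest repository version are labeled neutral.
    for path, source in latest_files.items():
        samples.append(Sample(path, source, 0))
    return samples
```

Under these assumptions, a commit such as "Fix SQL injection in login" contributes the pre-commit versions of its files as vulnerable samples, while a commit such as "Update README" is skipped.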

Created:

2025-03-28
