Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction [dataset]
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8369962
Description:
This dataset contains the extension of a publicly available dataset that was published initially by Ferenc et al. in their paper:
“Ferenc, R.; Hegedus, P.; Gyimesi, P.; Antal, G.; Bán, D.; Gyimóthy, T. Challenging machine learning algorithms in predicting vulnerable javascript functions. 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). IEEE, 2019, pp. 8–14.”
The original dataset contained software metrics for source code functions written in the JavaScript (JS) programming language. Each function was labeled as vulnerable or clean, based on vulnerabilities that the authors gathered from publicly available vulnerability databases.
In our paper, “Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction”, cited as:
“Kalouptsoglou I, Siavvas M, Kehagias D, Chatzigeorgiou A, Ampatzoglou A. Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction. Entropy. 2022; 24(5):651. https://doi.org/10.3390/e24050651”
we presented an extended version of the dataset by extracting textual features for the labeled JS functions. In particular, we obtained the dataset provided by Ferenc et al. in CSV format and gathered the GitHub URLs of all its functions (i.e., methods). Using these URLs, we collected the source code of the corresponding JS files from GitHub. Then, using the start and end line information of each function, we extracted the code of each function from its file. Each function was then tokenized to produce a list of tokens per function.
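The line-based extraction step can be sketched as follows. This is a minimal illustration, not the authors' script: the helper name `extract_function` and the sample file are hypothetical, and the start and end lines are assumed to be 1-based and inclusive.

```python
def extract_function(source: str, start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line of source.

    Assumption: line numbers are 1-based and inclusive.
    """
    lines = source.splitlines()
    if not (1 <= start_line <= end_line <= len(lines)):
        raise ValueError("line range is outside the file")
    return "\n".join(lines[start_line - 1:end_line])


# Hypothetical JS file content, used only for illustration.
js_file = """var a = 1;
function add(x, y) {
  return x + y;
}
console.log(add(a, 2));"""

snippet = extract_function(js_file, 2, 4)
print(snippet)
```

If the line numbers in the metrics CSV turn out to be 0-based, the slice bounds would need to be adjusted accordingly.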
To extract text features, we used a text mining technique called sequences of tokens. As a result, we created a repository with the source code of all methods, the token sequence of each method, and their labels. To improve the generalizability of the tokens, all comments were removed, and all integers and strings were replaced with two unique IDs (one for integers and one for strings).
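A rough sketch of this normalization, under stated assumptions, is shown below. The placeholder IDs `<STR>` and `<NUM>` are hypothetical (the description does not give the actual IDs), and the regular-expression tokenizer is a simplification rather than the tokenizer used for the dataset: it treats any run of digits with an optional fraction as a number, splits multi-character operators such as `===` into single characters, and does not handle template strings or regular-expression literals.

```python
import re

# Hypothetical placeholder IDs; the dataset's real IDs are not stated.
STR_ID = "<STR>"
NUM_ID = "<NUM>"

# One named group per token kind; alternatives are tried in order.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<punct>[^\sA-Za-z0-9_$])
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(code: str) -> list[str]:
    """Drop comments and replace string/number literals with placeholder IDs."""
    tokens = []
    for match in TOKEN_RE.finditer(code):
        kind = match.lastgroup
        if kind == "comment":
            continue  # comments are removed entirely
        if kind == "string":
            tokens.append(STR_ID)
        elif kind == "number":
            tokens.append(NUM_ID)
        else:
            tokens.append(match.group())
    return tokens


example = 'function f(x) { // add\n  return x + 42 + "a"; }'
print(tokenize(example))
```

Replacing literals with a shared ID keeps the vocabulary small, so a model sees the same token for `42` and `7` instead of treating every constant as a distinct feature.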
The dataset contains 12,106 JavaScript functions, of which 1,493 are labeled vulnerable.
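These counts imply a noticeable class imbalance, which the short calculation below makes explicit. It uses only the two figures stated above; the variable names are illustrative.

```python
total_functions = 12_106
vulnerable = 1_493

clean = total_functions - vulnerable
vulnerable_share = vulnerable / total_functions
clean_per_vulnerable = clean / vulnerable

print(f"clean functions: {clean}")                          # 10613
print(f"vulnerable share: {vulnerable_share:.1%}")          # 12.3%
print(f"clean per vulnerable: {clean_per_vulnerable:.1f}")  # 7.1
```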
This dataset was created and used during the Vulnerability Prediction Task of the Horizon 2020 IoTAC Project as training and evaluation data for building vulnerability prediction models. The dataset is provided in CSV format. Each row of the CSV file has the following fields:
Label: Flag with values ‘1’ for vulnerable and ‘0’ for non-vulnerable methods
Name: The name of the JavaScript method
Longname: The longname of the JavaScript method
Path: The path of the file of the method in the repository
Full_repo_path: The GitHub URL of the file of the method
TokenX: Each next row corresponds to each token included in the method
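A possible way to read such a file is sketched below. Several points are assumptions rather than facts from the description: the five metadata fields are taken to appear in the order listed, the tokens are taken to follow them as additional fields on the same line (the description is ambiguous here, so this should be checked against the actual file), and no header row is handled. The `Method` class, `read_methods` helper, and sample rows are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Method:
    label: int            # 1 = vulnerable, 0 = non-vulnerable
    name: str
    longname: str
    path: str
    full_repo_path: str
    tokens: list[str]


def read_methods(handle) -> list[Method]:
    """Parse rows positionally: five metadata fields, then the tokens.

    Assumption: tokens follow the metadata on the same line and the
    file has no header row.
    """
    methods = []
    for row in csv.reader(handle):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        label, name, longname, path, url, *tokens = row
        tokens = [t for t in tokens if t != ""]  # drop empty padding fields
        methods.append(Method(int(label), name, longname, path, url, tokens))
    return methods


# Hypothetical sample rows, used in place of the real file.
sample = io.StringIO(
    '1,add,lib.add,src/lib.js,https://github.com/example/repo/blob/main/src/lib.js,'
    'function,add,(,x,",",y,)\n'
    '0,noop,lib.noop,src/lib.js,https://github.com/example/repo/blob/main/src/lib.js,'
    'function,noop,(,)\n'
)

methods = read_methods(sample)
for m in methods:
    print(m.label, m.name, len(m.tokens))
```

To read the downloaded file instead of the sample, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, as the `csv` module documentation recommends.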
Created:
2023-09-29



