
ASE2021 vulnerability fix dataset

Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/5513050
Description:
This is the dataset for "Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes", which was accepted at the 36th IEEE/ACM Automated Software Engineering (ASE) Conference.

The columns are described below:

commit_id: The commit ID/hash.
repo: The GitHub author and repository (e.g., "apache/hive").
filename: The name of the file changed in the commit.
partition: Which dataset the commit information belongs to (i.e., "train", "val", or "test").
PL: Programming language (i.e., "java" or "py").
label: Label of the commit: 0 for a non-vulnerability-fixing commit and 1 for a vulnerability-fixing commit.
diff: The entire code change information of the file in this commit.
committer_date: The date of the commit (e.g., 2015-03-02 13:48:25+13:00).
msg: The commit message (NA if empty).
MOD_DIFF: The code change of the file in this commit after preprocessing: lines that are neither added nor removed are filtered out, and refactoring information and comments are removed.
BPE_MOD_DIFF: BPE processing applied to the MOD_DIFF information (using the codeprep Python package).
ADD_DIFF: The added lines from the MOD_DIFF information (lines starting with the '+' character).
REM_DIFF: The removed lines from the MOD_DIFF information (lines starting with the '-' character).
LOC_ADD: Total lines of code added in this file change.
LOC_REM: Total lines of code removed in this file change.
LOC_MOD: Total lines of code modified in this file change (LOC_ADD + LOC_REM).
commit_repo: The commit ID and repository concatenated.
cve_list: A list of CVEs which the commit fixes (e.g., CVE-2015-5348, CVE-2016-8902).

The following code snippet reproduces Table 1:

    import pandas as pd

    all_commits = pd.read_csv('./ase_dataset_sept_19_2021.csv')

    # Separate by language, since the Java commits are missing some info
    # which we will add later on.
    # The PL column description lists "py" for Python; 'python' is also
    # accepted in case the file spells it out.
    py = all_commits[all_commits.PL.isin(['py', 'python'])]
    java = all_commits[all_commits.PL == 'java']

    # Java first: partition into train/val/test and check # of commits.
    # A commit can touch several files (one row each), so rows are
    # de-duplicated on commit_id before counting labels.
    print("Java VF vs NVF for train/val/test")
    java_train = java[java.partition == "train"]
    java_val = java[java.partition == "val"]
    java_test = java[java.partition == "test"]
    print(java_train.drop_duplicates(subset='commit_id').label.value_counts())
    print(java_val.drop_duplicates(subset='commit_id').label.value_counts())
    print(java_test.drop_duplicates(subset='commit_id').label.value_counts())

    # Python: partition into train/val/test and check # of commits.
    print("Py VF vs NVF for train/val/test")
    py_train = py[py.partition == "train"]
    py_val = py[py.partition == "val"]
    py_test = py[py.partition == "test"]
    print(py_train.drop_duplicates(subset='commit_id').label.value_counts())
    print(py_val.drop_duplicates(subset='commit_id').label.value_counts())
    print(py_test.drop_duplicates(subset='commit_id').label.value_counts())
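The relationship between the derived columns (MOD_DIFF, ADD_DIFF, REM_DIFF, LOC_MOD = LOC_ADD + LOC_REM, and cve_list) can be illustrated with a small standard-library sketch. It is not the authors' preprocessing code. The helper names `split_mod_diff` and `parse_cve_list` are hypothetical, the toy row is invented for illustration, and the sketch assumes that the '+'/'-' prefix is kept on each line and that an empty cve_list appears as "NA" or an empty string (the description only states the NA convention for msg). Counts computed this way on real rows may not match the dataset's LOC columns exactly.

```python
def split_mod_diff(mod_diff):
    """Split a preprocessed diff into (added, removed) line lists.

    Assumption: added lines start with '+' and removed lines with '-',
    as stated in the ADD_DIFF / REM_DIFF column descriptions.
    """
    added, removed = [], []
    for line in mod_diff.splitlines():
        if line.startswith('+'):
            added.append(line)
        elif line.startswith('-'):
            removed.append(line)
    return added, removed


def parse_cve_list(cve_list):
    """Turn a comma-separated CVE string into a list of CVE IDs.

    Assumption: a missing value is None, an empty string, or 'NA'.
    """
    if cve_list is None or cve_list.strip() in ('', 'NA'):
        return []
    return [cve.strip() for cve in cve_list.split(',') if cve.strip()]


# Toy row, invented for illustration (not taken from the dataset).
toy_row = {
    'MOD_DIFF': (
        "-if (user == null) return;\n"
        "+if (user == null || !user.isValid()) return;\n"
        "+log.warn(\"invalid user\");"
    ),
    'cve_list': 'CVE-2015-5348, CVE-2016-8902',
}

added, removed = split_mod_diff(toy_row['MOD_DIFF'])
loc_add, loc_rem = len(added), len(removed)
loc_mod = loc_add + loc_rem  # LOC_MOD = LOC_ADD + LOC_REM
cves = parse_cve_list(toy_row['cve_list'])

print(loc_add, loc_rem, loc_mod)  # prints: 2 1 3
print(cves)                       # prints: ['CVE-2015-5348', 'CVE-2016-8902']
```

The same helpers could be applied row by row to the loaded CSV as a sanity check on the derived columns, keeping in mind the assumptions listed above.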
Created:
2023-03-08