ASE2021 vulnerability fix dataset
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/5513050
Description:
This is the dataset for the paper "Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes", which was accepted at the 36th IEEE/ACM Automated Software Engineering (ASE) Conference.
The columns are described below:
commit_id: The commit ID/hash.
repo: The GitHub owner and repository name (e.g., "apache/hive").
filename: The name of the file changed in the commit.
partition: Which dataset the commit information belongs to (i.e., "train", "val", or "test").
PL: Programming Language (PL) (i.e., "java" or "py").
label: Label of the commit, 0 for non-vulnerability fixing commit and 1 for vulnerability fixing commit.
diff: The entire code change information of the file in this commit.
committer_date: The date and time of the commit, including the UTC offset (e.g., 2015-03-02 13:48:25+13:00).
msg: The commit message (NA if empty).
MOD_DIFF: The code change of the file in this commit after preprocessing: filtering out lines that are not added lines or removed lines, and removing refactoring information and comments.
BPE_MOD_DIFF: BPE processing applied to MOD_DIFF information (using codeprep Python package).
ADD_DIFF: The added lines from the MOD_DIFF information (indicated as a line starting with '+' character).
REM_DIFF: The removed lines from the MOD_DIFF information (indicated as a line starting with '-' character).
LOC_ADD: Total lines of code added in this file change.
LOC_REM: Total lines of code removed in this file change.
LOC_MOD: Total lines of code modified in this file change (LOC_ADD + LOC_REM).
commit_repo: The commit ID and repository concatenated.
cve_list: A list of CVEs which the commit fixes (e.g., CVE-2015-5348, CVE-2016-8902).
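The column descriptions above suggest a few simple per-row checks. The sketch below is only an illustration under stated assumptions: the helper names (parse_cve_list, parse_committer_date, loc_is_consistent) are hypothetical, the sample row is made up apart from the example values quoted above, and the exact way cve_list is serialized in the CSV is not documented here, so the CVE identifiers are extracted with a regular expression rather than by splitting on a specific separator.

```python
import re
from datetime import datetime, timezone

# Matches identifiers such as CVE-2015-5348 regardless of the surrounding separators.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def parse_cve_list(raw):
    """Return the CVE identifiers found in a raw cve_list cell."""
    if not isinstance(raw, str):
        return []  # e.g. a NaN that pandas produces for an empty cell
    return CVE_PATTERN.findall(raw)


def parse_committer_date(raw):
    """Parse a committer_date string with a UTC offset and normalize it to UTC."""
    return datetime.fromisoformat(raw).astimezone(timezone.utc)


def loc_is_consistent(row):
    """Check the documented relation LOC_MOD = LOC_ADD + LOC_REM for one row."""
    return int(row["LOC_MOD"]) == int(row["LOC_ADD"]) + int(row["LOC_REM"])


# Hypothetical row for illustration; the values are not taken from the dataset.
row = {
    "repo": "apache/hive",
    "committer_date": "2015-03-02 13:48:25+13:00",
    "LOC_ADD": 3,
    "LOC_REM": 2,
    "LOC_MOD": 5,
    "cve_list": "CVE-2015-5348, CVE-2016-8902",
}

print(parse_cve_list(row["cve_list"]))
print(parse_committer_date(row["committer_date"]))
print(loc_is_consistent(row))
```

With pandas, the same helpers could be applied column-wise, for example all_commits.cve_list.apply(parse_cve_list), assuming the frame has been loaded as in the Table 1 snippet.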
The following code snippet reproduces Table 1.
import pandas as pd
all_commits = pd.read_csv('./ase_dataset_sept_19_2021.csv')
# Separate by language, since the Java commits are missing some info which we will add later on.
py = all_commits[all_commits.PL == 'python']
java = all_commits[all_commits.PL == 'java']
# Java first: partition into train/val/test and check the number of commits
print("Java VF vs NVF for train/val/test")
java_train = java[java.partition == "train"]
java_val = java[java.partition == "val"]
java_test = java[java.partition == "test"]
print(java_train.drop_duplicates(subset='commit_id').label.value_counts())
print(java_val.drop_duplicates(subset='commit_id').label.value_counts())
print(java_test.drop_duplicates(subset='commit_id').label.value_counts())
# Python: partition into train/val/test and check the number of commits
print("Py VF vs NVF for train/val/test")
py_train = py[py.partition == "train"]
py_val = py[py.partition == "val"]
py_test = py[py.partition == "test"]
print(py_train.drop_duplicates(subset='commit_id').label.value_counts())
print(py_val.drop_duplicates(subset='commit_id').label.value_counts())
print(py_test.drop_duplicates(subset='commit_id').label.value_counts())
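The per-language, per-partition counts above can also be obtained in a single pass. The following is a minimal sketch, not part of the original snippet: count_commits is a hypothetical helper, and the toy frame is made up so the example runs without the CSV. It keeps one row per commit within each (PL, partition) group, which mirrors the drop_duplicates(subset='commit_id') calls above, and then counts labels per group.

```python
import pandas as pd


def count_commits(df):
    """Count unique commits per (PL, partition, label)."""
    # Keep the first row of each commit within its language and partition.
    unique = df.drop_duplicates(subset=["PL", "partition", "commit_id"])
    return unique.groupby(["PL", "partition", "label"]).size()


# Toy data for illustration only; commit "a" touches two files.
toy = pd.DataFrame({
    "commit_id": ["a", "a", "b", "c", "d"],
    "PL": ["java", "java", "java", "python", "python"],
    "partition": ["train", "train", "train", "test", "test"],
    "label": [1, 1, 0, 1, 1],
})

counts = count_commits(toy)
print(counts)
```

On the real file this would be called as count_commits(all_commits). Because the grouping uses whatever PL values are present, it also shows how the language is actually encoded in the CSV, which is worth checking since the column description mentions "py" while the snippet above filters on 'python'.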
Created:
2023-03-08



