
pebblebed/kernel-vuln-dataset-full

Source: Hugging Face · Updated 2026-02-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/pebblebed/kernel-vuln-dataset-full
Resource description:
---
license: gpl-2.0
task_categories:
- text-classification
language:
- en
tags:
- code
- security
- vulnerability-detection
- linux-kernel
- git-commits
size_categories:
- 1M<n<10M
---

# Linux Kernel Vulnerability-Introducing Commits Dataset

## Dataset Description

A labeled dataset of **1,426,202 Linux kernel git commits** with full metadata, diffs, and binary labels indicating whether each commit introduced a vulnerability that was later fixed.

**Intended use:** Training and evaluating models for vulnerability-introducing commit detection: predicting whether a given code change will later require a security or bug fix.

## How the Data Was Collected

1. **Repository:** The full Linux kernel git history was cloned from `git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git` (all branches and tags, complete history).
2. **Metadata extraction:** A single `git log --all` pass extracted structured metadata for every reachable commit (1,426,202 total), including author/committer info, dates, and the full commit message parsed into subject, body, and trailers.
3. **Diff extraction:** A single `git log --all -p --numstat` pass extracted the complete unified diff and per-file insertion/deletion counts for every commit.
4. **Labeling via Fixes tag mining:** The `vuln_commits_full.csv` dataset maps fixing commits to the commits they fix, identified through the kernel's `Fixes:` trailer convention. Each commit whose abbreviated hash appears as an `introducing_commit` in that mapping receives `label=1`; all others receive `label=0`.
5. **Stratified splitting:** The dataset was split 80/10/10 into train/validation/test with stratified sampling (seed=42) to preserve the label distribution across splits.
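Steps 4 and 5 can be illustrated with a minimal sketch. It is not the pipeline used to build the dataset: the regular expression, the prefix-based hash matching, the rounding of split sizes, and the function names `introducing_hashes`, `label_commits`, and `stratified_split` are all assumptions made for illustration. Only the `hash` and `trailers` field names, the 80/10/10 ratio, and the seed come from the card.

```python
import random
import re

# Matches a kernel-style trailer such as:
#   Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
# The 8-40 hex-digit range is an assumption, not something the card specifies.
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})\b", re.IGNORECASE | re.MULTILINE)


def introducing_hashes(commits):
    """Collect every (possibly abbreviated) hash named in a Fixes: trailer."""
    referenced = set()
    for commit in commits:
        for match in FIXES_RE.finditer(commit["trailers"]):
            referenced.add(match.group(1).lower())
    return referenced


def label_commits(commits):
    """Map full hash -> 1 if some Fixes: trailer points at it, else 0."""
    referenced = introducing_hashes(commits)
    labels = {}
    for commit in commits:
        full = commit["hash"]
        # Prefix match (an assumption) so abbreviations of any length resolve.
        labels[full] = int(any(full.startswith(ref) for ref in referenced))
    return labels


def stratified_split(labels, seed=42):
    """Split hashes 80/10/10 separately within each label value."""
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for value in (0, 1):
        group = sorted(h for h, y in labels.items() if y == value)
        rng.shuffle(group)
        n_train = round(0.8 * len(group))
        n_val = round(0.1 * len(group))
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]
    return splits


# Toy commits for illustration only; not taken from the dataset.
toy_commits = [
    {"hash": "a" * 40, "trailers": "Signed-off-by: A <a@example.org>"},
    {"hash": "b" * 40,
     "trailers": 'Fixes: aaaaaaaaaaaa ("toy subject")\nSigned-off-by: B <b@example.org>'},
]
toy_labels = label_commits(toy_commits)
```

In this toy input the second commit carries a `Fixes:` trailer that names the first, so the first commit is the one labeled as introducing. Splitting within each label group is one simple way to keep the positive rate the same across splits; a library routine such as scikit-learn's `train_test_split` with `stratify=` would be a common alternative.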
## Column Descriptions

| Column | Type | Description |
|--------|------|-------------|
| `hash` | string | Full 40-character commit SHA-1 |
| `abbreviated_hash` | string | Short commit hash (typically 12 characters) |
| `parent_hashes` | string | Space-separated full hashes of parent commits |
| `parent_count` | int32 | Number of parent commits (0=root, 1=normal, 2+=merge) |
| `author_name` | string | Original author's name |
| `author_email` | string | Original author's email |
| `author_date` | string | Author timestamp in strict ISO 8601 format |
| `committer_name` | string | Committer's name (person who applied the patch) |
| `committer_email` | string | Committer's email |
| `committer_date` | string | Committer timestamp in strict ISO 8601 format |
| `subject` | string | First line of the commit message |
| `body` | string | Commit message text between subject and trailers |
| `trailers` | string | Structured trailer lines (Signed-off-by, Fixes, Reviewed-by, Cc, Link, etc.) joined by newlines |
| `diff_raw` | large_string | Complete unified diff output (diff headers, hunks, context lines, etc.) |
| `insertions` | int64 | Total lines added across all files (from numstat) |
| `deletions` | int64 | Total lines removed across all files (from numstat) |
| `files_changed` | int32 | Number of files modified |
| `label` | int8 | 1 = vulnerability-introducing commit, 0 = clean |

## Split Sizes and Label Distribution

| Split | Rows | Label=1 | Label=0 | Positive % |
|-------|------|---------|---------|------------|
| train | 1,140,962 | 64,002 | 1,076,960 | 5.61% |
| validation | 142,620 | 8,000 | 134,620 | 5.61% |
| test | 142,620 | 8,000 | 134,620 | 5.61% |
| **total** | **1,426,202** | **80,002** | **1,346,200** | **5.61%** |

## Known Limitations

1. **Fixes tags capture all bugs, not just security vulnerabilities.** The kernel's `Fixes:` trailer is used for all bug fixes (logic errors, performance regressions, build failures, and similar), not exclusively for security-critical vulnerabilities. The label therefore reflects "introduced a defect that was later fixed" rather than "introduced a security vulnerability" specifically.
2. **Label noise from catch-all hashes.** Some `Fixes:` tags point to very early commits (e.g., `1da177e4c3f4`, the initial Linux 2.6.12-rc2 import) when the actual introducing commit is unknown or predates git history. These commits receive `label=1` despite not being the true root cause.
3. **Right-censoring of recent commits.** Commits near the HEAD of the repository have had less time to be identified as bug-introducing. Some recent `label=0` commits may in fact contain undiscovered bugs, introducing a systematic false-negative bias toward the end of the timeline.
4. **Merge commits have empty diffs.** By default, `git log -p` does not produce diffs for merge commits. Of the 80,002 label=1 commits, 155 (0.19%) are merges with empty `diff_raw`. These should be excluded or handled specially in modeling.
5. **Feature distribution skew.** Label=1 commits tend to be larger than label=0 commits (mean diff length 3.8x, mean insertions 8.3x). Models may learn to use commit size as a shortcut rather than understanding code semantics. Consider controlling for commit size in evaluation.
6. **Single-project bias.** All data comes from one project (the Linux kernel). Models trained on this data may not generalize to other codebases with different coding conventions, review processes, or commit practices.
7. **Incomplete Fixes tag coverage.** Not all bug-fixing commits in the kernel carry a `Fixes:` trailer. The labeled set of introducing commits is therefore a lower bound: the true number of bug-introducing commits is likely higher, meaning some `label=0` commits are mislabeled.
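Limitations 2 and 4 lend themselves to a simple pre-filter before training. The sketch below is one possible approach, not a recommendation from the dataset authors. The helper names and the `CATCH_ALL_PREFIXES` tuple are hypothetical; the only hash in it is the one cited in limitation 2, and the field names follow the Column Descriptions section.

```python
# Abbreviated hashes treated as catch-all Fixes: targets. Only this one entry
# comes from the dataset card; extend the tuple after inspecting the data.
CATCH_ALL_PREFIXES = ("1da177e4c3f4",)


def is_merge_with_empty_diff(row):
    """True for merge commits (2+ parents) whose diff_raw is empty."""
    return row["parent_count"] >= 2 and not row["diff_raw"]


def is_catch_all_target(row):
    """True for positives whose hash starts with a known catch-all prefix."""
    return row["label"] == 1 and row["hash"].startswith(CATCH_ALL_PREFIXES)


def filter_rows(rows):
    """Drop rows affected by limitations 2 and 4; keep everything else."""
    return [
        row for row in rows
        if not is_merge_with_empty_diff(row) and not is_catch_all_target(row)
    ]


def positive_rate(rows):
    """Share of label=1 rows, or 0.0 for an empty input."""
    return sum(row["label"] for row in rows) / len(rows) if rows else 0.0


# Toy rows for illustration only. The first hash is the cited abbreviation
# padded with zeros, not the real full SHA-1.
toy_rows = [
    {"hash": "1da177e4c3f4" + "0" * 28, "parent_count": 0,
     "diff_raw": "diff --git ...", "label": 1},
    {"hash": "c" * 40, "parent_count": 2, "diff_raw": "", "label": 1},
    {"hash": "d" * 40, "parent_count": 1, "diff_raw": "diff --git ...", "label": 1},
    {"hash": "e" * 40, "parent_count": 1, "diff_raw": "diff --git ...", "label": 0},
]
kept = filter_rows(toy_rows)
```

Filtering changes the positive rate, so it is worth recomputing it with `positive_rate` afterwards and applying the same filter to every split. Whether to drop catch-all positives or relabel them is a modeling decision the card leaves open.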
## Citation

If you use this dataset, please cite:

```bibtex
@misc{linux_vuln_commits_2026,
  title={Linux Kernel Vulnerability-Introducing Commits Dataset},
  year={2026},
  note={Derived from the Linux kernel git history and Fixes-tag mining},
  url={https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git}
}
```

## License

The Linux kernel is licensed under **GPL-2.0**. Commit metadata and diffs are derived from the kernel source repository and are distributed under the same license.
Provider:
pebblebed