xamxte/cve-to-cwe
Hugging Face · Updated 2026-03-17 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/xamxte/cve-to-cwe
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
language:
- en
tags:
- cybersecurity
- vulnerability
- cwe
- cve
- nvd
- mitre-attack
pretty_name: CVE-to-CWE + ATT&CK Classification Dataset
size_categories:
- 100K<n<1M
---
# CVE-to-CWE + ATT&CK Classification Dataset
A dataset for mapping CVE (Common Vulnerabilities and Exposures) descriptions to CWE (Common Weakness Enumeration) categories and MITRE ATT&CK techniques. Built from the National Vulnerability Database (NVD) with AI-assisted label refinement.
## Tasks
1. **CVE → CWE classification** (single-label, 205 classes) — map vulnerability description to weakness type
2. **CVE → ATT&CK technique mapping** (multi-label, 361 techniques) — map vulnerability to attack techniques
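Because the second task is multi-label, each sample's `attack_techniques` list usually has to be turned into a fixed-width 0/1 vector before training. The following is a minimal sketch of one common way to do that, not the authors' pipeline; the helper names `build_vocab` and `multi_hot` are hypothetical, and the three toy samples are illustrative only.

```python
from typing import Iterable, Sequence


def build_vocab(samples: Iterable[Sequence[str]]) -> dict[str, int]:
    """Assign a stable column index to every technique ID seen in the data."""
    techniques = sorted({t for sample in samples for t in sample})
    return {t: i for i, t in enumerate(techniques)}


def multi_hot(techniques: Sequence[str], vocab: dict[str, int]) -> list[int]:
    """Encode one sample's technique list as a 0/1 vector over the vocabulary."""
    vec = [0] * len(vocab)
    for t in techniques:
        if t in vocab:  # IDs missing from the vocabulary are skipped
            vec[vocab[t]] = 1
    return vec


# Toy samples (illustrative); the last one has no mapped technique.
samples = [["T1190", "T1059.007"], ["T1203"], []]
vocab = build_vocab(samples)
vectors = [multi_hot(s, vocab) for s in samples]
```

On the full dataset the vocabulary would have 361 columns, one per unique technique.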
## Dataset Summary
| Split | Samples | CWE Label Source | ATT&CK Coverage |
|-------|---------|-----------------|-----------------|
| Train | 234,770 | Claude Sonnet 4.6 relabeled | 97.2% |
| Validation | 27,896 | Agreement-filtered (NVD == Sonnet) | 98.2% |
| Test | 27,780 | Agreement-filtered (NVD == Sonnet) | 98.2% |
- **CWE classes:** 205
- **ATT&CK techniques:** 361 unique (multi-label per sample)
- **Years covered:** 1999–2026
- **Source:** NVD (National Vulnerability Database)
## Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `cve_id` | string | CVE identifier (e.g., "CVE-2024-12345") |
| `description` | string | Vulnerability description from NVD |
| `cwe_id` | string | CWE category (e.g., "CWE-79") |
| `label` | int | Numeric CWE label ID (0–204), see `label_map.json` |
| `attack_techniques` | list[string] | MITRE ATT&CK technique IDs (e.g., ["T1190", "T1059.007"]) |
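The structure of `label_map.json` is not described here, so the sketch below avoids it and instead derives the `label` ↔ `cwe_id` mapping from the records themselves, checking that the two fields are consistent. `build_label_maps` is a hypothetical helper; the 186 ↔ CWE-862 pair comes from the usage example in this card, while the second record is a made-up placeholder.

```python
def build_label_maps(records):
    """Derive label -> cwe_id and cwe_id -> label, raising on any conflict."""
    label_to_cwe = {}
    for rec in records:
        label, cwe = rec["label"], rec["cwe_id"]
        previous = label_to_cwe.setdefault(label, cwe)
        if previous != cwe:
            raise ValueError(f"label {label} maps to both {previous} and {cwe}")
    cwe_to_label = {cwe: label for label, cwe in label_to_cwe.items()}
    return label_to_cwe, cwe_to_label


records = [
    {"label": 186, "cwe_id": "CWE-862"},  # pair taken from the usage example
    {"label": 0, "cwe_id": "CWE-79"},     # hypothetical pair, for illustration
    {"label": 186, "cwe_id": "CWE-862"},
]
label_to_cwe, cwe_to_label = build_label_maps(records)
```

The same function accepts any iterable of dict-like rows, so a loaded split could be passed in directly.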
## CWE Label Quality
The original NVD CWE labels are known to be noisy (often too generic, e.g., CWE-20 "Improper Input Validation" used as a catch-all). To improve label quality:
1. All 318,979 CVE descriptions were relabeled using Claude Sonnet 4.6 via the Anthropic Batch API (~$395 total cost)
2. **73.1% exact CWE ID agreement** between NVD and Sonnet labels (84.5% with hierarchy-aware matching, so about 42% of the disagreements, 11.4 of 26.9 percentage points, are granularity differences rather than conflicting labels)
3. **Validation and test sets** contain only agreement-filtered samples where NVD and Sonnet labels match exactly
4. **Training set** uses Sonnet labels for all samples (including disagreements and previously unlabeled CVEs)
As a result, the validation and test sets are a high-confidence subset on which two independent labelers agree, but they are biased toward unambiguous cases because samples where the labelers disagree are excluded.
**Manual validation:** A random sample of 100 NVD-Sonnet disagreements was manually reviewed — Sonnet was clearly more accurate in 72% of cases, NVD in only 3%, with the remainder being ambiguous or both-acceptable (hierarchy/sibling CWEs).
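The labeling rules above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: `agreement_rate` and `assign_label` are hypothetical names, the four rows are invented, and the agreement rate is assumed to be computed only over CVEs that have an NVD label. The final lines reproduce the arithmetic behind the granularity estimate from the two reported percentages.

```python
def agreement_rate(pairs):
    """Fraction of (nvd_cwe, model_cwe) pairs whose IDs match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(nvd == model for nvd, model in pairs) / len(pairs)


def assign_label(nvd_cwe, model_cwe):
    """Return (training_label, eval_eligible) for one CVE.

    Training always uses the model label; a CVE is eligible for
    validation/test only when NVD has a label and it matches exactly.
    """
    eval_eligible = nvd_cwe is not None and nvd_cwe == model_cwe
    return model_cwe, eval_eligible


# Invented rows: (NVD label or None, model label).
rows = [
    ("CWE-79", "CWE-79"),    # exact agreement
    ("CWE-20", "CWE-89"),    # disagreement
    (None, "CWE-862"),       # previously unlabeled in NVD
    ("CWE-119", "CWE-121"),  # granularity difference
]
labelled = [(n, m) for n, m in rows if n is not None]
rate = agreement_rate(labelled)
eligible = [assign_label(n, m)[1] for n, m in rows]

# Share of disagreements recovered by hierarchy-aware matching,
# using the percentages reported above.
exact, hierarchical = 73.1, 84.5
granularity_share = (hierarchical - exact) / (100 - exact)  # about 0.42
```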
## Top 20 CWE Classes
| CWE | Name | Train Count | % |
|-----|------|------------|---|
| CWE-79 | Cross-site Scripting | 33,858 | 14.4% |
| CWE-89 | SQL Injection | 15,619 | 6.7% |
| CWE-22 | Path Traversal | 8,047 | 3.4% |
| CWE-121 | Stack-based Buffer Overflow | 7,651 | 3.3% |
| CWE-862 | Missing Authorization | 7,533 | 3.2% |
| CWE-78 | OS Command Injection | 7,132 | 3.0% |
| CWE-125 | Out-of-bounds Read | 6,770 | 2.9% |
| CWE-200 | Information Exposure | 6,516 | 2.8% |
| CWE-787 | Out-of-bounds Write | 6,508 | 2.8% |
| CWE-20 | Improper Input Validation | 6,299 | 2.7% |
| CWE-352 | CSRF | 6,270 | 2.7% |
| CWE-416 | Use After Free | 6,009 | 2.6% |
| CWE-119 | Buffer Overflow | 5,943 | 2.5% |
| CWE-400 | Resource Exhaustion | 5,809 | 2.5% |
| CWE-284 | Improper Access Control | 5,270 | 2.2% |
| CWE-476 | NULL Pointer Dereference | 4,931 | 2.1% |
| CWE-122 | Heap-based Buffer Overflow | 4,787 | 2.0% |
| CWE-434 | Unrestricted Upload | 3,697 | 1.6% |
| CWE-306 | Missing Authentication | 3,313 | 1.4% |
| CWE-190 | Integer Overflow | 3,210 | 1.4% |
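The `%` column appears to be each class's share of the 234,770 training samples. A quick check on the first three rows, assuming that denominator, reproduces the published figures:

```python
TRAIN_SIZE = 234_770  # training split size from the Dataset Summary table

# First three rows of the table above.
top_counts = {"CWE-79": 33_858, "CWE-89": 15_619, "CWE-22": 8_047}

# Percentage of the training split, rounded to one decimal place.
shares = {cwe: round(100 * n / TRAIN_SIZE, 1) for cwe, n in top_counts.items()}
```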
## Top 15 ATT&CK Techniques
| Technique | Name | Count |
|-----------|------|-------|
| T1190 | Exploit Public-Facing Application | 127,837 |
| T1203 | Exploitation for Client Execution | 45,480 |
| T1499 | Endpoint Denial of Service | 30,344 |
| T1068 | Exploitation for Privilege Escalation | 25,481 |
| T1059.007 | JavaScript | 24,185 |
| T1059 | Command and Scripting Interpreter | 19,199 |
| T1005 | Data from Local System | 18,887 |
| T1552 | Unsecured Credentials | 10,495 |
| T1078 | Valid Accounts | 6,285 |
| T1557 | Adversary-in-the-Middle | 4,667 |
| T1189 | Drive-by Compromise | 3,955 |
| T1110 | Brute Force | 2,249 |
| T1083 | File and Directory Discovery | 2,076 |
| T1210 | Exploitation of Remote Services | 1,805 |
| T1040 | Network Sniffing | 1,547 |
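Counts like these can be recomputed from the `attack_techniques` field. The sketch below is illustrative: `technique_counts` and `coverage` are hypothetical helpers, the four toy samples are invented, and "coverage" is assumed to mean the fraction of samples with at least one mapped technique, which this card does not define explicitly.

```python
from collections import Counter


def technique_counts(samples):
    """Count how many samples mention each technique (once per sample)."""
    counts = Counter()
    for techniques in samples:
        counts.update(set(techniques))
    return counts


def coverage(samples):
    """Fraction of samples with at least one technique (assumed definition)."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(1 for techniques in samples if techniques) / len(samples)


# Invented samples; each inner list stands for one row's attack_techniques.
samples = [["T1190", "T1059.007"], ["T1190"], [], ["T1203", "T1190"]]
counts = technique_counts(samples)
top = counts.most_common(1)
share_covered = coverage(samples)
```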
## Data Decontamination
All 2,000 CVEs from the [CTI-Bench](https://github.com/xashru/cti-bench) benchmark (NeurIPS 2024) have been removed from all splits to enable clean external evaluation.
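A decontamination step of this kind amounts to filtering by CVE identifier. The following is a minimal sketch, assuming the benchmark's CVE IDs are available as a plain collection of strings; `decontaminate` is a hypothetical helper and the IDs shown are placeholders, not actual CTI-Bench entries.

```python
def decontaminate(records, benchmark_ids):
    """Drop every record whose cve_id appears in the benchmark ID set."""
    # Normalize case and whitespace so "cve-2024-0001 " still matches.
    blocked = {cve_id.strip().upper() for cve_id in benchmark_ids}
    return [r for r in records if r["cve_id"].strip().upper() not in blocked]


# Placeholder records and benchmark IDs, for illustration only.
records = [
    {"cve_id": "CVE-2024-0001", "cwe_id": "CWE-79"},
    {"cve_id": "CVE-2024-0002", "cwe_id": "CWE-89"},
    {"cve_id": "CVE-2024-0003", "cwe_id": "CWE-22"},
]
benchmark_ids = ["cve-2024-0002 "]
clean = decontaminate(records, benchmark_ids)
```

Applying the same filter to every split before training keeps the external benchmark unseen.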
## Trained Models
- [xamxte/cwe-classifier-roberta-base](https://huggingface.co/xamxte/cwe-classifier-roberta-base): RoBERTa-base fine-tuned on this dataset. 87.4% top-1 accuracy on the agreement-filtered test set, competitive with the best open-weight models on [CTI-Bench](https://github.com/xashru/cti-bench) RCM (75.6% strict).
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("xamxte/cve-to-cwe")
# Access splits
train = dataset["train"]
val = dataset["validation"]
test = dataset["test"]
print(train[0])
# {
# 'cve_id': 'CVE-2025-7782',
# 'description': 'The WP JobHunt plugin for WordPress...',
# 'cwe_id': 'CWE-862',
# 'label': 186,
# 'attack_techniques': ['T1190', 'T1059.007']
# }
```
## CWE Hierarchy Note
This dataset uses **specific (child) CWE categories** where possible, rather than generic parent categories. For example, buffer overflow vulnerabilities are labeled as CWE-121 (Stack-based Buffer Overflow) or CWE-122 (Heap-based Buffer Overflow) rather than the generic CWE-119 (Buffer Overflow). This provides more actionable information for vulnerability triage.
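When comparing a fine-grained label against a coarser one, a hierarchy-aware check treats two CWEs as matching if they are equal or one is an ancestor of the other. The sketch below shows one possible way to do this and is not the exact matching rule used for this dataset; `PARENTS` is a small hand-written excerpt that keeps a single parent per CWE (the real CWE graph allows several), and `ancestors` and `hierarchy_match` are hypothetical helpers.

```python
# Simplified excerpt of the CWE hierarchy: child -> one parent (assumption).
PARENTS = {
    "CWE-121": "CWE-787",
    "CWE-122": "CWE-787",
    "CWE-787": "CWE-119",
    "CWE-125": "CWE-119",
}


def ancestors(cwe, parents=PARENTS):
    """Return the chain of ancestors of a CWE, nearest first."""
    chain = []
    while cwe in parents and parents[cwe] not in chain:  # guard against cycles
        cwe = parents[cwe]
        chain.append(cwe)
    return chain


def hierarchy_match(a, b, parents=PARENTS):
    """True if the IDs are equal or one is an ancestor of the other."""
    return a == b or a in ancestors(b, parents) or b in ancestors(a, parents)


child_vs_parent = hierarchy_match("CWE-121", "CWE-119")  # ancestor relation
siblings = hierarchy_match("CWE-121", "CWE-122")         # share a parent only
```

Under this rule siblings such as CWE-121 and CWE-122 do not match, since neither is an ancestor of the other.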
## Limitations
- **Single-label CWE**: Each CVE is assigned exactly one CWE, though some vulnerabilities may involve multiple weakness types
- **Description-only**: Classification is based solely on the text description; CVSS scores, CPE data, and other metadata are not included
- **English only**: All descriptions are in English (NVD standard)
## Paper
📄 **[Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs](https://arxiv.org/abs/2603.14911)**
## Citation
If you use this dataset, please cite:
```bibtex
@article{mosievskiy2026cwe,
title={Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs},
author={Mosievskiy, Nikita},
journal={arXiv preprint arXiv:2603.14911},
year={2026}
}
```
## License
CC-BY-4.0
Provider:
xamxte



