xamxte/cve-to-cwe

Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/xamxte/cve-to-cwe
Description:
---
license: cc-by-4.0
task_categories:
- text-classification
language:
- en
tags:
- cybersecurity
- vulnerability
- cwe
- cve
- nvd
- mitre-attack
pretty_name: CVE-to-CWE + ATT&CK Classification Dataset
size_categories:
- 100K<n<1M
---

# CVE-to-CWE + ATT&CK Classification Dataset

A dataset for mapping CVE (Common Vulnerabilities and Exposures) descriptions to CWE (Common Weakness Enumeration) categories and MITRE ATT&CK techniques. Built from the National Vulnerability Database (NVD) with AI-assisted label refinement.

## Tasks

1. **CVE → CWE classification** (single-label, 205 classes): map a vulnerability description to a weakness type
2. **CVE → ATT&CK technique mapping** (multi-label, 361 techniques): map a vulnerability to attack techniques

## Dataset Summary

| Split | Samples | CWE Label Source | ATT&CK Coverage |
|-------|---------|-----------------|-----------------|
| Train | 234,770 | Claude Sonnet 4.6 relabeled | 97.2% |
| Validation | 27,896 | Agreement-filtered (NVD == Sonnet) | 98.2% |
| Test | 27,780 | Agreement-filtered (NVD == Sonnet) | 98.2% |

- **CWE classes:** 205
- **ATT&CK techniques:** 361 unique (multi-label per sample)
- **Years covered:** 1999–2026
- **Source:** NVD (National Vulnerability Database)

## Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `cve_id` | string | CVE identifier (e.g., "CVE-2024-12345") |
| `description` | string | Vulnerability description from NVD |
| `cwe_id` | string | CWE category (e.g., "CWE-79") |
| `label` | int | Numeric CWE label ID (0–204), see `label_map.json` |
| `attack_techniques` | list[string] | MITRE ATT&CK technique IDs (e.g., ["T1190", "T1059.007"]) |

## CWE Label Quality

The original NVD CWE labels are known to be noisy (often too generic, e.g., CWE-20 "Improper Input Validation" used as a catch-all). To improve label quality:

1. All 318,979 CVE descriptions were relabeled using Claude Sonnet 4.6 via the Anthropic Batch API (~$395 total cost)
2. **73.1% exact CWE ID agreement** between NVD and Sonnet labels (84.5% with hierarchy-aware matching, indicating that nearly half of disagreements, 11.4 of 26.9 percentage points, are granularity differences)
3. **Validation and test sets** contain only agreement-filtered samples where NVD and Sonnet labels match exactly
4. **Training set** uses Sonnet labels for all samples (including disagreements and previously unlabeled CVEs)

This means val/test are a high-confidence subset where two independent labelers agree, but the subset is biased toward unambiguous cases, because samples where the labelers disagree are excluded.

**Manual validation:** A random sample of 100 NVD-Sonnet disagreements was manually reviewed: Sonnet was clearly more accurate in 72% of cases, NVD in only 3%, with the remainder being ambiguous or both-acceptable (hierarchy/sibling CWEs).

## Top 20 CWE Classes

| CWE | Name | Train Count | % |
|-----|------|------------|---|
| CWE-79 | Cross-site Scripting | 33,858 | 14.4% |
| CWE-89 | SQL Injection | 15,619 | 6.7% |
| CWE-22 | Path Traversal | 8,047 | 3.4% |
| CWE-121 | Stack-based Buffer Overflow | 7,651 | 3.3% |
| CWE-862 | Missing Authorization | 7,533 | 3.2% |
| CWE-78 | OS Command Injection | 7,132 | 3.0% |
| CWE-125 | Out-of-bounds Read | 6,770 | 2.9% |
| CWE-200 | Information Exposure | 6,516 | 2.8% |
| CWE-787 | Out-of-bounds Write | 6,508 | 2.8% |
| CWE-20 | Improper Input Validation | 6,299 | 2.7% |
| CWE-352 | CSRF | 6,270 | 2.7% |
| CWE-416 | Use After Free | 6,009 | 2.6% |
| CWE-119 | Buffer Overflow | 5,943 | 2.5% |
| CWE-400 | Resource Exhaustion | 5,809 | 2.5% |
| CWE-284 | Improper Access Control | 5,270 | 2.2% |
| CWE-476 | NULL Pointer Dereference | 4,931 | 2.1% |
| CWE-122 | Heap-based Buffer Overflow | 4,787 | 2.0% |
| CWE-434 | Unrestricted Upload | 3,697 | 1.6% |
| CWE-306 | Missing Authentication | 3,313 | 1.4% |
| CWE-190 | Integer Overflow | 3,210 | 1.4% |

## Top 15 ATT&CK Techniques

| Technique | Name | Count |
|-----------|------|-------|
| T1190 | Exploit Public-Facing Application | 127,837 |
| T1203 | Exploitation for Client Execution | 45,480 |
| T1499 | Endpoint Denial of Service | 30,344 |
| T1068 | Exploitation for Privilege Escalation | 25,481 |
| T1059.007 | JavaScript | 24,185 |
| T1059 | Command and Scripting Interpreter | 19,199 |
| T1005 | Data from Local System | 18,887 |
| T1552 | Unsecured Credentials | 10,495 |
| T1078 | Valid Accounts | 6,285 |
| T1557 | Adversary-in-the-Middle | 4,667 |
| T1189 | Drive-by Compromise | 3,955 |
| T1110 | Brute Force | 2,249 |
| T1083 | File and Directory Discovery | 2,076 |
| T1210 | Exploitation of Remote Services | 1,805 |
| T1040 | Network Sniffing | 1,547 |

## Data Decontamination

All 2,000 CVEs from the [CTI-Bench](https://github.com/xashru/cti-bench) benchmark (NeurIPS 2024) have been removed from all splits to enable clean external evaluation.

## Trained Models

- [xamxte/cwe-classifier-roberta-base](https://huggingface.co/xamxte/cwe-classifier-roberta-base): RoBERTa-base fine-tuned on this dataset. 87.4% top-1 accuracy on the agreement-filtered test set, competitive with the best open-weight models on [CTI-Bench](https://github.com/xashru/cti-bench) RCM (75.6% strict).

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("xamxte/cve-to-cwe")

# Access splits
train = dataset["train"]
val = dataset["validation"]
test = dataset["test"]

print(train[0])
# {
#   'cve_id': 'CVE-2025-7782',
#   'description': 'The WP JobHunt plugin for WordPress...',
#   'cwe_id': 'CWE-862',
#   'label': 186,
#   'attack_techniques': ['T1190', 'T1059.007']
# }
```

## CWE Hierarchy Note

This dataset uses **specific (child) CWE categories** where possible, rather than generic parent categories. For example, buffer overflow vulnerabilities are labeled as CWE-121 (Stack Buffer Overflow) or CWE-122 (Heap Buffer Overflow) rather than the generic CWE-119 (Buffer Overflow). This provides more actionable information for vulnerability triage.
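The parent/child relationship described in the hierarchy note can be made concrete with a small sketch. The dataset card does not say how its hierarchy-aware matching is computed, so the rule used here (two labels match if they are equal or one is an ancestor of the other) is an assumption. `PARENT` is a hand-written, deliberately tiny excerpt of the CWE tree, not the full MITRE graph, and `ancestors` and `hierarchy_match` are hypothetical helper names.

```python
# Illustrative, simplified parent links (child -> parent). The real CWE
# graph allows several parents per node; this excerpt keeps only one.
PARENT = {
    "CWE-121": "CWE-787",  # Stack-based Buffer Overflow -> Out-of-bounds Write
    "CWE-122": "CWE-787",  # Heap-based Buffer Overflow -> Out-of-bounds Write
    "CWE-787": "CWE-119",  # Out-of-bounds Write -> generic buffer weakness
    "CWE-125": "CWE-119",  # Out-of-bounds Read -> generic buffer weakness
}


def ancestors(cwe_id: str) -> list[str]:
    """Return the chain of parents for cwe_id, nearest parent first."""
    chain = []
    while cwe_id in PARENT:
        cwe_id = PARENT[cwe_id]
        chain.append(cwe_id)
    return chain


def hierarchy_match(a: str, b: str) -> bool:
    """True if the labels are equal or one is an ancestor of the other."""
    return a == b or a in ancestors(b) or b in ancestors(a)


print(hierarchy_match("CWE-121", "CWE-119"))  # True: child vs. generic parent
print(hierarchy_match("CWE-121", "CWE-122"))  # False: siblings, not ancestors
```

Under this reading, a generic NVD label such as CWE-119 and a specific label such as CWE-121 count as a granularity difference rather than a disagreement, while sibling labels still count as a mismatch.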
## Limitations

- **Single-label CWE**: Each CVE is assigned exactly one CWE, though some vulnerabilities may involve multiple weakness types
- **Description-only**: Classification is based solely on the text description; CVSS scores, CPE data, and other metadata are not included
- **English only**: All descriptions are in English (NVD standard)

## Paper

📄 **[Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs](https://arxiv.org/abs/2603.14911)**

## Citation

If you use this dataset, please cite:

```bibtex
@article{mosievskiy2026cwe,
  title={Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs},
  author={Mosievskiy, Nikita},
  journal={arXiv preprint arXiv:2603.14911},
  year={2026}
}
```

## License

CC-BY-4.0
Provider:
xamxte