
LorenzoNava/cve-cwe-dataset-cleaned

Source: Hugging Face. Updated: 2025-11-20. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/LorenzoNava/cve-cwe-dataset-cleaned
Resource description:
# CVE-CWE Dataset (Cleaned)

Cleaned version of the CVE-CWE dataset with only standard CWE classifications.

## Dataset Source

**Original Dataset:** [stasvinokur/cve-and-cwe-dataset-1999-2025](https://huggingface.co/datasets/stasvinokur/cve-and-cwe-dataset-1999-2025)

This dataset contains CVE (Common Vulnerabilities and Exposures) descriptions paired with their corresponding CWE (Common Weakness Enumeration) classifications from 1999-2025.

## Cleaning Process

The original dataset contained **280,694 samples**. We performed the following cleaning:

### 1. Removed Non-Standard Classifications

- **Removed:** 55,550 samples (19.79%) labeled as `"NVD-CWE-Other"`
- **Reason:** "NVD-CWE-Other" is a catch-all category, not a specific weakness classification

### 2. Removed Missing Values

- **Removed:** Samples with null or empty `CWE-ID` values
- **Reason:** Cannot train on samples without target labels

### 3. Validated CWE Format

- **Kept:** Only samples matching pattern `CWE-XXXX` (where XXXX is numeric)
- **Example valid:** `CWE-79`, `CWE-119`, `CWE-89`
- **Example removed:** `NVD-CWE-Other`, `null`, `""`

## Dataset Statistics

| Metric | Value |
|--------|-------|
| **Total samples** | 225,144 |
| **Unique CWE classes** | 695 |
| **Removed samples** | 55,550 (19.79% of original) |
| **Time range** | 1999-2025 |
| **Language** | English |

## Dataset Structure

```python
{
    "DESCRIPTION": str,  # CVE vulnerability description
    "CWE-ID": str,       # CWE classification (e.g., "CWE-79")
}
```

## Usage

```python
from datasets import load_dataset

# Load cleaned dataset
dataset = load_dataset("LorenzoNava/cve-cwe-dataset-cleaned")

# Example
print(dataset['train'][0])
# {
#     'DESCRIPTION': 'A buffer overflow in the web server...',
#     'CWE-ID': 'CWE-119'
# }
```

## Top 10 Most Common CWEs

(Statistics will be updated)

1. CWE-79 - Cross-site Scripting (XSS)
2. CWE-119 - Buffer Errors
3. CWE-200 - Information Exposure
4. CWE-20 - Improper Input Validation
5. CWE-89 - SQL Injection
6. CWE-264 - Permissions, Privileges, and Access Controls
7. CWE-399 - Resource Management Errors
8. CWE-287 - Improper Authentication
9. CWE-352 - Cross-Site Request Forgery (CSRF)
10. CWE-22 - Path Traversal

## Comparison with Original Dataset

| Aspect | Original | Cleaned |
|--------|----------|---------|
| **Samples** | 280,694 | 225,144 |
| **CWE classes** | 696 | 695 |
| **Includes "NVD-CWE-Other"** | ✅ Yes | ❌ No |
| **Only standard CWEs** | ❌ No | ✅ Yes |
| **Best for** | Research, analysis | Training ML models |

## Use Cases

### ✅ Recommended For:

- Training CWE classification models
- Building automated vulnerability assessment tools
- Researching CWE distribution patterns
- Developing security ML applications

### ❌ Not Recommended For:

- Analyzing "unknown" or uncategorized vulnerabilities
- Studies requiring complete historical data
- Research on NVD categorization practices

For these use cases, use the [original dataset](https://huggingface.co/datasets/stasvinokur/cve-and-cwe-dataset-1999-2025) instead.

## Data Quality

- ✅ All samples have valid CWE-ID
- ✅ All CWE-IDs follow standard format (CWE-XXXX)
- ✅ No duplicate removal (preserves all valid samples)
- ✅ No text preprocessing (original CVE descriptions preserved)

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{cve-cwe-cleaned-2024,
  author = {Berghem - Smart Information Security},
  title = {CVE-CWE Dataset (Cleaned)},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/LorenzoNava/cve-cwe-dataset-cleaned},
  note = {Cleaned version of stasvinokur/cve-and-cwe-dataset-1999-2025}
}
```

**Original dataset citation:**

```bibtex
@dataset{cve-cwe-original-2025,
  author = {Vinokur, Stas},
  title = {CVE and CWE Dataset 1999-2025},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/stasvinokur/cve-and-cwe-dataset-1999-2025}
}
```

## License

Same license as original dataset.

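The three cleaning steps described in this card (dropping `"NVD-CWE-Other"`, dropping null or empty labels, and keeping only `CWE-<number>` labels) can all be expressed as one format check. The sketch below is an illustration under stated assumptions, not the authors' actual cleaning script: the names `CWE_PATTERN`, `is_standard_cwe`, and `clean_rows` are hypothetical, and the sample rows are hand-made rather than taken from the dataset.

```python
import re

# Assumed pattern: "CWE-" followed by one or more digits (e.g. "CWE-79").
CWE_PATTERN = re.compile(r"CWE-\d+")


def is_standard_cwe(label):
    """Return True for labels such as "CWE-79"; False for None, "", or "NVD-CWE-Other"."""
    return isinstance(label, str) and CWE_PATTERN.fullmatch(label.strip()) is not None


def clean_rows(rows):
    """Keep only rows whose "CWE-ID" field is a standard CWE identifier."""
    return [row for row in rows if is_standard_cwe(row.get("CWE-ID"))]


# Hypothetical rows for illustration only.
sample = [
    {"DESCRIPTION": "Stored XSS in a comment form.", "CWE-ID": "CWE-79"},
    {"DESCRIPTION": "Unspecified vulnerability.", "CWE-ID": "NVD-CWE-Other"},
    {"DESCRIPTION": "Missing label.", "CWE-ID": None},
    {"DESCRIPTION": "Empty label.", "CWE-ID": ""},
    {"DESCRIPTION": "SQL injection in a login page.", "CWE-ID": "CWE-89"},
]

cleaned = clean_rows(sample)
print([row["CWE-ID"] for row in cleaned])
```

With the `datasets` library, the same predicate could presumably be passed to `Dataset.filter`, for example `original.filter(lambda row: is_standard_cwe(row["CWE-ID"]))`, where `original` stands for the loaded source dataset.
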
## Developed By

**Berghem - Smart Information Security**

For questions or issues, visit the [dataset repository](https://huggingface.co/datasets/LorenzoNava/cve-cwe-dataset-cleaned).
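Because the card marks its Top 10 list as pending an update, readers may want to recompute the label distribution themselves. The following is a minimal sketch using only the standard library; the helper name `top_cwes` is hypothetical and the labels are made up for illustration, so the printed counts say nothing about the real dataset.

```python
from collections import Counter


def top_cwes(labels, n=10):
    """Return the n most frequent CWE labels as (label, count) pairs."""
    return Counter(labels).most_common(n)


# Hypothetical labels for illustration; real counts come from the dataset itself.
labels = ["CWE-79", "CWE-89", "CWE-79", "CWE-119", "CWE-79", "CWE-89"]

for label, count in top_cwes(labels, n=3):
    print(label, count)
```

Assuming the dataset is loaded as in the Usage section and exposes a `train` split, the real ranking could be obtained with something like `top_cwes(dataset["train"]["CWE-ID"])`.
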
Provider:
LorenzoNava