LorenzoNava/cve-cwe-dataset-cleaned
Source: Hugging Face · Updated: 2025-11-20 · Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/LorenzoNava/cve-cwe-dataset-cleaned
Resource summary:
# CVE-CWE Dataset (Cleaned)
A cleaned version of the CVE-CWE dataset that keeps only samples labeled with a standard CWE identifier.
## Dataset Source
**Original Dataset:** [stasvinokur/cve-and-cwe-dataset-1999-2025](https://huggingface.co/datasets/stasvinokur/cve-and-cwe-dataset-1999-2025)
The original dataset pairs CVE (Common Vulnerabilities and Exposures) descriptions with their corresponding CWE (Common Weakness Enumeration) classifications, covering 1999 to 2025.
## Cleaning Process
The original dataset contained **280,694 samples**. We applied the following cleaning steps, which left 225,144 samples:
### 1. Removed Non-Standard Classifications
- **Removed:** 55,550 samples (19.79%) labeled as `"NVD-CWE-Other"`
- **Reason:** "NVD-CWE-Other" is a catch-all category, not a specific weakness classification
### 2. Removed Missing Values
- **Removed:** Samples with null or empty `CWE-ID` values
- **Reason:** Cannot train on samples without target labels
### 3. Validated CWE Format
- **Kept:** Only samples whose `CWE-ID` matches the pattern `CWE-XXXX` (where `XXXX` is a numeric ID of any length)
- **Example valid:** `CWE-79`, `CWE-119`, `CWE-89`
- **Example removed:** `NVD-CWE-Other`, `null`, `""`
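The three steps above can be sketched as a single filter. This is a minimal illustration, not the maintainers' actual cleaning script: the helper names `is_standard_cwe` and `clean` and the regular expression are assumptions derived from the pattern and examples listed above, and the sample rows are hypothetical.

```python
import re

# Assumed pattern: the prefix "CWE-" followed by one or more digits.
CWE_PATTERN = re.compile(r"CWE-\d+")


def is_standard_cwe(label):
    """Return True for labels such as "CWE-79"; reject None, "", and "NVD-CWE-Other"."""
    return isinstance(label, str) and CWE_PATTERN.fullmatch(label) is not None


def clean(rows):
    """Keep only rows whose "CWE-ID" field is a standard CWE identifier."""
    return [row for row in rows if is_standard_cwe(row.get("CWE-ID"))]


# Hypothetical rows, for illustration only.
sample = [
    {"DESCRIPTION": "Example A", "CWE-ID": "CWE-79"},
    {"DESCRIPTION": "Example B", "CWE-ID": "NVD-CWE-Other"},
    {"DESCRIPTION": "Example C", "CWE-ID": None},
    {"DESCRIPTION": "Example D", "CWE-ID": ""},
]
cleaned = clean(sample)
print(len(cleaned))
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`, for example `ds.filter(lambda row: is_standard_cwe(row["CWE-ID"]))`.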
## Dataset Statistics
| Metric | Value |
|--------|-------|
| **Total samples** | 225,144 |
| **Unique CWE classes** | 695 |
| **Removed samples** | 55,550 (19.79% of original) |
| **Time range** | 1999-2025 |
| **Language** | English |
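The figures in this table can be cross-checked with a few lines of arithmetic. The snippet uses only the counts reported in this card. The total equals the original count minus the `NVD-CWE-Other` samples exactly, which suggests the other cleaning steps removed no additional samples.

```python
original = 280_694  # samples in the original dataset
removed = 55_550    # samples labeled "NVD-CWE-Other"

remaining = original - removed
removed_pct = round(100 * removed / original, 2)

print(remaining)    # 225144
print(removed_pct)  # 19.79
```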
## Dataset Structure
```python
{
    "DESCRIPTION": str,  # CVE vulnerability description
    "CWE-ID": str,       # CWE classification (e.g., "CWE-79")
}
```
## Usage
```python
from datasets import load_dataset

# Load the cleaned dataset
dataset = load_dataset("LorenzoNava/cve-cwe-dataset-cleaned")

# Print the first training example
print(dataset["train"][0])
# {
#     'DESCRIPTION': 'A buffer overflow in the web server...',
#     'CWE-ID': 'CWE-119'
# }
```
```
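For classification training, the string labels usually need to be mapped to integer ids. The sketch below shows one possible way to build that mapping. The helper name `build_label_maps` and the numeric sort order are illustrative choices, not part of the dataset, and the example labels are hypothetical.

```python
def build_label_maps(labels):
    """Map each distinct CWE-ID string to an integer id, and back."""
    # Sort by the numeric part so that "CWE-79" comes before "CWE-119".
    classes = sorted(set(labels), key=lambda s: int(s.split("-")[1]))
    label2id = {label: i for i, label in enumerate(classes)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


# Hypothetical labels, for illustration only.
labels = ["CWE-79", "CWE-119", "CWE-79", "CWE-89"]
label2id, id2label = build_label_maps(labels)
print(label2id)  # {'CWE-79': 0, 'CWE-89': 1, 'CWE-119': 2}
```

On the real data, the `CWE-ID` column of the `train` split would be passed in place of the hypothetical list.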
## Top 10 Most Common CWEs
(Provisional ranking; exact counts will be added in a future update.)
1. CWE-79 - Cross-site Scripting (XSS)
2. CWE-119 - Buffer Errors
3. CWE-200 - Information Exposure
4. CWE-20 - Improper Input Validation
5. CWE-89 - SQL Injection
6. CWE-264 - Permissions, Privileges, and Access Controls
7. CWE-399 - Resource Management Errors
8. CWE-287 - Improper Authentication
9. CWE-352 - Cross-Site Request Forgery (CSRF)
10. CWE-22 - Path Traversal
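Until counts are published, a ranking like the one above can be recomputed directly from the data. The following is a minimal sketch using `collections.Counter`. The helper name `top_cwes` is hypothetical, and the rows shown are made-up stand-ins for dataset records.

```python
from collections import Counter


def top_cwes(rows, n=10):
    """Return the n most frequent CWE-IDs as (label, count) pairs."""
    counts = Counter(row["CWE-ID"] for row in rows)
    return counts.most_common(n)


# Hypothetical rows, for illustration only.
rows = [
    {"CWE-ID": "CWE-79"},
    {"CWE-ID": "CWE-89"},
    {"CWE-ID": "CWE-79"},
]
print(top_cwes(rows, n=2))  # [('CWE-79', 2), ('CWE-89', 1)]
```

Because iterating over a Hugging Face `Dataset` yields one dict per row, the function should also accept the loaded `train` split as `rows`.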
## Comparison with Original Dataset
| Aspect | Original | Cleaned |
|--------|----------|---------|
| **Samples** | 280,694 | 225,144 |
| **CWE classes** | 696 | 695 |
| **Includes "NVD-CWE-Other"** | ✅ Yes | ❌ No |
| **Only standard CWEs** | ❌ No | ✅ Yes |
| **Best for** | Research, analysis | Training ML models |
## Use Cases
### ✅ Recommended For:
- Training CWE classification models
- Building automated vulnerability assessment tools
- Researching CWE distribution patterns
- Developing security ML applications
### ❌ Not Recommended For:
- Analyzing "unknown" or uncategorized vulnerabilities
- Studies requiring complete historical data
- Research on NVD categorization practices
For these use cases, use the [original dataset](https://huggingface.co/datasets/stasvinokur/cve-and-cwe-dataset-1999-2025) instead.
## Data Quality
- ✅ All samples have a valid `CWE-ID`
- ✅ All CWE-IDs follow the standard format (CWE-XXXX)
- ✅ Duplicates are not removed, so all valid samples are preserved
- ✅ No text preprocessing is applied, so the original CVE descriptions are preserved
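Since duplicates are kept, users who split the data randomly may want to drop exact duplicates first so that the same record cannot land in both the training and evaluation sets. The card does not state whether duplicates actually occur. The sketch below is one optional approach, with a hypothetical helper name `drop_exact_duplicates` and made-up rows.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each (DESCRIPTION, CWE-ID) pair, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["DESCRIPTION"], row["CWE-ID"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows, for illustration only.
rows = [
    {"DESCRIPTION": "Example A", "CWE-ID": "CWE-79"},
    {"DESCRIPTION": "Example A", "CWE-ID": "CWE-79"},
    {"DESCRIPTION": "Example A", "CWE-ID": "CWE-89"},
]
deduped = drop_exact_duplicates(rows)
print(len(deduped))  # 2
```

Rows that share a description but carry different labels are kept here, because the key includes both fields.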
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{cve-cwe-cleaned-2024,
author = {Berghem - Smart Information Security},
title = {CVE-CWE Dataset (Cleaned)},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/LorenzoNava/cve-cwe-dataset-cleaned},
note = {Cleaned version of stasvinokur/cve-and-cwe-dataset-1999-2025}
}
```
**Original dataset citation:**
```bibtex
@dataset{cve-cwe-original-2025,
author = {Vinokur, Stas},
title = {CVE and CWE Dataset 1999-2025},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/stasvinokur/cve-and-cwe-dataset-1999-2025}
}
```
## License
Same license as the original dataset.
## Developed By
**Berghem - Smart Information Security**
For questions or issues, visit the [dataset repository](https://huggingface.co/datasets/LorenzoNava/cve-cwe-dataset-cleaned).
Provider:
LorenzoNava



