ChamaraVishwajithRajapaksha/Code-Vulnerability-FineTune
Hugging Face · Updated 2026-04-27 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/ChamaraVishwajithRajapaksha/Code-Vulnerability-FineTune
Resource description:
---
license: mit
language:
- en
tags:
- security
- cwe
- vulnerability
- code-analysis
- software-security
- fine-tuning
- sharegpt
- cybersecurity
- llm
task_categories:
- text-generation
- question-answering
pretty_name: Code Vulnerability FineTome (CWE-Enriched Conversation Dataset)
size_categories:
- 100K<n<1M
---
# 🔐 Code Vulnerability FineTome — CWE-Enriched Conversation Dataset
<p align="center">
<img src="https://img.shields.io/badge/Format-ShareGPT%20%2F%20FineTome-blue" />
<img src="https://img.shields.io/badge/License-MIT-green" />
<img src="https://img.shields.io/badge/Language-C%20%2F%20C%2B%2B-orange" />
<img src="https://img.shields.io/badge/Task-Vulnerability%20Detection-red" />
</p>
---
## 📌 Overview
This dataset converts raw security-labeled C/C++ code samples into **instruction-following conversation pairs** suitable for fine-tuning large language models (LLMs) on **software vulnerability detection and analysis**.
It is built by preprocessing and transforming the [`ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset`](https://huggingface.co/datasets/ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset) (330k rows, sourced from DiverseVul + MITRE CWE enrichment) into the **ShareGPT / FineTome conversation format** used by [`mlabonne/FineTome-100k`](https://huggingface.co/datasets/mlabonne/FineTome-100k).
### 🎯 Use Cases
- Fine-tuning LLMs for **security code review**
- Training **vulnerability detection** models
- Building **code-aware security assistants**
- Research in **automated static analysis** and secure coding
---
## 📊 Dataset Statistics
| Property | Value |
|---|---|
| **Source Dataset** | `ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset` |
| **Original Rows** | 330,492 |
| **Rows After Cleaning** | 180,000+ (approximate) |
| **Format** | ShareGPT (conversations) |
| **Languages** | C, C++ |
| **Splits** | `train` (90%) · `test` (10%) |
| **License** | MIT |
---
## 🗂️ Data Format
Each row follows the **ShareGPT conversation format** with two turns:
```json
{
  "conversations": [
    {
      "from": "human",
      "value": "Analyze the following code snippet and identify any security vulnerabilities...\n\n```c\n<source code>\n```"
    },
    {
      "from": "gpt",
      "value": "## Security Vulnerability Analysis\n\n⚠️ This code sample is marked as **Vulnerable**.\n\n### 🔍 Vulnerability Classification\n- **CWE ID**: CWE-787\n- **Type**: Out-of-bounds Write\n- **Severity**: High\n..."
    }
  ],
  "source": "code_vulnerability_cwe",
  "score": 4.8
}
```
### Fields
| Field | Type | Description |
|---|---|---|
| `conversations` | `list` | List of 2 conversation turns |
| `conversations[0].from` | `str` | Always `"human"` |
| `conversations[0].value` | `str` | Instruction + C/C++ code block (from `func`) |
| `conversations[1].from` | `str` | Always `"gpt"` |
| `conversations[1].value` | `str` | Structured vulnerability analysis (from `cwe_details`) |
| `source` | `str` | Always `"code_vulnerability_cwe"` |
| `score` | `float` | Quality score (`4.8`) |
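
A small validator can confirm that a row matches the field table above. This is a minimal sketch: the function name `validate_row` is hypothetical, and it checks only what the table states.

```python
def validate_row(row):
    """Return a list of problems found in one dataset row (empty if it looks valid)."""
    problems = []
    turns = row.get("conversations")
    if not isinstance(turns, list) or len(turns) != 2:
        return ["conversations must be a list of 2 turns"]
    # The table says the first turn is always "human" and the second always "gpt".
    for turn, expected in zip(turns, ("human", "gpt")):
        if turn.get("from") != expected:
            problems.append(f"expected turn from {expected!r}, got {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str) or not turn["value"]:
            problems.append(f"{expected} turn has an empty or non-string value")
    if row.get("source") != "code_vulnerability_cwe":
        problems.append("unexpected source")
    if not isinstance(row.get("score"), float):
        problems.append("score must be a float")
    return problems


example = {
    "conversations": [
        {"from": "human", "value": "Analyze the following code snippet..."},
        {"from": "gpt", "value": "## Security Vulnerability Analysis"},
    ],
    "source": "code_vulnerability_cwe",
    "score": 4.8,
}
print(validate_row(example))  # prints []
```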
---
## 🔄 Preprocessing Pipeline
The raw dataset was transformed in the following steps:
### Step 1 — Load
Download the source dataset from Hugging Face Hub (330k rows, Parquet format).
### Step 2 — Filter
| Filter | Condition |
|---|---|
| `func` must exist | Non-null, length > 10 characters |
| `cwe_details` must be valid | Non-null, parseable as JSON |
| No duplicates | Drop duplicate `func + cwe_details` pairs |
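
The three filters can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the author's actual script: it assumes `cwe_details` is stored as a JSON string, the helper names are hypothetical, and the `cwe_id` key in the sample rows is made up for the demo.

```python
import json


def is_valid_json(text):
    """True if `text` is a string that parses as JSON."""
    if not isinstance(text, str):
        return False
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def filter_rows(rows):
    """Apply the three filters from the table, keeping the first copy of each duplicate."""
    seen = set()
    kept = []
    for row in rows:
        func, cwe = row.get("func"), row.get("cwe_details")
        if not isinstance(func, str) or len(func) <= 10:
            continue  # missing or too-short source code
        if not is_valid_json(cwe):
            continue  # missing or malformed CWE details
        key = (func, cwe)
        if key in seen:
            continue  # duplicate func + cwe_details pair
        seen.add(key)
        kept.append(row)
    return kept


# Sample rows; the "cwe_id" key is hypothetical.
rows = [
    {"func": "int main(void) { return 0; }", "cwe_details": '{"cwe_id": "CWE-787"}'},
    {"func": "int main(void) { return 0; }", "cwe_details": '{"cwe_id": "CWE-787"}'},
    {"func": "x;", "cwe_details": '{"cwe_id": "CWE-787"}'},
    {"func": "void f(char *p) { free(p); }", "cwe_details": "not json"},
]
print(len(filter_rows(rows)))  # prints 1
```

Only the first row survives: the second is a duplicate, the third is too short, and the fourth has unparseable `cwe_details`.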
### Step 3 — Transform Human Turn
The `func` (source code column) is wrapped in an instruction prompt:
````
Analyze the following code snippet and identify any security vulnerabilities.
Provide a detailed explanation of the vulnerability type, its severity, potential impact,
and the CWE classification.
```c
<source code here>
```
````
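
The wrapping step might look like the sketch below. The function name `build_human_turn` is hypothetical, and the exact whitespace between the instruction and the code fence is an assumption based on the JSON example in the Data Format section.

```python
INSTRUCTION = (
    "Analyze the following code snippet and identify any security vulnerabilities.\n"
    "Provide a detailed explanation of the vulnerability type, its severity, "
    "potential impact,\nand the CWE classification."
)
# Built indirectly so this example can itself sit inside a fenced block.
FENCE = "`" * 3


def build_human_turn(func):
    """Wrap raw source code in the instruction prompt and a C code fence."""
    return {
        "from": "human",
        "value": f"{INSTRUCTION}\n\n{FENCE}c\n{func}\n{FENCE}",
    }


turn = build_human_turn("int main(void) { return 0; }")
print(turn["value"])
```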
### Step 4 — Transform Assistant Turn
The `cwe_details` JSON is rendered into structured Markdown including:
- CWE ID and vulnerability type
- Severity and category
- Affected programming languages
- Potential impact (from MITRE CWE database)
- Security recommendation
- Whether the sample is `Vulnerable` or `Safe` (patched)
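
A rough sketch of the rendering step is shown below. The schema of `cwe_details` is not documented here, so every JSON key used (`cwe_id`, `name`, `severity`, `category`, `languages`, `impact`) is a hypothetical stand-in, as is the `is_vulnerable` flag. Emoji, the description, and the recommendation section are left out for brevity.

```python
import json


def render_assistant_turn(cwe_details_json, is_vulnerable):
    """Render CWE details (a JSON string) into a Markdown analysis."""
    d = json.loads(cwe_details_json)  # all keys read below are hypothetical
    label = "Vulnerable" if is_vulnerable else "Safe"
    lines = [
        "## Security Vulnerability Analysis",
        "",
        f"This code sample is marked as **{label}**.",
        "",
        "### Vulnerability Classification",
        f"- **CWE ID**: {d.get('cwe_id', 'Unknown')}",
        f"- **Type**: {d.get('name', 'Unknown')}",
        f"- **Severity**: {d.get('severity', 'Unknown')}",
        f"- **Category**: {d.get('category', 'Unknown')}",
        f"- **Affected Languages**: {', '.join(d.get('languages', [])) or 'Unknown'}",
        "",
        "### Potential Impact",
    ]
    lines += [f"- {item}" for item in d.get("impact", [])]
    return {"from": "gpt", "value": "\n".join(lines)}


sample_details = json.dumps({
    "cwe_id": "CWE-416",
    "name": "Use After Free",
    "severity": "High",
    "category": "Memory Corruption",
    "languages": ["C", "C++"],
    "impact": ["Read Memory", "Modify Memory"],
})
print(render_assistant_turn(sample_details, is_vulnerable=True)["value"])
```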
### Step 5 — Split & Push
- 90% / 10% train-test split (random seed 42)
- Pushed to Hugging Face Hub in Parquet format
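
For illustration, a seeded 90/10 split can be written with the standard library alone, as sketched below. With 🤗 Datasets the equivalent call would presumably be `Dataset.train_test_split(test_size=0.1, seed=42)`. This stand-in uses a different shuffling routine, so it will not reproduce the published row assignment.

```python
import random


def train_test_split(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of `rows` deterministically and split off the test fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


splits = train_test_split(range(1000))
print(len(splits["train"]), len(splits["test"]))  # prints 900 100
```

Fixing the seed makes the split repeatable, so evaluation numbers stay comparable across runs.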
---
## 📋 CWE Categories Covered
The dataset covers a wide range of Common Weakness Enumeration types including:
| CWE ID | Vulnerability Type |
|---|---|
| CWE-787 | Out-of-bounds Write |
| CWE-416 | Use After Free |
| CWE-125 | Out-of-bounds Read |
| CWE-476 | NULL Pointer Dereference |
| CWE-119 | Improper Buffer Restriction |
| CWE-284 | Improper Access Control |
| CWE-20 | Improper Input Validation |
| CWE-78 | OS Command Injection |
| CWE-327 | Broken Cryptographic Algorithm |
| CWE-835 | Infinite Loop |
| CWE-122 | Heap-based Buffer Overflow |
| CWE-362 | Race Condition |
| CWE-22 | Path Traversal |
| CWE-287 | Improper Authentication |
| CWE-319 | Cleartext Transmission |
| ... | 30+ more CWE types |
---
## 🧪 Sample Example
### Human Turn
````
Analyze the following code snippet and identify any security vulnerabilities.
Provide a detailed explanation of the vulnerability type, its severity, potential impact,
and the CWE classification.
```c
static boolean ReadICCProfile(j_decompress_ptr jpeg_info) {
  char magick[12];
  ErrorManager *error_manager;
  ExceptionInfo *exception;
  Image *image;
  ...
}
```
````
### Assistant Turn
```markdown
## Security Vulnerability Analysis
⚠️ This code sample is marked as **Vulnerable**.
### 🔍 Vulnerability Classification
- **CWE ID**: CWE-416
- **Type**: Use After Free
- **Severity**: High
- **Category**: Memory Corruption
- **Affected Languages**: C, C++
### 📋 Description
The product reuses or references memory after it has been freed. At some point
afterward, the memory may be allocated again and saved in another pointer, while
the original pointer references a location somewhere within the new allocation.
### 💥 Potential Impact
- Execute Unauthorized Code or Commands
- Read Memory
- Modify Memory
- DoS: Crash, Exit, or Restart
### 🛡️ Recommendation
Review the code for Use After Free patterns. Ensure proper bounds checking,
input validation, and memory management practices are applied as recommended
by the CWE guidelines for CWE-416.
```
---
## 🚀 Usage
### Load with 🤗 Datasets
```python
from datasets import load_dataset
dataset = load_dataset("ChamaraVishwajithRajapaksha/Code-Vulnerability-FineTune")
print(dataset)
# DatasetDict({
# train: Dataset({features: ['conversations', 'source', 'score'], num_rows: ...}),
# test: Dataset({features: ['conversations', 'source', 'score'], num_rows: ...})
# })
```
### Access a Sample
```python
sample = dataset['train'][0]
# Print the human question (code to analyze)
print(sample['conversations'][0]['value'])
# Print the assistant answer (vulnerability analysis)
print(sample['conversations'][1]['value'])
```
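
For classification-style evaluation it can help to pull the label and CWE ID back out of the assistant turn. The sketch below relies on the Markdown layout shown in the sample above. The helper name `parse_answer` is hypothetical, and the wording "marked as **Safe**" for patched samples is an assumption.

```python
import re

CWE_ID_RE = re.compile(r"\*\*CWE ID\*\*:\s*(CWE-\d+)")
LABEL_RE = re.compile(r"marked as \*\*(Vulnerable|Safe)\*\*")


def parse_answer(answer):
    """Extract the CWE ID and Vulnerable/Safe label from an assistant turn."""
    cwe = CWE_ID_RE.search(answer)
    label = LABEL_RE.search(answer)
    return {
        "cwe_id": cwe.group(1) if cwe else None,
        "label": label.group(1) if label else None,
    }


answer = (
    "## Security Vulnerability Analysis\n\n"
    "⚠️ This code sample is marked as **Vulnerable**.\n\n"
    "### 🔍 Vulnerability Classification\n"
    "- **CWE ID**: CWE-416\n"
    "- **Type**: Use After Free\n"
)
print(parse_answer(answer))  # prints {'cwe_id': 'CWE-416', 'label': 'Vulnerable'}
```

On a loaded row, the same helper would be called as `parse_answer(sample['conversations'][1]['value'])`.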
### Fine-tuning with Unsloth / TRL
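```python
from unsloth import FastLanguageModel  # import Unsloth before TRL so its patches apply
from trl import SFTTrainer

# `model` and `tokenizer` are assumed to be loaded already, for example with
# FastLanguageModel.from_pretrained(...); the tokenizer needs a chat template.
# ShareGPT turns use "from"/"value" keys, so map them to the "role"/"content"
# keys that chat templates expect and render each conversation to one string.
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def to_text(example):
    messages = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # named `processing_class` in newer TRL releases
    train_dataset=dataset["train"].map(to_text),
    dataset_text_field="text",  # set via SFTConfig in newer TRL releases
)
```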
---
## 📁 Source Dataset
This dataset is derived from:
- **[`ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset`](https://huggingface.co/datasets/ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset)**
- Originally built from [`bstee615/diversevul`](https://huggingface.co/datasets/bstee615/diversevul)
- CWE details enriched using the [MITRE CWE API](https://cwe.mitre.org/)
- **Format reference: [`mlabonne/FineTome-100k`](https://huggingface.co/datasets/mlabonne/FineTome-100k)**
- ShareGPT conversation structure used as the target format
---
## ⚠️ Limitations
- Code samples are primarily in **C and C++** — limited coverage of other languages
- Some rows were dropped due to **missing or malformed `cwe_details`**
- The `Safe` samples represent **patched/fixed** versions, not inherently safe code — context matters
- CWE details describe the **class of vulnerability**, not a precise analysis of each individual function
- This dataset is intended for **research and educational purposes**
---
## 📜 License
This dataset is released under the **MIT License**, consistent with the source dataset license.
---
## 🙏 Citation
If you use this dataset in your research, please cite the original source:
```bibtex
@dataset{code_vulnerability_finetome,
title = {Code Vulnerability FineTome: CWE-Enriched Conversation Dataset},
author = {Derived from ChamaraVishwajithRajapaksha/Code\_Vulnerability\_Dataset},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ChamaraVishwajithRajapaksha/Code-Vulnerability-FineTune},
note = {Preprocessed into ShareGPT conversation format for LLM fine-tuning}
}
```
---
## 🔗 Related Resources
- [MITRE CWE Database](https://cwe.mitre.org/)
- [DiverseVul Paper](https://arxiv.org/abs/2304.00409)
- [FineTome-100k Format Reference](https://huggingface.co/datasets/mlabonne/FineTome-100k)
- [Unsloth Fine-tuning](https://github.com/unslothai/unsloth)
- [TRL SFTTrainer](https://huggingface.co/docs/trl/sft_trainer)
Provider:
ChamaraVishwajithRajapaksha



