
ChamaraVishwajithRajapaksha/Code-Vulnerability-FineTune

Hugging Face · Updated 2026-04-27 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/ChamaraVishwajithRajapaksha/Code-Vulnerability-FineTune
Resource description:
---
license: mit
language:
- en
tags:
- security
- cwe
- vulnerability
- code-analysis
- software-security
- fine-tuning
- sharegpt
- cybersecurity
- llm
task_categories:
- text-generation
- question-answering
pretty_name: Code Vulnerability FineTome (CWE-Enriched Conversation Dataset)
size_categories:
- 100K<n<1M
---

# 🔐 Code Vulnerability FineTome: CWE-Enriched Conversation Dataset

<p align="center">
  <img src="https://img.shields.io/badge/Format-ShareGPT%20%2F%20FineTome-blue" />
  <img src="https://img.shields.io/badge/License-MIT-green" />
  <img src="https://img.shields.io/badge/Language-C%20%2F%20C%2B%2B-orange" />
  <img src="https://img.shields.io/badge/Task-Vulnerability%20Detection-red" />
</p>

---

## 📌 Overview

This dataset converts raw security-labeled C/C++ code samples into **instruction-following conversation pairs** suitable for fine-tuning large language models (LLMs) on **software vulnerability detection and analysis**.

It is built by preprocessing and transforming the [`ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset`](https://huggingface.co/datasets/ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset) (330k rows, sourced from DiverseVul + MITRE CWE enrichment) into the **ShareGPT / FineTome conversation format** used by [`mlabonne/FineTome-100k`](https://huggingface.co/datasets/mlabonne/FineTome-100k).
### 🎯 Use Cases

- Fine-tuning LLMs for **security code review**
- Training **vulnerability detection** models
- Building **code-aware security assistants**
- Research in **automated static analysis** and secure coding

---

## 📊 Dataset Statistics

| Property | Value |
|---|---|
| **Source Dataset** | `ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset` |
| **Original Rows** | 330,492 |
| **Rows After Cleaning** | ~180,000+ |
| **Format** | ShareGPT (conversations) |
| **Languages** | C, C++ |
| **Splits** | `train` (90%) · `test` (10%) |
| **License** | MIT |

---

## 🗂️ Data Format

Each row follows the **ShareGPT conversation format** with two turns:

```json
{
  "conversations": [
    {
      "from": "human",
      "value": "Analyze the following code snippet and identify any security vulnerabilities...\n\n```c\n<source code>\n```"
    },
    {
      "from": "gpt",
      "value": "## Security Vulnerability Analysis\n\n⚠️ This code sample is marked as **Vulnerable**.\n\n### 🔍 Vulnerability Classification\n- **CWE ID**: CWE-787\n- **Type**: Out-of-bounds Write\n- **Severity**: High\n..."
    }
  ],
  "source": "code_vulnerability_cwe",
  "score": 4.8
}
```

### Fields

| Field | Type | Description |
|---|---|---|
| `conversations` | `list` | List of 2 conversation turns |
| `conversations[0].from` | `str` | Always `"human"` |
| `conversations[0].value` | `str` | Instruction + C/C++ code block (from `func`) |
| `conversations[1].from` | `str` | Always `"gpt"` |
| `conversations[1].value` | `str` | Structured vulnerability analysis (from `cwe_details`) |
| `source` | `str` | Always `"code_vulnerability_cwe"` |
| `score` | `float` | Quality score (`4.8`) |

---

## 🔄 Preprocessing Pipeline

The raw dataset was transformed in the following steps:

### Step 1: Load

Download the source dataset from the Hugging Face Hub (330k rows, Parquet format).
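The two-turn record layout described under Data Format lends itself to a mechanical check on the converted rows. The sketch below is only an illustration: it assumes a row is a plain Python dict, and `validate_row` and `EXPECTED_ROLES` are hypothetical names that are not part of the dataset tooling.

```python
from typing import Any

# Turn order taken from the Fields table: human first, then gpt.
EXPECTED_ROLES = ("human", "gpt")


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return the problems found in one row; an empty list means it looks valid."""
    turns = row.get("conversations")
    if not isinstance(turns, list) or len(turns) != len(EXPECTED_ROLES):
        return ["conversations must be a list of exactly two turns"]

    problems: list[str] = []
    for index, (turn, role) in enumerate(zip(turns, EXPECTED_ROLES)):
        if not isinstance(turn, dict):
            problems.append(f"conversations[{index}] should be a mapping")
            continue
        if turn.get("from") != role:
            problems.append(f"conversations[{index}].from should be {role!r}")
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"conversations[{index}].value should be a non-empty string")

    if not isinstance(row.get("source"), str):
        problems.append("source should be a string")
    if not isinstance(row.get("score"), (int, float)):
        problems.append("score should be a number")
    return problems


# A shortened row in the shape shown in the Data Format section.
example_row = {
    "conversations": [
        {"from": "human", "value": "Analyze the following code snippet..."},
        {"from": "gpt", "value": "## Security Vulnerability Analysis\n..."},
    ],
    "source": "code_vulnerability_cwe",
    "score": 4.8,
}

print(validate_row(example_row))
```

Returning a list of messages rather than raising on the first problem makes it easier to summarize issues across many rows.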
### Step 2: Filter

| Filter | Condition |
|---|---|
| `func` must exist | Non-null, length > 10 characters |
| `cwe_details` must be valid | Non-null, parseable as JSON |
| No duplicates | Drop duplicate `func + cwe_details` pairs |

### Step 3: Transform Human Turn

The `func` (source code column) is wrapped in an instruction prompt:

````
Analyze the following code snippet and identify any security vulnerabilities.
Provide a detailed explanation of the vulnerability type, its severity,
potential impact, and the CWE classification.

```c
<source code here>
```
````

### Step 4: Transform Assistant Turn

The `cwe_details` JSON is rendered into structured Markdown including:

- CWE ID and vulnerability type
- Severity and category
- Affected programming languages
- Potential impact (from MITRE CWE database)
- Security recommendation
- Whether the sample is `Vulnerable` or `Safe` (patched)

### Step 5: Split & Push

- 90% / 10% train-test split (random seed 42)
- Pushed to Hugging Face Hub in Parquet format

---

## 📋 CWE Categories Covered

The dataset covers a wide range of Common Weakness Enumeration types including:

| CWE ID | Vulnerability Type |
|---|---|
| CWE-787 | Out-of-bounds Write |
| CWE-416 | Use After Free |
| CWE-125 | Out-of-bounds Read |
| CWE-476 | NULL Pointer Dereference |
| CWE-119 | Improper Buffer Restriction |
| CWE-284 | Improper Access Control |
| CWE-20 | Improper Input Validation |
| CWE-78 | OS Command Injection |
| CWE-327 | Broken Cryptographic Algorithm |
| CWE-835 | Infinite Loop |
| CWE-122 | Heap-based Buffer Overflow |
| CWE-362 | Race Condition |
| CWE-22 | Path Traversal |
| CWE-287 | Improper Authentication |
| CWE-319 | Cleartext Transmission |
| ... | 30+ more CWE types |

---

## 🧪 Sample Example

### Human Turn

````
Analyze the following code snippet and identify any security vulnerabilities.
Provide a detailed explanation of the vulnerability type, its severity,
potential impact, and the CWE classification.

```c
static boolean ReadICCProfile(j_decompress_ptr jpeg_info)
{
  char magick[12];
  ErrorManager *error_manager;
  ExceptionInfo *exception;
  Image *image;
  ...
}
```
````

### Assistant Turn

```markdown
## Security Vulnerability Analysis

⚠️ This code sample is marked as **Vulnerable**.

### 🔍 Vulnerability Classification
- **CWE ID**: CWE-416
- **Type**: Use After Free
- **Severity**: High
- **Category**: Memory Corruption
- **Affected Languages**: C, C++

### 📋 Description
The product reuses or references memory after it has been freed. At some point afterward, the memory may be allocated again and saved in another pointer, while the original pointer references a location somewhere within the new allocation.

### 💥 Potential Impact
- Execute Unauthorized Code or Commands
- Read Memory
- Modify Memory
- DoS: Crash, Exit, or Restart

### 🛡️ Recommendation
Review the code for Use After Free patterns. Ensure proper bounds checking, input validation, and memory management practices are applied as recommended by the CWE guidelines for CWE-416.
```

---

## 🚀 Usage

### Load with 🤗 Datasets

```python
from datasets import load_dataset

dataset = load_dataset("ChamaraVishwajithRajapaksha/Code-Vulnerability-FineTune")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['conversations', 'source', 'score'], num_rows: ...}),
#     test: Dataset({features: ['conversations', 'source', 'score'], num_rows: ...})
# })
```

### Access a Sample

```python
sample = dataset['train'][0]

# Print the human question (code to analyze)
print(sample['conversations'][0]['value'])

# Print the assistant answer (vulnerability analysis)
print(sample['conversations'][1]['value'])
```

### Fine-tuning with Unsloth / TRL

```python
# Unsloth recommends importing it before trl so that its patches are applied.
from unsloth import FastLanguageModel
from trl import SFTTrainer

# `model` and `tokenizer` are assumed to be loaded already,
# for example with FastLanguageModel.from_pretrained(...).

# The dataset is already in ShareGPT format, which most fine-tuning
# frameworks that support conversation datasets can consume.
# Note that `conversations` is a list of turns, not a plain string, so many
# setups first render it to a text column with the tokenizer's chat template.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset['train'],
    dataset_text_field="conversations",  # adjust per framework
    # ... remaining trainer arguments
)
```

---

## 📁 Source Dataset

This dataset is derived from:

- **[`ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset`](https://huggingface.co/datasets/ChamaraVishwajithRajapaksha/Code_Vulnerability_Dataset)**
  - Originally built from [`bstee615/diversevul`](https://huggingface.co/datasets/bstee615/diversevul)
  - CWE details enriched using the [MITRE CWE API](https://cwe.mitre.org/)
- **Format reference: [`mlabonne/FineTome-100k`](https://huggingface.co/datasets/mlabonne/FineTome-100k)**
  - ShareGPT conversation structure used as the target format

---

## ⚠️ Limitations

- Code samples are primarily in **C and C++**; coverage of other languages is limited
- Some rows were dropped due to **missing or malformed `cwe_details`**
- The `Safe` samples represent **patched/fixed** versions, not inherently safe code; context matters
- CWE details describe the **class of vulnerability**, not a precise analysis of each individual function
- This dataset is intended for **research and educational purposes**

---

## 📜 License

This dataset is released under the **MIT License**, consistent with the source dataset license.
---

## 🙏 Citation

If you use this dataset in your research, please cite the original source:

```bibtex
@dataset{code_vulnerability_finetome,
  title     = {Code Vulnerability FineTome: CWE-Enriched Conversation Dataset},
  author    = {Derived from ChamaraVishwajithRajapaksha/Code\_Vulnerability\_Dataset},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ChamaraVishwajithRajapaksha/Code-Vulnerability-FineTune},
  note      = {Preprocessed into ShareGPT conversation format for LLM fine-tuning}
}
```

---

## 🔗 Related Resources

- [MITRE CWE Database](https://cwe.mitre.org/)
- [DiverseVul Paper](https://arxiv.org/abs/2304.00409)
- [FineTome-100k Format Reference](https://huggingface.co/datasets/mlabonne/FineTome-100k)
- [Unsloth Fine-tuning](https://github.com/unslothai/unsloth)
- [TRL SFTTrainer](https://huggingface.co/docs/trl/sft_trainer)
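
As a closing illustration, Steps 2 to 5 of the preprocessing pipeline described earlier can be sketched with the standard library alone. This is not the author's script. It assumes `cwe_details` is stored as a JSON string that decodes to an object, the keys `cwe_id` and `name` are hypothetical because the card does not show the real schema, and the shuffle uses `random.Random(42)`, which will not reproduce the partition produced by the original tooling.

```python
import json
import random

FENCE = "`" * 3  # built at runtime so this example contains no nested code fence

INSTRUCTION = (
    "Analyze the following code snippet and identify any security vulnerabilities. "
    "Provide a detailed explanation of the vulnerability type, its severity, "
    "potential impact, and the CWE classification."
)


def keep_row(row):
    """Step 2: require a `func` longer than 10 characters and JSON-parseable `cwe_details`."""
    func, details = row.get("func"), row.get("cwe_details")
    if not isinstance(func, str) or len(func) <= 10:
        return False
    if not isinstance(details, str):
        return False
    try:
        # Assumption: valid details decode to a JSON object.
        return isinstance(json.loads(details), dict)
    except json.JSONDecodeError:
        return False


def deduplicate(rows):
    """Step 2: drop repeated (func, cwe_details) pairs, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = (row["func"], row["cwe_details"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def to_conversation(row):
    """Steps 3 and 4: wrap the code in the prompt and render a short analysis."""
    details = json.loads(row["cwe_details"])
    human = f"{INSTRUCTION}\n\n{FENCE}c\n{row['func']}\n{FENCE}"
    # Hypothetical keys; the real rendering covers more fields (severity, impact, ...).
    assistant = (
        "## Security Vulnerability Analysis\n\n"
        f"- **CWE ID**: {details.get('cwe_id', 'unknown')}\n"
        f"- **Type**: {details.get('name', 'unknown')}\n"
    )
    return {
        "conversations": [
            {"from": "human", "value": human},
            {"from": "gpt", "value": assistant},
        ],
        "source": "code_vulnerability_cwe",
        "score": 4.8,
    }


def split_rows(rows, test_fraction=0.1, seed=42):
    """Step 5: shuffle with a fixed seed and hold out `test_fraction` of the rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


raw_rows = [
    {"func": "int main(void) { return 0; }",
     "cwe_details": '{"cwe_id": "CWE-787", "name": "Out-of-bounds Write"}'},
    {"func": "int main(void) { return 0; }",
     "cwe_details": '{"cwe_id": "CWE-787", "name": "Out-of-bounds Write"}'},  # duplicate
    {"func": "x;", "cwe_details": "{}"},  # too short
    {"func": "void f(char *p) { free(p); free(p); }", "cwe_details": "not json"},
]

cleaned = deduplicate([row for row in raw_rows if keep_row(row)])
records = [to_conversation(row) for row in cleaned]
train, test = split_rows(records)
print(len(cleaned), len(train), len(test))
```

Filtering before deduplication keeps the `seen` set small, and seeding a private `random.Random` instance avoids touching the global random state.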
Provider:
ChamaraVishwajithRajapaksha