Name: ukcli/All-CVE-Records-Training-Dataset
Creator: ukcli
Published: 2026-04-13 03:37:14
License: No description provided

Download link:

https://hf-mirror.com/datasets/ukcli/All-CVE-Records-Training-Dataset

Resource description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- cybersecurity
- cve
- vulnerability
size_categories:
- 100K<n<1M
---

# CVE Chat-Style Multi-Turn Cybersecurity Dataset (1999 – 2025)

## 1. Project Overview

This repository hosts the **largest publicly available chat-style, multi-turn cybersecurity dataset to date**, containing **≈ 300 000 Common Vulnerabilities and Exposures (CVE) records** published between **1999 and 2025**. Each record has been parsed, enriched, and converted into a conversational format suited to training and evaluating AI and AI-Agent systems focused on vulnerability analysis, threat intelligence, and cyber-defense automation.

## 2. Key Highlights

| Feature             | Description                                                                             |
| ------------------- | --------------------------------------------------------------------------------------- |
| Records             | ~300 k CVE entries (1999-2025)                                                          |
| Formats Covered     | CVE 4.0 (legacy) & CVE 5.0+ (modern)                                                    |
| Parsing Accuracy    | **100 %** (validated)                                                                   |
| Enrichments         | CVSS v2 & v3 metrics · CWE taxonomy · Affected-product matrices · Expert system prompts |
| Conversation Depth  | Multi-turn (System / User / Assistant)                                                  |
| Processing Pipeline | Fully asynchronous, linearly scalable data-engineering architecture                     |
| License             | Apache License 2.0                                                                      |

## 3. Intended Use Cases

- **Fine-tuning LLMs** for vulnerability triage and severity prediction.
- **Temporal trend analysis** of vulnerability disclosures.
- **Retrieval-Augmented Generation (RAG)** and autonomous **AI-Agent** pipelines.
- **Real-time threat-intelligence** enrichment services.
- **Automated penetration-testing** (pentest) orchestration.

> **Benchmark Note**\
> Early experiments with *Llama 3.2* and *Gemma* models achieved **94 % accuracy** on CVE class-prediction tasks after full fine-tuning on this dataset.

## 4. Dataset Structure

Each dialogue turn is stored as one JSON object per line in a **JSON Lines (`.jsonl`)** file. Every object has **three top-level keys**:

```json
{
  "System": "You are a cybersecurity expert specializing in penetration testing, vulnerability research, and exploit development. Provide comprehensive technical analysis of CVE vulnerabilities with academic rigor and practical exploitation insights.",
  "User": "Provide a comprehensive technical analysis of CVE-2010-3763, including exploitation vectors, impact assessment, and remediation strategies.",
  "Assistant": "## CVE-2010-3763 Vulnerability Details ### CVE Metadata - **CVE ID**: CVE-2010-3763 - **State**: PUBLISHED ..."
}
```

### Field Reference

| Key         | Type   | Description                                                         |
|-------------|--------|---------------------------------------------------------------------|
| `System`    | string | System prompt that frames the assistant's role and response style.  |
| `User`      | string | End-user request or question.                                       |
| `Assistant` | string | Model answer containing enriched CVE analysis and metadata.         |

> **Note**: Multi-turn conversations are represented as separate JSONL lines that share the same `System` context while `User` and `Assistant` evolve turn by turn.

## 5. Processing Pipeline

1. **Source Aggregation** – CVE XML feeds (4.0) + JSON feeds (5.0+).
2. **Asynchronous Parsing** – Custom Rust & Python pipeline (Tokio + asyncio) for 100 % parsing success.
3. **Enrichment Layer** – CVSS scoring, CWE classification, product-matrix generation.
4. **Conversation Generation** – Expert prompts injected to produce System / User / Assistant structure.
5. **Validation & QA** – Schema checks, de-duplication, manual spot-checks.

## 6. Quick Start

### Load with 🤗 `datasets`

```python
from datasets import load_dataset

# Replace the placeholder with this dataset's Hub repository id.
cve_chat = load_dataset("<username>/<repo_name>", split="train")
print(cve_chat[0])
```

### Fine-tune Example (PEFT & QLoRA)

```bash
python train.py \
  --model "meta-llama/Meta-Llama-3-8B" \
  --dataset "<username>/<repo_name>" \
  --peft lora \
  --bits 4
```

## 7. Data Splits

| Split        | Records | Nominal share |
| ------------ | ------- | ------------- |
| `train`      | 240 000 | 80 %          |
| `validation` | 30 000  | 10 %          |
| `test`       | 27 441  | 10 %          |

## 8. Contact

Contributions, feedback, and pull requests are warmly welcomed!
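
Many chat fine-tuning tools expect a list of `role`/`content` messages rather than the `System` / `User` / `Assistant` keys described under Dataset Structure. The following is a minimal sketch of that conversion, not part of the dataset's own tooling: the helper names `record_to_messages` and `parse_jsonl` and the lowercase role names are assumptions made for illustration, and the sample record is a shortened stand-in for a real line.

```python
import json

# Assumed mapping from the dataset's three keys to conventional chat roles.
ROLE_BY_KEY = {"System": "system", "User": "user", "Assistant": "assistant"}


def record_to_messages(record):
    """Convert one System/User/Assistant record into a role/content list."""
    missing = [key for key in ROLE_BY_KEY if key not in record]
    if missing:
        raise KeyError(f"record is missing keys: {missing}")
    return [
        {"role": role, "content": record[key]}
        for key, role in ROLE_BY_KEY.items()
    ]


def parse_jsonl(text):
    """Yield one message list per non-empty JSON Lines row."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield record_to_messages(json.loads(line))


# Shortened illustrative record; real rows carry much longer strings.
sample = json.dumps({
    "System": "You are a cybersecurity expert.",
    "User": "Provide a technical analysis of CVE-2010-3763.",
    "Assistant": "## CVE-2010-3763 Vulnerability Details ...",
})

conversations = list(parse_jsonl(sample + "\n"))
for message in conversations[0]:
    print(message["role"], "->", message["content"][:40])
```

The same `record_to_messages` function could be passed to the `map` method of a dataset loaded with `load_dataset`, provided the loaded columns keep the three key names shown in the Field Reference.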

Application scenarios: