
ukcli/All-CVE-Records-Training-Dataset

Source: Hugging Face · Updated 2026-04-13 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ukcli/All-CVE-Records-Training-Dataset
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- cybersecurity
- cve
- vulnerability
size_categories:
- 100K<n<1M
---

# CVE Chat-Style Multi-Turn Cybersecurity Dataset (1999 – 2025)

## 1. Project Overview

This repository hosts the **largest publicly available chat-style, multi-turn cybersecurity dataset to date**, containing **≈ 300 000 Common Vulnerabilities and Exposures (CVE) records** published between **1999 and 2025**. Each record has been parsed, enriched, and converted into a conversational format suited to training and evaluating AI and AI-Agent systems focused on vulnerability analysis, threat intelligence, and cyber-defense automation.

## 2. Key Highlights

| Feature             | Description                                                                             |
| ------------------- | --------------------------------------------------------------------------------------- |
| Records             | ~300 k CVE entries (1999-2025)                                                          |
| Formats Covered     | CVE 4.0 (legacy) & CVE 5.0+ (modern)                                                    |
| Parsing Accuracy    | **100 %** (validated)                                                                   |
| Enrichments         | CVSS v2 & v3 metrics · CWE taxonomy · Affected-product matrices · Expert system prompts |
| Conversation Depth  | Multi-turn (System / User / Assistant)                                                  |
| Processing Pipeline | Fully asynchronous, linearly scalable data-engineering architecture                     |
| License             | Apache License 2.0                                                                      |

## 3. Intended Use Cases

- **Fine-tuning LLMs** for vulnerability triage and severity prediction.
- **Temporal trend analysis** of vulnerability disclosures.
- **Retrieval-Augmented Generation (RAG)** and autonomous **AI-Agent** pipelines.
- **Real-time threat-intelligence** enrichment services.
- **Automated penetration-testing** (pentest) orchestration.

> **Benchmark Note**\
> Early experiments with *Llama 3.2* and *Gemma* models achieved **94 % accuracy** on CVE class-prediction tasks after full fine-tuning on this dataset.
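As a rough illustration of the temporal trend analysis use case, the sketch below counts records per CVE-ID year. It assumes each record's `User` field names the CVE ID, as in the sample record shown under Dataset Structure; the function name `cve_year_counts` and the sample prompts are hypothetical. Note that the year embedded in a CVE ID is the year the ID was assigned, which may differ from the publication year.

```python
import re
from collections import Counter

# CVE IDs have the form CVE-YYYY-NNNN (four or more digits in the sequence part).
# The pattern also tolerates the non-breaking hyphen (U+2011) in case the text
# was typeset that way.
CVE_ID = re.compile(r"CVE[-\u2011](\d{4})[-\u2011]\d{4,}")


def cve_year_counts(records):
    """Count records per CVE-ID year, reading the ID from the `User` field."""
    counts = Counter()
    for record in records:
        match = CVE_ID.search(record.get("User", ""))
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical sample records; a real run would iterate over the dataset.
sample = [
    {"User": "Provide a comprehensive technical analysis of CVE-2010-3763."},
    {"User": "Summarize CVE-2010-0001 and its remediation."},
    {"User": "Explain CVE-2024-12345."},
]

for year, n in sorted(cve_year_counts(sample).items()):
    print(year, n)
```

Records whose `User` text contains no recognizable CVE ID are simply skipped, so the counts are a lower bound.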
## 4. Dataset Structure

Each dialogue turn is stored as one JSON object per line of a **JSON Lines (`.jsonl`)** file, with **three top-level keys**:

```json
{
  "System": "You are a cybersecurity expert specializing in penetration testing, vulnerability research, and exploit development. Provide comprehensive technical analysis of CVE vulnerabilities with academic rigor and practical exploitation insights.",
  "User": "Provide a comprehensive technical analysis of CVE-2010-3763, including exploitation vectors, impact assessment, and remediation strategies.",
  "Assistant": "## CVE-2010-3763 Vulnerability Details ### CVE Metadata - **CVE ID**: CVE-2010-3763 - **State**: PUBLISHED ..."
}
```

### Field Reference

| Key         | Type   | Description                                                         |
|-------------|--------|---------------------------------------------------------------------|
| `System`    | string | System prompt that frames the assistant’s role and response style.  |
| `User`      | string | End-user request or question.                                       |
| `Assistant` | string | Model answer containing enriched CVE analysis and metadata.         |

> **Note**: Multi-turn conversations are represented as separate JSONL lines that share the same `System` context while `User` and `Assistant` evolve turn by turn.

## 5. Processing Pipeline

1. **Source Aggregation** – CVE XML feeds (4.0) + JSON feeds (5.0+).
2. **Asynchronous Parsing** – Custom Rust & Python pipeline (Tokio + asyncio) for 100 % parsing success.
3. **Enrichment Layer** – CVSS scoring, CWE classification, product-matrix generation.
4. **Conversation Generation** – Expert prompts injected to produce System / User / Assistant structure.
5. **Validation & QA** – Schema checks, de-duplication, manual spot-checks.
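The schema check and de-duplication in step 5 could look roughly like the sketch below. This is not the authors' pipeline, only a minimal illustration assuming the three-key record format from the Field Reference; the function name `validate_and_dedupe` and the choice of de-duplication key (the full `System`/`User`/`Assistant` triple) are assumptions.

```python
import json

REQUIRED_KEYS = ("System", "User", "Assistant")


def validate_and_dedupe(lines):
    """Yield schema-valid, unique records from an iterable of JSONL lines."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed JSON
        if not isinstance(record, dict):
            continue  # each line must be a JSON object
        # Every required key must be present and hold a non-empty string.
        if not all(isinstance(record.get(k), str) and record[k] for k in REQUIRED_KEYS):
            continue
        # Assumed de-duplication key: the exact text of all three fields.
        key = tuple(record[k] for k in REQUIRED_KEYS)
        if key in seen:
            continue
        seen.add(key)
        yield record


# Hypothetical input lines covering the valid, duplicate, and invalid cases.
raw = [
    '{"System": "s", "User": "u1", "Assistant": "a1"}',
    '{"System": "s", "User": "u1", "Assistant": "a1"}',
    '{"System": "s", "User": "u2"}',
    "not json",
]
clean = list(validate_and_dedupe(raw))
print(len(clean))
```

Exact-match de-duplication is the simplest choice; a real pipeline might instead key on the CVE ID or a normalized form of the text.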
## 6. Quick Start

### Load with 🤗 `datasets`

```python
from datasets import load_dataset

# Replace the placeholder with the dataset's Hub ID.
cve_chat = load_dataset("<username>/<repo_name>", split="train")
print(cve_chat[0])
```

### Fine-tune Example (PEFT & QLoRA)

```bash
python train.py \
  --model "meta-llama/Meta-Llama-3-8B" \
  --dataset "<username>/<repo_name>" \
  --peft lora \
  --bits 4
```

## 7. Data Splits

| Split        | Records | Notes  |
| ------------ | ------- | ------ |
| `train`      | 240 000 | ≈ 81 % |
| `validation` | 30 000  | ≈ 10 % |
| `test`       | 27 441  | ≈ 9 %  |

## 8. Contact

Contributions, feedback, and pull requests are warmly welcomed!
Provider: ukcli