All-CVE-Chat-MultiTurn-1999-2025-Dataset
ModelScope community · Updated 2025-12-26 · Indexed 2025-08-30
Download link:
https://modelscope.cn/datasets/Trendyol/All-CVE-Chat-MultiTurn-1999-2025-Dataset
Resource description:
# CVE Chat‑Style Multi‑Turn Cybersecurity Dataset (1999 – 2025)
 
## 1. Project Overview
This repository hosts the **largest publicly available chat‑style, multi‑turn cybersecurity dataset to date**, containing **≈ 300 000 Common Vulnerabilities and Exposures (CVE) records** published between **1999 and 2025**. Each record has been meticulously parsed, enriched, and converted into a conversational format that is ideal for training and evaluating AI and AI‑Agent systems focused on vulnerability analysis, threat intelligence, and cyber‑defense automation.
## 2. Key Highlights
| Feature | Description |
| ------------------- | --------------------------------------------------------------------------------------- |
| Records | \~300 k CVE entries (1999‑2025) |
| Formats Covered | CVE 4.0 (legacy) & CVE 5.0+ (modern) |
| Parsing Accuracy | **100 %** (validated) |
| Enrichments | CVSS v2 & v3 metrics · CWE taxonomy · Affected‑product matrices · Expert system prompts |
| Conversation Depth | Multi‑turn (System / User / Assistant) |
| Processing Pipeline | Fully asynchronous, linearly scalable data‑engineering architecture |
| License | Apache License 2.0 |
## 3. Intended Use Cases
- **Fine‑tuning LLMs** for vulnerability triage and severity prediction.
- **Temporal trend analysis** of vulnerability disclosures.
- **Retrieval‑Augmented Generation (RAG)** and autonomous **AI‑Agent** pipelines.
- **Real‑time threat‑intelligence** enrichment services.
- **Automated penetration‑testing** (pentest) orchestration.
> **Benchmark Note**\
> Early experiments with *Llama 3.2* and *Gemma* models achieved **94 % accuracy** on CVE class‑prediction tasks after full fine‑tuning on this dataset.
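For the temporal trend use case listed above, one simple starting point is to bucket records by the year embedded in each CVE ID. The following is a minimal sketch under stated assumptions: `count_by_year` and the sample records are hypothetical, the CVE ID is assumed to appear in the `User` field, and the pattern accepts both the ASCII hyphen and the non-breaking hyphen (U+2011) that appears in this README's examples. Keep in mind that the year in a CVE ID reflects when the ID was assigned, which can differ from the publication date.

```python
import re
from collections import Counter

# Matches IDs such as CVE-2010-3763; also accepts the non-breaking hyphen (U+2011).
CVE_ID = re.compile(r"CVE[-\u2011](\d{4})[-\u2011]\d{4,}")

def count_by_year(records):
    """Count records per CVE-ID year, skipping records without a recognizable ID."""
    years = Counter()
    for record in records:
        match = CVE_ID.search(record.get("User", ""))
        if match:
            years[int(match.group(1))] += 1
    return years

# Hypothetical sample records shaped like the dataset's three-key objects.
sample = [
    {"System": "s", "User": "Analyze CVE-2010-3763.", "Assistant": "a"},
    {"System": "s", "User": "Analyze CVE\u20112010\u20110001.", "Assistant": "a"},
    {"System": "s", "User": "Analyze CVE-2021-44228.", "Assistant": "a"},
]

for year, n in sorted(count_by_year(sample).items()):
    print(year, n)
```

Counting on the ID year avoids parsing the free-text `Assistant` field, whose exact layout is not specified here.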
## 4. Dataset Structure
Each dialogue turn is stored as one JSON object per line of a **JSON Lines (`.jsonl`)** file, with **three top‑level keys** (the example below is pretty‑printed for readability):
```json
{
"System": "You are a cybersecurity expert specializing in penetration testing, vulnerability research, and exploit development. Provide comprehensive technical analysis of CVE vulnerabilities with academic rigor and practical exploitation insights.",
"User": "Provide a comprehensive technical analysis of CVE‑2010‑3763, including exploitation vectors, impact assessment, and remediation strategies.",
"Assistant": "## CVE‑2010‑3763 Vulnerability Details\n### CVE Metadata\n- **CVE ID**: CVE‑2010‑3763\n- **State**: PUBLISHED\n..."
}
```
### Field Reference
| Key | Type | Description |
|-------------|--------|---------------------------------------------------------------------|
| `System` | string | System prompt that frames the assistant’s role and response style. |
| `User` | string | End‑user request or question. |
| `Assistant` | string | Model answer containing enriched CVE analysis and metadata. |
> **Note**: Multi‑turn conversations are represented as separate JSONL lines that share the same `System` context while `User` and `Assistant` evolve turn by turn.
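Many chat fine-tuning tools expect a list of role-tagged messages rather than three top-level keys. The sketch below shows one possible conversion; `to_messages` and `ROLE_BY_KEY` are hypothetical helper names, and the lowercase role names are an assumption about the target chat template, not part of this dataset.

```python
import json

# Maps the dataset's keys to the role names used by common chat templates (assumed).
ROLE_BY_KEY = {"System": "system", "User": "user", "Assistant": "assistant"}

def to_messages(jsonl_line):
    """Convert one JSONL line into an ordered list of role/content messages."""
    record = json.loads(jsonl_line)
    return [
        {"role": ROLE_BY_KEY[key], "content": record[key]}
        for key in ("System", "User", "Assistant")
    ]

# Hypothetical single-line record in the three-key layout described above.
line = json.dumps({
    "System": "You are a cybersecurity expert.",
    "User": "Provide a technical analysis of CVE-2010-3763.",
    "Assistant": "## CVE-2010-3763 Vulnerability Details\n### CVE Metadata",
})

messages = to_messages(line)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

Because each JSONL line carries its own `System` prompt, every line converts independently, with no need to track state across lines.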
## 5. Processing Pipeline
1. **Source Aggregation** – CVE XML feeds (4.0) + JSON feeds (5.0+).
2. **Asynchronous Parsing** – Custom Rust & Python pipeline (Tokio + asyncio) for 100 % parsing success.
3. **Enrichment Layer** – CVSS scoring, CWE classification, product‑matrix generation.
4. **Conversation Generation** – Expert prompts injected to produce System / User / Assistant structure.
5. **Validation & QA** – Schema checks, de‑duplication, manual spot‑checks.
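The validation step above (schema checks and de-duplication) can be illustrated with a short sketch. This is not the authors' pipeline, which is not published here; `is_valid` and `validate_and_dedupe` are hypothetical names, and hashing a canonical serialization is just one of several ways to detect exact duplicates.

```python
import hashlib
import json

REQUIRED_KEYS = ("System", "User", "Assistant")

def is_valid(record):
    """Schema check: exactly the three expected keys, each a non-empty string."""
    return (
        isinstance(record, dict)
        and set(record) == set(REQUIRED_KEYS)
        and all(isinstance(record[k], str) and record[k].strip() for k in REQUIRED_KEYS)
    )

def validate_and_dedupe(lines):
    """Yield parsed records that pass the schema check, dropping exact duplicates."""
    seen = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skipped here; a real pipeline would log it
        if not is_valid(record):
            continue
        # Hash a canonical serialization so key order does not affect equality.
        canonical = json.dumps(record, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(canonical).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record

# Hypothetical input: one valid record, a reordered duplicate, and two bad lines.
lines = [
    '{"System": "s", "User": "u", "Assistant": "a"}',
    '{"User": "u", "Assistant": "a", "System": "s"}',
    '{"System": "s", "User": "u"}',
    'not json',
]
clean = list(validate_and_dedupe(lines))
print(len(clean), "record(s) kept")
```

Storing only digests keeps memory use modest even for a few hundred thousand records.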
## 6. Quick Start
### Load with 🤗 `datasets`
```python
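from datasets import load_dataset

# Replace the placeholder with the dataset's actual repository ID.
cve_chat = load_dataset("<username>/<repo_name>", split="train")
print(cve_chat[0])  # first record: a dict with System, User, and Assistant keys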
```
### Fine‑tuning Example (PEFT & QLoRA)
```bash
python train.py \
--model "meta-llama/Meta-Llama-3-8B" \
--dataset "<username>/<repo_name>" \
--peft lora \
--bits 4
```
## 7. Data Splits
| Split | Records | Approx. share |
| ------------ | ------- | ----- |
| `train` | 240 000 | 80 % |
| `validation` | 30 000 | 10 % |
| `test` | 27 441 | 10 % |
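The percentages in the table are rounded. As a quick arithmetic check, the record counts sum to 297 441, which matches the "≈ 300 000" figure in the overview, and the exact shares work out to roughly 80.7 %, 10.1 %, and 9.2 %. The snippet below reproduces that calculation from the table's numbers alone.

```python
# Record counts copied from the Data Splits table.
splits = {"train": 240_000, "validation": 30_000, "test": 27_441}

total = sum(splits.values())
print(f"total: {total}")

# Exact share of each split, formatted to one decimal place.
shares = {name: f"{count / total:.1%}" for name, count in splits.items()}
for name, share in shares.items():
    print(f"{name}: {share}")
```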
## 8. Contact
Contributions, feedback, and pull requests are warmly welcomed!
Provider:
maas
Created:
2025-08-01



