invinciblejha01/Cybersecurity-Dataset-Fenrir-v2.0
Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/invinciblejha01/Cybersecurity-Dataset-Fenrir-v2.0
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- cybersecurity
- defensive-security
- instruction-tuning
size_categories:
- 10K<n<100K
dataset_info:
  version: 2.0.0
---
# Cybersecurity Defense Instruction-Tuning Dataset (v2.0)
<img src="https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0/resolve/main/Fenrir.png" width="700" />
Created by Alican Kiraz
## TL;DR
A ready-to-train dataset of **83,920** high-quality *system / user / assistant* triples for **defensive, alignment-safe cybersecurity** supervised fine-tuning (SFT).
Apache-2.0 licensed and production-ready.
**Scope:** OWASP Top 10, MITRE ATT&CK, NIST CSF, CIS Controls, ASD Essential 8, modern authentication (OAuth 2 / OIDC / SAML), SSL / TLS, Cloud & DevSecOps, Cryptography, and AI Security.
---
## 1. What’s new in v2.0 (2025‑10‑06)
| Change | v1.1.0 | **v2.0.0** |
| ----------------- | ------------------------------------: | ----------------------------------------------------------------------------------------------: |
| **Rows** | 21 258 | **83 920** *(≈ 4×)* |
| **Coverage** | OWASP, NIST CSF, +MITRE ATT&CK, CIS | + Deeper Cloud (AWS/Azure/GCP), modern auth hardening, crypto hygiene, AI‑security interplay |
| **Quality gates** | Dedup, PII scrub, hallucination scans | + adversarial refusal tests (jailbreak/injection), static policy linting, content risk taxonomy |
| **Format** | Parquet (chat triples) | + stricter schema checks, stable row IDs |
| **License** | Apache‑2.0 | Apache‑2.0 |
**Highlights**
* **Big**: 83,920 chat samples with explanations at **senior security engineer** level.
* **Safe-by-design**: Built-in **refusal patterns** and *alignment* checks against malicious or exploit-seeking requests.
* **Framework-aware**: Content is **mapped to standards** (OWASP / ATT&CK / NIST / CIS).
* **Cloud & DevSecOps first**: IAM, secrets, CI/CD, container / k8s hardening, logging / SIEM, incident response.
* **Commercial-friendly**: **Apache-2.0** licensed.
---
## 2. Dataset Summary
| Property | Value |
| ------------ | ------------------------------------------------------ |
| **Language** | English |
| **License** | Apache‑2.0 |
| **Format** | Parquet (columnar) |
| **Rows** | **83 920** |
| **Columns** | `system`, `user`, `assistant` |
| **Split** | `train` (100 %) |
**Record schema (chat triple)**
```json
{
"system": "You are a seasoned cyber‑defense AI that follows industry ethics...",
"user": "Compare mitigations for Reflected vs Stored XSS in a modern SPA.",
"assistant": "Reflected and Stored XSS share core mitigation pillars—output encoding..."
}
```
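Many chat fine-tuning pipelines expect a role-tagged message list rather than three flat columns. The following is a minimal sketch of that conversion, assuming only the `system` / `user` / `assistant` columns documented above; the function name `triple_to_messages` and the `role` / `content` keys are illustrative choices, not part of the dataset.

```python
from typing import Dict, List

# Column order documented in the record schema above.
ROLES = ("system", "user", "assistant")


def triple_to_messages(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Map one system/user/assistant row onto a role-tagged message list."""
    return [{"role": role, "content": record[role]} for role in ROLES]


# Sample row (shortened text, same three columns as the card's example).
example = {
    "system": "You are a seasoned cyber-defense AI that follows industry ethics...",
    "user": "Compare mitigations for Reflected vs Stored XSS in a modern SPA.",
    "assistant": "Reflected and Stored XSS share core mitigation pillars...",
}

messages = triple_to_messages(example)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

The resulting list can usually be passed to a tokenizer's chat template, though the exact input shape depends on the training framework in use.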
---
## 3. Coverage & Design
### 3.1 Domains & Frameworks
* **AppSec & Web**: OWASP Top 10, secure coding, input/output handling, SSRF, deserialization.
* **Cloud Security**: IAM guardrails, least privilege, key rotation, KMS/HSM, network segmentation, posture mgmt.
* **DevSecOps**: SAST/DAST, SBOM, supply‑chain, CI/CD signing, container & Kubernetes hardening.
* **Identity & Access**: OAuth2/OIDC/SAML, MFA/Phishing‑resistant auth, session mgmt.
* **Crypto Hygiene**: TLS configs, AEAD modes, key lifecycle, randomness, password hashing.
* **Detection & Response**: logging, SIEM correlation, threat hunting, IR playbooks.
* **AI‑Security Interplay**: prompt injection defense, data‑poisoning awareness, model‑misuse refusals.
### 3.2 Instruction styles
* Compare/contrast, step‑by‑step mitigation, checklists, “why it fails” root‑cause analyses, policy rationale, trade‑offs, and “refuse with explanation” for dual‑use prompts.
---
## 4. Data Creation & Quality
1. **Source harvesting**: 250 k+ public technical docs (standards, RFCs, white‑papers, vendor guidance).
2. **Extraction**: boilerplate stripping, language detection, heuristic paragraph segmentation.
3. **Topical filtering**: keyword + embedding retrieval, restricted to defensive security content.
4. **Instruction synthesis**: prompts → *system/user/assistant*; enforced ethics & refusal templates.
5. **Quality gates** *(multi‑layer)*
   * **Deduplication**: MinHash + LSH cluster pruning.
   * **PII & profanity scrub**.
   * **Hallucination/inconsistency scans** (LLM‑aided).
   * **Refusal‑pattern tests**: jailbreak & prompt‑injection triggers; no exploit‑building steps.
   * **Manual spot review** (~3 % sample).
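The deduplication gate can be illustrated with a small standard-library sketch of MinHash signatures, banded LSH bucketing, and cluster pruning. This is not the pipeline used to build the dataset: the shingle size, `NUM_HASHES`, `BANDS`, and all function names below are assumptions chosen for readability.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NUM_HASHES = 64               # signature length (illustrative)
BANDS = 16                    # LSH bands
ROWS = NUM_HASHES // BANDS    # signature rows per band


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-grams of the lower-cased, whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def seeded_hash(seed: int, token: str) -> int:
    """64-bit hash of a token; the seed selects one of NUM_HASHES functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> Tuple[int, ...]:
    """MinHash signature: the minimum hash per seed over all shingles."""
    tokens = shingles(text)
    return tuple(min(seeded_hash(s, t) for t in tokens) for s in range(NUM_HASHES))


def near_duplicate_clusters(docs: List[str]) -> List[List[int]]:
    """Group document indices that share at least one LSH band bucket."""
    parent = list(range(len(docs)))  # union-find forest

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    buckets: Dict[Tuple[int, Tuple[int, ...]], List[int]] = defaultdict(list)
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        for band in range(BANDS):
            buckets[(band, sig[band * ROWS:(band + 1) * ROWS])].append(idx)

    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters: Dict[int, List[int]] = defaultdict(list)
    for idx in range(len(docs)):
        clusters[find(idx)].append(idx)
    return [group for group in clusters.values() if len(group) > 1]


def prune(docs: List[str]) -> List[str]:
    """Keep the first document of every near-duplicate cluster."""
    drop = {i for group in near_duplicate_clusters(docs) for i in group[1:]}
    return [doc for i, doc in enumerate(docs) if i not in drop]
```

With 16 bands of 4 rows, pairs with high Jaccard similarity are very likely to collide in at least one band, while dissimilar pairs rarely do. A production pipeline would tune these values and typically verify candidate pairs before dropping anything.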
---
## 5. Ethical Use & Safety
* **Dual‑use risk**: Dataset intentionally avoids exploit crafting; offensive requests receive **explanatory refusals**.
* **Bias**: Focus on widely used frameworks (OWASP/NIST/CIS).
  * *Roadmap*: more regional standards (e.g., ISO/IEC, GDPR security controls).
* **Provenance**: Only public sources; licensing respected; outputs released under **Apache‑2.0**.
---
## 6. Limitations
* English‑only.
* Predominantly defensive stance; red‑team tactics appear only as context for mitigations.
* Security evolves rapidly; periodic refresh planned.
---
## 7. Example Records
**Mitigation checklist:** hardening steps, rationales, pitfalls, references to standards.
**Refusal sample:** clearly declines malware/exploit construction with safe alternatives (logging, detection, patching).
> *All examples adhere to the `system/user/assistant` schema and are engineered to be alignment‑safe.*
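A consumer of the dataset may want to confirm that every row really matches the `system/user/assistant` schema before training. The check below is a hedged sketch: the function `schema_errors` and its error strings are hypothetical, and it deliberately ignores extra columns (such as a row ID) because the card does not list them.

```python
from typing import Any, List, Mapping

# The three columns documented in the Dataset Summary.
REQUIRED_COLUMNS = ("system", "user", "assistant")


def schema_errors(record: Mapping[str, Any]) -> List[str]:
    """Return a list of schema problems for one row; empty means the row passes."""
    errors: List[str] = []
    for column in REQUIRED_COLUMNS:
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], str) or not record[column].strip():
            errors.append(f"empty or non-string column: {column}")
    return errors


rows = [
    {"system": "You are a defensive security assistant.",
     "user": "List hardening steps for SSH.",
     "assistant": "Disable password login, restrict users, ..."},
    {"system": "You are a defensive security assistant.", "user": "   "},
]

for index, row in enumerate(rows):
    problems = schema_errors(row)
    print(index, "ok" if not problems else problems)
```

Rows that report problems can be logged and excluded rather than silently dropped, which keeps any later audit of the training set straightforward.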
---
## 8. Citation
```bibtex
@dataset{alican_kiraz_2025_fenrir_v2_0,
author = {Alican Kiraz},
title = {Fenrir v2.0 — Cybersecurity Defense Instruction-Tuning Dataset},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0}
}
```
---
## 9. Changelog
* **v2.0.0** (2025‑10‑06) — Expanded to **83 920** rows; deeper Cloud/DevSecOps/Identity coverage; stronger adversarial refusal tests; stricter schema checks.
* **v1.1.0** (2025‑06‑21) — 21 258 rows; broadened framework coverage; improved automatic quality gates.
* **v1.0.0** (2025‑06‑17) — Initial 2 500 rows.
---
Provider:
invinciblejha01