
invinciblejha01/Cybersecurity-Dataset-Fenrir-v2.0

Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/invinciblejha01/Cybersecurity-Dataset-Fenrir-v2.0
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- cybersecurity
- defensive-security
- instruction-tuning
size_categories:
- 10K<n<100K
dataset_info:
  version: 1.1.0
---

# Cybersecurity Defense Instruction-Tuning Dataset (v2.0)

<img src="https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0/resolve/main/Fenrir.png" width="700" />

Created by Alican Kiraz

## TL;DR

A ready-to-train dataset of **83,920** high-quality *system / user / assistant* triples for **defensive, alignment-safe cybersecurity SFT** training. Apache-2.0 licensed and production-ready.

**Scope:** OWASP Top 10, MITRE ATT&CK, NIST CSF, CIS Controls, ASD Essential 8, modern authentication (OAuth 2 / OIDC / SAML), SSL / TLS, Cloud & DevSecOps, Cryptography, and AI Security.

---

## 1. What's new in v2.0 (2025-10-06)

| Change | v1.1.0 | **v2.0.0** |
| ----------------- | ------------------------------------: | ----------------------------------------------------------------------------------------------: |
| **Rows** | 21,258 | **83,920** *(≈ 4×)* |
| **Coverage** | OWASP, NIST CSF, +MITRE ATT&CK, CIS | + Deeper Cloud (AWS/Azure/GCP), modern auth hardening, crypto hygiene, AI-security interplay |
| **Quality gates** | Dedup, PII scrub, hallucination scans | + adversarial refusal tests (jailbreak/injection), static policy linting, content risk taxonomy |
| **Format** | Parquet (chat triples) | + stricter schema checks, stable row IDs |
| **License** | Apache-2.0 | Apache-2.0 |

**Highlights**

* **Big**: 83,920 chat samples with explanations at **senior security engineer** level.
* **Safe-by-design**: Built-in **rejection patterns** and *alignment* checks against malicious or exploit requests.
* **Framework-aware**: Content is **mapped to standards** (OWASP / ATT&CK / NIST / CIS).
* **Cloud & DevSecOps first**: IAM, secrets, CI/CD, container / k8s hardening, logging / SIEM, incident response.
* **Commercial-friendly**: **Apache-2.0** licensed.

---

## 2. Dataset Summary

| Property | Value |
| ------------ | ------------------------------------------------------ |
| **Language** | English |
| **License** | Apache-2.0 |
| **Format** | Parquet (columnar) |
| **Rows** | **83,920** |
| **Columns** | `system`, `user`, `assistant` |
| **Split** | `train` (100 %) |

**Record schema (chat triple)**

```json
{
  "system": "You are a seasoned cyber‑defense AI that follows industry ethics...",
  "user": "Compare mitigations for Reflected vs Stored XSS in a modern SPA.",
  "assistant": "Reflected and Stored XSS share core mitigation pillars—output encoding..."
}
```

---

## 3. Coverage & Design

### 3.1 Domains & Frameworks

* **AppSec & Web**: OWASP Top 10, secure coding, input/output handling, SSRF, deserialization.
* **Cloud Security**: IAM guardrails, least privilege, key rotation, KMS/HSM, network segmentation, posture management.
* **DevSecOps**: SAST/DAST, SBOM, supply-chain, CI/CD signing, container & Kubernetes hardening.
* **Identity & Access**: OAuth2/OIDC/SAML, MFA/phishing-resistant auth, session management.
* **Crypto Hygiene**: TLS configs, AEAD modes, key lifecycle, randomness, password hashing.
* **Detection & Response**: logging, SIEM correlation, threat hunting, IR playbooks.
* **AI-Security Interplay**: prompt injection defense, data-poisoning awareness, model-misuse refusals.

### 3.2 Instruction styles

* Compare/contrast, step-by-step mitigation, checklists, "why it fails" root-cause analyses, policy rationale, trade-offs, and "refuse with explanation" for dual-use prompts.

---

## 4. Data Creation & Quality

1. **Source harvesting**: 250 k+ public technical docs (standards, RFCs, white-papers, vendor guidance).
2. **Extraction**: boilerplate stripping, language detection, heuristic paragraph segmentation.
3. **Topical filtering**: keyword+embedding retrieval towards defensive security only.
4. **Instruction synthesis**: prompts → *system/user/assistant*; enforced ethics & refusal templates.
5. **Quality gates** *(multi-layer)*
   * **Deduplication**: MinHash + LSH cluster pruning.
   * **PII & profanity scrub**.
   * **Hallucination/inconsistency scans** (LLM-aided).
   * **Refusal-pattern tests**: jailbreak & prompt-injection triggers; no exploit-building steps.
   * **Manual spot review** (~3 % sample).

---

## 5. Ethical Use & Safety

* **Dual-use risk**: Dataset intentionally avoids exploit crafting; offensive requests receive **explanatory refusals**.
* **Bias**: Focus on widely used frameworks (OWASP/NIST/CIS).
  * *Roadmap*: more regional standards (e.g., ISO/IEC, GDPR security controls).
* **Provenance**: Only public sources; licensing respected; outputs released under **Apache-2.0**.

---

## 6. Limitations

* English-only.
* Predominantly defensive stance; red-team tactics only for mitigation context.
* Security evolves rapidly; periodic refresh planned.

---

## 7. Example Records

**Mitigation checklist:** hardening steps, rationales, pitfalls, references to standards.

**Refusal sample:** clearly declines malware/exploit construction with safe alternatives (logging, detection, patching).

> *All examples adhere to the `system/user/assistant` schema and are engineered to be alignment-safe.*

---

## 8. Citation

```bibtex
@dataset{alican_kiraz_2025_heimdall_v2_0,
  author    = {Alican Kiraz},
  title     = {Fenrir v2.0 — Cybersecurity Defense Instruction-Tuning Dataset},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Heimdall-v2.0}
}
```

---

## 9. Changelog

* **v2.0.0** (2025-10-06): Expanded to **83,920** rows; deeper Cloud/DevSecOps/Identity coverage; stronger adversarial refusal tests; stricter schema checks.
* **v1.1.0** (2025-06-21): 21,258 rows; broadened framework coverage; improved automatic quality gates.
* **v1.0.0** (2025-06-17): Initial 2,500 rows.
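As an illustration of the record schema in section 2, the sketch below checks that a row carries the three documented columns and turns it into a role/content message list. The `to_messages` helper and the role/content layout are assumptions made for this example (a layout many chat templates accept); the dataset card does not prescribe either. The sample texts are shortened from the schema example.

```python
from typing import Dict, List

# Column names as listed in the Dataset Summary table.
REQUIRED_COLUMNS = ("system", "user", "assistant")


def to_messages(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Validate one chat triple and convert it to a role/content list.

    Hypothetical helper: the output layout is an assumption, not part of
    the dataset card.
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in record]
    if missing:
        raise KeyError(f"record is missing columns: {missing}")
    for col in REQUIRED_COLUMNS:
        value = record[col]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"column {col!r} must be a non-empty string")
    # Emit the turns in system, user, assistant order.
    return [{"role": col, "content": record[col]} for col in REQUIRED_COLUMNS]


# Sample row, shortened from the schema example in section 2.
sample = {
    "system": "You are a seasoned cyber-defense AI that follows industry ethics...",
    "user": "Compare mitigations for Reflected vs Stored XSS in a modern SPA.",
    "assistant": "Reflected and Stored XSS share core mitigation pillars...",
}

messages = to_messages(sample)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

Rows loaded from the Parquet file (for example with the Hugging Face `datasets` library and the `train` split) could be passed through the same helper one at a time; that path was not exercised here.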
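The deduplication gate in section 4 names MinHash + LSH. The card gives no implementation details, so the following is only a minimal sketch of the MinHash half: it estimates Jaccard similarity between two texts from short signatures. The shingle size, the number of hash functions, and the use of salted BLAKE2b hashes in place of random permutations are all assumptions chosen for the example, and the LSH banding step that would bucket signatures at scale is left out.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """Keep the minimum hash value per salted hash function.

    Each salt stands in for one random permutation (an assumption of this
    sketch, not a detail taken from the dataset card).
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions where both minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


base = "Rotate access keys every 90 days and store them in a managed secrets vault"
near_duplicate = base + " today"
unrelated = "Enable output encoding and a strict content security policy for every page"

sig_base = minhash_signature(shingles(base))
sig_near = minhash_signature(shingles(near_duplicate))
sig_other = minhash_signature(shingles(unrelated))

print("near duplicate:", estimated_jaccard(sig_base, sig_near))
print("unrelated:", estimated_jaccard(sig_base, sig_other))
```

In a pruning pass, pairs whose estimate exceeds a chosen threshold would be treated as one cluster and reduced to a single representative; the threshold is a tuning decision the card does not state.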

Provider:
invinciblejha01