Name: invinciblejha01/Cybersecurity-Dataset-Fenrir-v2.0
Creator: invinciblejha01
Published: 2026-04-14 16:45:10
License: No description available

Download link:

https://hf-mirror.com/datasets/invinciblejha01/Cybersecurity-Dataset-Fenrir-v2.0


Resource description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- cybersecurity
- defensive-security
- instruction-tuning
size_categories:
- 10K<n<100K
dataset_info:
  version: 1.1.0
---

# Cybersecurity Defense Instruction-Tuning Dataset (v2.0)

<img src="https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0/resolve/main/Fenrir.png" width="700" />

Created by Alican Kiraz

## TL;DR

A ready-to-train dataset of **83,920** high-quality *system / user / assistant* triples for **defensive, alignment-safe cybersecurity SFT** training. Apache-2.0 licensed and production-ready.

**Scope:** OWASP Top 10, MITRE ATT&CK, NIST CSF, CIS Controls, ASD Essential 8, modern authentication (OAuth 2 / OIDC / SAML), SSL / TLS, Cloud & DevSecOps, Cryptography, and AI Security.

---

## 1. What’s new in v2.0 (2025‑10‑06)

| Change            | v1.1.0                                | **v2.0.0**                                                                                      |
| ----------------- | ------------------------------------: | ----------------------------------------------------------------------------------------------: |
| **Rows**          | 21,258                                | **83,920** *(≈ 4×)*                                                                             |
| **Coverage**      | OWASP, NIST CSF, +MITRE ATT&CK, CIS   | + Deeper Cloud (AWS/Azure/GCP), modern auth hardening, crypto hygiene, AI‑security interplay    |
| **Quality gates** | Dedup, PII scrub, hallucination scans | + adversarial refusal tests (jailbreak/injection), static policy linting, content risk taxonomy |
| **Format**        | Parquet (chat triples)                | + stricter schema checks, stable row IDs                                                        |
| **License**       | Apache‑2.0                            | Apache‑2.0                                                                                      |

**Highlights**

* **Big**: 83,920 chat samples with explanations at **senior security engineer** level.
* **Safe-by-design**: Built-in **rejection patterns** and *alignment* checks against malicious or exploit requests.
* **Framework-aware**: Content is **mapped to standards** (OWASP / ATT&CK / NIST / CIS).
* **Cloud & DevSecOps first**: IAM, secrets, CI/CD, container / k8s hardening, logging / SIEM, incident response.
* **Commercial-friendly**: **Apache-2.0** licensed.
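
The *system / user / assistant* triples described in the TL;DR map naturally onto the role-tagged message lists that most SFT tooling consumes. The following is a minimal sketch of that conversion, not part of the dataset's own tooling: the function name `to_messages`, the `role`/`content` key names, and the non-empty-string check are assumptions made for illustration.

```python
# Hypothetical helper: the dataset card does not ship conversion code.
REQUIRED_FIELDS = ("system", "user", "assistant")


def to_messages(record):
    """Convert one chat triple into an ordered list of role-tagged messages.

    Raises ValueError when a field is missing, not a string, or blank.
    """
    messages = []
    for role in REQUIRED_FIELDS:
        content = record.get(role)
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"field {role!r} must be a non-empty string")
        messages.append({"role": role, "content": content})
    return messages


# Illustrative record shaped like the triples the card describes.
example = {
    "system": "You are a seasoned cyber-defense AI that follows industry ethics...",
    "user": "Compare mitigations for Reflected vs Stored XSS in a modern SPA.",
    "assistant": "Reflected and Stored XSS share core mitigation pillars...",
}
messages = to_messages(example)
```

Rejecting blank fields up front keeps malformed rows out of a training run rather than letting them surface later as empty turns.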
---

## 2. Dataset Summary

| Property     | Value                         |
| ------------ | ----------------------------- |
| **Language** | English                       |
| **License**  | Apache‑2.0                    |
| **Format**   | Parquet (columnar)            |
| **Rows**     | **83,920**                    |
| **Columns**  | `system`, `user`, `assistant` |
| **Split**    | `train` (100%)                |

**Record schema (chat triple)**

```json
{
  "system": "You are a seasoned cyber‑defense AI that follows industry ethics...",
  "user": "Compare mitigations for Reflected vs Stored XSS in a modern SPA.",
  "assistant": "Reflected and Stored XSS share core mitigation pillars—output encoding..."
}
```

---

## 3. Coverage & Design

### 3.1 Domains & Frameworks

* **AppSec & Web**: OWASP Top 10, secure coding, input/output handling, SSRF, deserialization.
* **Cloud Security**: IAM guardrails, least privilege, key rotation, KMS/HSM, network segmentation, posture mgmt.
* **DevSecOps**: SAST/DAST, SBOM, supply‑chain, CI/CD signing, container & Kubernetes hardening.
* **Identity & Access**: OAuth2/OIDC/SAML, MFA/Phishing‑resistant auth, session mgmt.
* **Crypto Hygiene**: TLS configs, AEAD modes, key lifecycle, randomness, password hashing.
* **Detection & Response**: logging, SIEM correlation, threat hunting, IR playbooks.
* **AI‑Security Interplay**: prompt injection defense, data‑poisoning awareness, model‑misuse refusals.

### 3.2 Instruction styles

* Compare/contrast, step‑by‑step mitigation, checklists, “why it fails” root‑cause analyses, policy rationale, trade‑offs, and “refuse with explanation” for dual‑use prompts.

---

## 4. Data Creation & Quality

1. **Source harvesting**: 250 k+ public technical docs (standards, RFCs, white‑papers, vendor guidance).
2. **Extraction**: boilerplate stripping, language detection, heuristic paragraph segmentation.
3. **Topical filtering**: keyword+embedding retrieval towards defensive security only.
4. **Instruction synthesis**: prompts → *system/user/assistant*; enforced ethics & refusal templates.
5. **Quality gates** *(multi‑layer)*
   * **Deduplication**: MinHash + LSH cluster pruning.
   * **PII & profanity scrub**.
   * **Hallucination/inconsistency scans** (LLM‑aided).
   * **Refusal‑pattern tests**: jailbreak & prompt‑injection triggers; no exploit‑building steps.
   * **Manual spot review** (~3% sample).

---

## 5. Ethical Use & Safety

* **Dual‑use risk**: Dataset intentionally avoids exploit crafting; offensive requests receive **explanatory refusals**.
* **Bias**: Focus on widely used frameworks (OWASP/NIST/CIS).
  * *Roadmap*: more regional standards (e.g., ISO/IEC, GDPR security controls).
* **Provenance**: Only public sources; licensing respected; outputs released under **Apache‑2.0**.

---

## 6. Limitations

* English‑only.
* Predominantly defensive stance; red‑team tactics only for mitigation context.
* Security evolves rapidly; periodic refresh planned.

---

## 7. Example Records

**Mitigation checklist:** hardening steps, rationales, pitfalls, references to standards.

**Refusal sample:** clearly declines malware/exploit construction with safe alternatives (logging, detection, patching).

> *All examples adhere to the `system/user/assistant` schema and are engineered to be alignment‑safe.*

---

## 8. Citation

```bibtex
@dataset{alican_kiraz_2025_heimdall_v2_0,
  author    = {Alican Kiraz},
  title     = {Fenrir v2.0 — Cybersecurity Defense Instruction-Tuning Dataset},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Heimdall-v2.0}
}
```

---

## 9. Changelog

* **v2.0.0** (2025‑10‑06): Expanded to **83,920** rows; deeper Cloud/DevSecOps/Identity coverage; stronger adversarial refusal tests; stricter schema checks.
* **v1.1.0** (2025‑06‑21): 21,258 rows; broadened framework coverage; improved automatic quality gates.
* **v1.0.0** (2025‑06‑17): Initial 2,500 rows.

---
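
The deduplication gate in section 4 names MinHash + LSH cluster pruning but the card does not publish its implementation. The block below is only a standard-library sketch of the general technique under stated assumptions: 3-word shingles, 64 seeded BLAKE2b hashes standing in for permutations, 16 LSH bands, and a keep-the-first-member pruning policy are all illustrative choices, as are the sample texts.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (assumed value)
BANDS = 16     # LSH bands; each band covers NUM_PERM // BANDS signature rows


def shingles(text, k=3):
    """Return the set of k-word shingles of the lower-cased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, shingle):
    """Stable 64-bit hash of a shingle; the seed plays the role of a permutation."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    sh = shingles(text)
    return tuple(min(seeded_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def near_duplicate_clusters(texts):
    """Cluster indices of texts whose signatures agree on at least one LSH band."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        sig = minhash(text)
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].append(idx)

    # Union-find merges every pair of indices that shares a bucket.
    parent = list(range(len(texts)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(list)
    for idx in range(len(texts)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


def prune(texts):
    """Keep only the first text of each near-duplicate cluster (assumed policy)."""
    return [texts[cluster[0]] for cluster in near_duplicate_clusters(texts)]


# Hypothetical sample texts; the first two differ only in case and spacing.
samples = [
    "Rotate KMS keys every 90 days",
    "rotate  kms keys every 90 days",
    "Enable MFA for all admin accounts",
]
kept = prune(samples)
```

More bands with fewer rows each make the filter more aggressive (lower similarity is enough to collide), while fewer, wider bands make it stricter; the values here are placeholders rather than tuned settings.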


Use cases: