
invinciblejha01/Trendyol-Cybersecurity-Instruction-Tuning-Dataset

Hugging Face · Updated 2026-04-14 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/invinciblejha01/Trendyol-Cybersecurity-Instruction-Tuning-Dataset
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- cybersecurity
- defensive-security
- instruction-tuning
- threat-intelligence
- incident-response
- security-operations
pretty_name: Trendyol Cybersecurity Defense Dataset
size_categories:
- 10K<n<100K
dataset_info:
  version: 1.0.0
---

# Trendyol Cybersecurity Defense Instruction-Tuning Dataset (v2.0)

<div align="center">
<img src="https://img.shields.io/badge/rows-53,202-brightgreen" alt="Dataset Size">
<img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="License">
<img src="https://img.shields.io/badge/language-English-red" alt="Language">
<img src="https://img.shields.io/badge/version-2.0.0-orange" alt="Version">
</div>

## 🚀 TL;DR

**53,202** meticulously curated *system/user/assistant* instruction-tuning examples covering **200+ specialized cybersecurity domains**. Built by the Trendyol Security Team for training state-of-the-art defensive security AI assistants. Expanded from 21K to 53K rows with comprehensive coverage of modern security challenges including cloud-native threats, AI/ML security, quantum computing risks, and advanced incident response techniques.
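
Each row carries three string columns (`system`, `user`, `assistant`), and many chat-tuning tools expect a list of role/content messages instead. The sketch below shows one possible conversion. The `to_messages` helper is hypothetical and not part of the dataset, and the sample strings are the truncated examples from this card's field table, not a real row.

```python
from typing import Dict, List

ROLES = ("system", "user", "assistant")


def to_messages(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert one system/user/assistant row into a chat-style message list.

    Hypothetical helper for illustration; the dataset does not ship it.
    """
    return [{"role": role, "content": record[role].strip()} for role in ROLES]


# Truncated sample values copied from the card's field table (not a real row).
example = {
    "system": "You are an expert cybersecurity professional...",
    "user": "How can I detect API gateway abuse in microservices?",
    "assistant": "API gateway abuse detection requires multi-layered...",
}

messages = to_messages(example)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

A list in this shape can usually be passed to a tokenizer's chat template, though whether a given base model defines one is something to check per model.
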

---

## 📊 What's New in v2.0 (2025-07-30)

| Metric | v1.1 | v2.0 | Change |
|--------|------|------|--------|
| **Total Rows** | 21,258 | **53,202** | +150.3% |
| **Unique Topics** | 50+ | **200+** | +300% |
| **Coverage Depth** | Basic-Intermediate | **Basic-Expert** | Enhanced |
| **Specialized Domains** | Traditional Security | **+ AI/ML, Quantum, Cloud-Native, OT/ICS** | Expanded |
| **Framework Integration** | MITRE ATT&CK, NIST | **+ STIX/TAXII, Diamond Model, Zero Trust** | Comprehensive |
| **Platform Specific** | Generic | **+ macOS, Cloud Providers, Container Orchestration** | Targeted |

### 🎯 Major Additions in v2.0

- **Advanced Threat Intelligence**: 5G networks, AI-powered analysis, quantum computing threats
- **Cloud-Native Security**: Kubernetes forensics, serverless security, multi-cloud environments
- **Emerging Technologies**: Post-quantum cryptography, DNA computing security, metamaterial computing
- **Platform-Specific**: Deep macOS security analysis, cloud provider-specific forensics
- **Operational Excellence**: SOAR automation, threat hunting metrics, incident response orchestration

---

## 📋 Dataset Summary

| Property | Value |
|----------|-------|
| **Language** | English |
| **License** | Apache 2.0 |
| **Format** | Parquet (optimized columnar storage) |
| **Total Rows** | 53,202 |
| **Columns** | `system`, `user`, `assistant` |
| **Splits** | `train` (90%), `validation` (5%), `test` (5%) |
| **Average Response Length** | ~700 tokens |
| **Compression Ratio** | 0.72 |

### 📊 Topic Distribution

```
Cloud Security & DevSecOps     : 18.5%
Threat Intelligence & Hunting  : 16.2%
Incident Response & Forensics  : 14.8%
AI/ML Security                 : 12.3%
Network & Protocol Security    : 11.7%
Identity & Access Management   :  9.4%
Emerging Technologies          :  8.6%
Platform-Specific Security     :  5.3%
Compliance & Governance        :  3.2%
```

---

## 🏗️ Dataset Structure

### Fields Description

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `system` | *string* | Role definition with ethical guidelines | "You are an expert cybersecurity professional..." |
| `user` | *string* | Realistic security question/scenario | "How can I detect API gateway abuse in microservices?" |
| `assistant` | *string* | Comprehensive technical response | "API gateway abuse detection requires multi-layered..." |

### Data Splits

```python
{
    "train": 47_882,      # 90%
    "validation": 2_660,  # 5%
    "test": 2_660,        # 5%
}
```

---

## 🔬 Dataset Creation Process

### 1. **Advanced Content Curation** (500K+ sources)

- Technical blogs, security advisories, CVE databases
- Academic papers, conference proceedings (BlackHat, DEF CON, RSA)
- Industry reports, threat intelligence feeds
- Platform-specific documentation (AWS, Azure, GCP, macOS)
- Regulatory frameworks and compliance standards

### 2. **Multi-Stage Processing Pipeline**

```
Raw Content → Language Detection → Topic Classification →
Instruction Synthesis → Quality Validation → Expert Review →
Ethical Filtering → Final Dataset
```

### 3. **Quality Assurance Framework**

- **Automated Checks**: Grammar, technical accuracy, response completeness
- **Deduplication**: Advanced MinHash LSH with semantic similarity
- **Hallucination Detection**: Fact-checking against authoritative sources
- **Ethical Compliance**: Offensive content filtering, dual-use prevention
- **Expert Validation**: 10% manual review by security professionals

### 4. **Topic Coverage Validation**

- Comprehensive mapping to industry frameworks (MITRE ATT&CK, NIST, ISO 27001)
- Cross-reference with current threat landscape reports
- Validation against real-world incident patterns

---

## 💻 Usage Examples

### Basic Loading

```python
from datasets import load_dataset

# Load the training split
dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="train")

# Load a specific split
val_dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="validation")

# First example
print(f"System: {dataset[0]['system']}")
print(f"User: {dataset[0]['user']}")
print(f"Assistant: {dataset[0]['assistant']}")
```

### Fine-Tuning Configuration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from peft import LoraConfig

# Argument names follow the original card; newer transformers/TRL releases
# have renamed some of them, so check the versions you have installed.

# Model configuration ("base-model-name" is a placeholder)
model = AutoModelForCausalLM.from_pretrained("base-model-name", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("base-model-name")

# The trainer below reads a single "text" column, but the dataset ships separate
# system/user/assistant columns. This concatenation template is an assumption,
# not a format the dataset prescribes.
def add_text(example):
    return {"text": f"{example['system']}\n\n{example['user']}\n\n{example['assistant']}"}

# `dataset` and `val_dataset` come from the Basic Loading example
dataset = dataset.map(add_text)
val_dataset = val_dataset.map(add_text)

# LoRA configuration for efficient fine-tuning
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./cybersec-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    logging_steps=25,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    gradient_checkpointing=True,
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=val_dataset,  # needed because evaluation_strategy="epoch"
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_seq_length=4096,
    dataset_text_field="text",  # Concatenated field built above
)
```

---

## 🎯 Specialized Coverage Areas

### 🔐 Advanced Topics Included

1. **Cloud-Native Security**
   - Multi-cloud forensics and incident response
   - Container and Kubernetes security
   - Serverless and FaaS security patterns
   - Cloud-native application protection (CNAPP)

2. **AI/ML Security**
   - Adversarial machine learning defense
   - Model poisoning detection
   - AI-powered threat intelligence
   - Federated learning for threat sharing

3. **Emerging Threats**
   - 5G network security and edge computing
   - Quantum computing threat landscape
   - Post-quantum cryptography implementation
   - Supply chain security automation

4. **Platform-Specific Security**
   - macOS security internals and forensics
   - Cloud provider-specific security controls
   - OT/ICS and critical infrastructure protection
   - Mobile and IoT security frameworks

---

## ⚖️ Ethical Considerations

### Responsible AI Guidelines

- **Defensive Focus**: All content emphasizes protection and defense, never attack techniques
- **Refusal Patterns**: Built-in responses for rejecting malicious requests
- **Dual-Use Prevention**: Careful curation to avoid enabling harmful activities
- **Privacy Protection**: No PII or sensitive organizational data included
- **Bias Mitigation**: Balanced representation across vendors, platforms, and methodologies

### Usage Restrictions

- Not for developing offensive security tools
- Not for bypassing security controls
- Not for unauthorized access or exploitation
- Must comply with local laws and regulations

---

## 🚧 Known Limitations

1. **Language**: English-only (multilingual expansion planned)
2. **Temporal**: Knowledge cutoff varies by source (majority 2024-2025)
3. **Geographic Bias**: Western-centric frameworks and regulations
4. **Rapid Evolution**: Security landscape changes require regular updates
5. **Complexity Balance**: Some topics may be too advanced for general practitioners

---

## 📚 Citation

```bibtex
@dataset{trendyol_2025_cybersec_v2,
  author    = {{Trendyol Security Team}},
  title     = {Trendyol Cybersecurity Defense Instruction-Tuning Dataset v2.0},
  year      = {2025},
  month     = {7},
  publisher = {Hugging Face},
  version   = {2.0.0},
}
```

---

## 🤝 Contributing

We welcome contributions from the security community! Please ensure:

- ✅ Defensive security focus
- ✅ Technical accuracy with references
- ✅ Follows the dataset schema
- ✅ Passes quality checks
- ✅ Includes appropriate documentation

---

## 🙏 Acknowledgments

Special thanks to the global cybersecurity community, security researchers, and open-source contributors who made this dataset possible. This work builds upon decades of collective knowledge in defensive security practices.

---

## 📜 Changelog

- **v2.0.0** (2025-07-30): Major expansion to 53K+ examples, 200+ topics, platform-specific content

---

<div align="center">
<i>Building a safer digital future through responsible AI and collaborative security intelligence.</i>
</div>
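
The row counts quoted in this card (21,258 rows in v1.1, 53,202 in v2.0, a 47,882 / 2,660 / 2,660 split, and the topic shares) can be cross-checked with a few lines of arithmetic. The sketch below only re-derives the published figures; the variable names are illustrative and it does not touch the dataset itself.

```python
# Figures copied from the card; nothing here is measured from the data files.
V1_ROWS = 21_258
V2_ROWS = 53_202
SPLITS = {"train": 47_882, "validation": 2_660, "test": 2_660}
TOPIC_SHARES = {
    "Cloud Security & DevSecOps": 18.5,
    "Threat Intelligence & Hunting": 16.2,
    "Incident Response & Forensics": 14.8,
    "AI/ML Security": 12.3,
    "Network & Protocol Security": 11.7,
    "Identity & Access Management": 9.4,
    "Emerging Technologies": 8.6,
    "Platform-Specific Security": 5.3,
    "Compliance & Governance": 3.2,
}

# Relative growth from v1.1 to v2.0, in percent.
growth_pct = (V2_ROWS - V1_ROWS) / V1_ROWS * 100

# The three splits should add up to the total row count.
split_total = sum(SPLITS.values())

# Share of each split, in percent of all rows.
split_pct = {name: rows / V2_ROWS * 100 for name, rows in SPLITS.items()}

# Topic shares should add up to roughly 100 percent.
topic_total = sum(TOPIC_SHARES.values())

print(f"growth: +{growth_pct:.1f}%")
print(f"split total: {split_total} (expected {V2_ROWS})")
for name, pct in split_pct.items():
    print(f"{name}: {pct:.1f}%")
print(f"topic shares sum: {topic_total:.1f}%")
```
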

Provider:
invinciblejha01