five

Trendyol-Cybersecurity-Instruction-Tuning-Dataset

收藏
魔搭社区2026-01-06 更新2025-08-16 收录
下载链接:
https://modelscope.cn/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset
下载链接
链接失效反馈
官方服务:
资源简介:
# Trendyol Cybersecurity Defense Instruction-Tuning Dataset (v2.0) <div align="center"> <img src="https://img.shields.io/badge/rows-53,202-brightgreen" alt="Dataset Size"> <img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="License"> <img src="https://img.shields.io/badge/language-English-red" alt="Language"> <img src="https://img.shields.io/badge/version-2.0.0-orange" alt="Version"> </div> ## 🚀 TL;DR **53,202** meticulously curated *system/user/assistant* instruction-tuning examples covering **200+ specialized cybersecurity domains**. Built by the Trendyol Security Team for training state-of-the-art defensive security AI assistants. Expanded from 21K to 53K rows with comprehensive coverage of modern security challenges including cloud-native threats, AI/ML security, quantum computing risks, and advanced incident response techniques. --- ## 📊 What's New in v2.0 (2025-07-30) | Metric | v1.1 | v2.0 | Change | |--------|------|------|--------| | **Total Rows** | 21,258 | **53,202** | +150.3% | | **Unique Topics** | 50+ | **200+** | +300% | | **Coverage Depth** | Basic-Intermediate | **Basic-Expert** | Enhanced | | **Specialized Domains** | Traditional Security | **+ AI/ML, Quantum, Cloud-Native, OT/ICS** | Expanded | | **Framework Integration** | MITRE ATT&CK, NIST | **+ STIX/TAXII, Diamond Model, Zero Trust** | Comprehensive | | **Platform Specific** | Generic | **+ macOS, Cloud Providers, Container Orchestration** | Targeted | ### 🎯 Major Additions in v2.0 - **Advanced Threat Intelligence**: 5G networks, AI-powered analysis, quantum computing threats - **Cloud-Native Security**: Kubernetes forensics, serverless security, multi-cloud environments - **Emerging Technologies**: Post-quantum cryptography, DNA computing security, metamaterial computing - **Platform-Specific**: Deep macOS security analysis, cloud provider-specific forensics - **Operational Excellence**: SOAR automation, threat hunting metrics, incident response orchestration --- ## 📋 Dataset Summary | Property | Value | |----------|-------| | **Language** | English | | **License** | Apache 2.0 | | **Format** | Parquet (optimized columnar storage) | | **Total Rows** | 53,202 | | **Columns** | `system`, `user`, `assistant` | | **Splits** | `train` (90%), `validation` (5%), `test` (5%) | | **Average Response Length** | ~700 tokens | | **Compression Ratio** | 0.72 | ### 📊 Topic Distribution ``` Cloud Security & DevSecOps : 18.5% Threat Intelligence & Hunting : 16.2% Incident Response & Forensics : 14.8% AI/ML Security : 12.3% Network & Protocol Security : 11.7% Identity & Access Management : 9.4% Emerging Technologies : 8.6% Platform-Specific Security : 5.3% Compliance & Governance : 3.2% ``` --- ## 🏗️ Dataset Structure ### Fields Description | Field | Type | Description | Example | |-------|------|-------------|---------| | `system` | *string* | Role definition with ethical guidelines | "You are an expert cybersecurity professional..." | | `user` | *string* | Realistic security question/scenario | "How can I detect API gateway abuse in microservices?" | | `assistant` | *string* | Comprehensive technical response | "API gateway abuse detection requires multi-layered..." | ### Data Splits ```python { "train": 47,882, # 90% "validation": 2,660, # 5% "test": 2,660 # 5% } ``` --- ## 🔬 Dataset Creation Process ### 1. **Advanced Content Curation** (500K+ sources) - Technical blogs, security advisories, CVE databases - Academic papers, conference proceedings (BlackHat, DEF CON, RSA) - Industry reports, threat intelligence feeds - Platform-specific documentation (AWS, Azure, GCP, macOS) - Regulatory frameworks and compliance standards ### 2. **Multi-Stage Processing Pipeline** ``` Raw Content → Language Detection → Topic Classification → Instruction Synthesis → Quality Validation → Expert Review → Ethical Filtering → Final Dataset ``` ### 3. **Quality Assurance Framework** - **Automated Checks**: Grammar, technical accuracy, response completeness - **Deduplication**: Advanced MinHash LSH with semantic similarity - **Hallucination Detection**: Fact-checking against authoritative sources - **Ethical Compliance**: Offensive content filtering, dual-use prevention - **Expert Validation**: 10% manual review by security professionals ### 4. **Topic Coverage Validation** - Comprehensive mapping to industry frameworks (MITRE ATT&CK, NIST, ISO 27001) - Cross-reference with current threat landscape report1 - Validation against real-world incident patterns --- ## 💻 Usage Examples ### Basic Loading ```python from datasets import load_dataset # Load the full dataset dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="train") # Load specific split val_dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="validation") # First example print(f"System: {dataset[0]['system']}") print(f"User: {dataset[0]['user']}") print(f"Assistant: {dataset[0]['assistant']}") ``` ### Fine-Tuning Configuration ```python from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments from trl import SFTTrainer from peft import LoraConfig, get_peft_model # Model configuration model = AutoModelForCausalLM.from_pretrained("base-model-name", load_in_8bit=True) tokenizer = AutoTokenizer.from_pretrained("base-model-name") # LoRA configuration for efficient fine-tuning peft_config = LoraConfig( r=16, lora_alpha=32, lora_dropout=0.1, bias="none", task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj", "k_proj", "o_proj"] ) # Training configuration training_args = TrainingArguments( output_dir="./cybersec-finetuned", num_train_epochs=3, per_device_train_batch_size=4, gradient_accumulation_steps=4, warmup_steps=100, logging_steps=25, save_strategy="epoch", evaluation_strategy="epoch", learning_rate=2e-4, bf16=True, gradient_checkpointing=True, ) # Initialize trainer trainer = SFTTrainer( model=model, args=training_args, train_dataset=dataset, tokenizer=tokenizer, peft_config=peft_config, max_seq_length=4096, dataset_text_field="text", # Concatenated field ) ``` --- ## 🎯 Specialized Coverage Areas ### 🔐 Advanced Topics Included 1. **Cloud-Native Security** - Multi-cloud forensics and incident response - Container and Kubernetes security - Serverless and FaaS security patterns - Cloud-native application protection (CNAPP) 2. **AI/ML Security** - Adversarial machine learning defense - Model poisoning detection - AI-powered threat intelligence - Federated learning for threat sharing 3. **Emerging Threats** - 5G network security and edge computing - Quantum computing threat landscape - Post-quantum cryptography implementation - Supply chain security automation 4. **Platform-Specific Security** - macOS security internals and forensics - Cloud provider-specific security controls - OT/ICS and critical infrastructure protection - Mobile and IoT security frameworks --- ## ⚖️ Ethical Considerations ### Responsible AI Guidelines - **Defensive Focus**: All content emphasizes protection and defense, never attack techniques - **Refusal Patterns**: Built-in responses for rejecting malicious requests - **Dual-Use Prevention**: Careful curation to avoid enabling harmful activities - **Privacy Protection**: No PII or sensitive organizational data included - **Bias Mitigation**: Balanced representation across vendors, platforms, and methodologies ### Usage Restrictions - Not for developing offensive security tools - Not for bypassing security controls - Not for unauthorized access or exploitation - Must comply with local laws and regulations --- ## 🚧 Known Limitations 1. **Language**: English-only (multilingual expansion planned) 2. **Temporal**: Knowledge cutoff varies by source (majority 2024-2025) 3. **Geographic Bias**: Western-centric frameworks and regulations 4. **Rapid Evolution**: Security landscape changes require regular updates 5. **Complexity Balance**: Some topics may be too advanced for general practitioners --- ## 📚 Citation ```bibtex @dataset{trendyol_2025_cybersec_v2, author = {{Trendyol Security Team}}, title = {Trendyol Cybersecurity Defense Instruction-Tuning Dataset v2.0}, year = {2025}, month = {7}, publisher = {Hugging Face}, version = {2.0.0}, } ``` --- ## 🤝 Contributing We welcome contributions from the security community! Please ensure: - ✅ Defensive security focus - ✅ Technical accuracy with references - ✅ Follows the dataset schema - ✅ Passes quality checks - ✅ Includes appropriate documentation --- ## 🙏 Acknowledgments Special thanks to the global cybersecurity community, security researchers, and open-source contributors who made this dataset possible. This work builds upon decades of collective knowledge in defensive security practices. --- ## 📜 Changelog - **v2.0.0** (2025-07-30): Major expansion to 53K+ examples, 200+ topics, platform-specific content --- <div align="center"> <i>Building a safer digital future through responsible AI and collaborative security intelligence.</i> </div>

# Trendyol 网络安全防御指令微调数据集(v2.0) <div align="center"> <img src="https://img.shields.io/badge/rows-53,202-brightgreen" alt="数据集规模"> <img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="许可证"> <img src="https://img.shields.io/badge/language-English-red" alt="语言"> <img src="https://img.shields.io/badge/version-2.0.0-orange" alt="版本"> </div> ## 🚀 速览(Too Long Didn't Read,TL;DR) **53,202** 条经过严谨筛选与整理的*系统(system)/用户(user)/助手(assistant)*格式的指令微调样本,覆盖**200余个细分网络安全领域**。本数据集由Trendyol安全团队构建,用于训练最先进的防御型网络安全AI智能体(AI Agent)。该数据集从2.1万条扩展至5.3万条,全面覆盖现代安全挑战,包括云原生威胁、AI/ML安全、量子计算风险以及高级事件响应技术。 --- ## 📊 v2.0版本更新内容(2025年7月30日) | 指标 | v1.1 | v2.0 | 变化幅度 | |--------|------|------|--------| | **总样本数** | 21,258 | **53,202** | +150.3% | | **唯一主题数** | 50+ | **200+** | +300% | | **覆盖深度** | 基础-中级 | **基础-专家级** | 覆盖范围升级 | | **细分领域** | 传统安全领域 | **新增AI/ML、量子计算、云原生、OT/ICS领域** | 领域边界大幅扩展 | | **框架集成** | MITRE ATT&CK、NIST | **新增STIX/TAXII、钻石模型(Diamond Model)、零信任(Zero Trust)** | 集成覆盖更全面 | | **特定平台覆盖** | 通用场景 | **新增macOS、云服务商、容器编排平台** | 覆盖更具针对性 | ### 🎯 v2.0版本新增核心内容 - **高级威胁情报**:5G网络、AI驱动分析、量子计算威胁 - **云原生安全**:Kubernetes取证、无服务器安全、多云环境 - **新兴技术安全**:后量子密码学、DNA计算安全、超材料计算安全 - **特定平台安全**:深度macOS安全分析、云服务商专属取证 - **运营优化**:安全编排自动化与响应(SOAR)自动化、威胁狩猎指标、事件响应编排 --- ## 📋 数据集概览 | 属性 | 取值 | |----------|-------| | **语言** | 英语 | | **许可证** | Apache 2.0 | | **存储格式** | Parquet(列式优化存储) | | **总样本数** | 53,202 | | **字段** | `system`、`user`、`assistant` | | **数据集划分** | 训练集(train,90%)、验证集(validation,5%)、测试集(test,5%) | | **平均响应长度** | 约700个Token(Token) | | **压缩比** | 0.72 | ### 📊 主题分布 云安全与DevSecOps : 18.5% 威胁情报与狩猎 : 16.2% 事件响应与取证 : 14.8% AI/ML安全 : 12.3% 网络与协议安全 : 11.7% 身份与访问管理 : 9.4% 新兴技术安全 : 8.6% 特定平台安全 : 5.3% 合规与治理 : 3.2% --- ## 🏗️ 数据集结构 ### 字段说明 | 字段 | 数据类型 | 说明 | 示例 | |-------|------|-------------|---------| | `system` | 字符串(string) | 带有伦理准则的角色定义 | "你是一名资深网络安全专家……" | | `user` | 字符串(string) | 贴合实际的安全问题或场景 | "如何检测微服务中的API网关滥用行为?" | | `assistant` | 字符串(string) | 完整的技术响应 | "API网关滥用检测需要多层级的……" | ### 数据集划分 python { "train": 47882, # 训练集:47882条,占比90% "validation": 2660, # 验证集:2660条,占比5% "test": 2660 # 测试集:2660条,占比5% } --- ## 🔬 数据集构建流程 ### 1. **高级内容筛选**(超50万个数据源) - 技术博客、安全公告、通用漏洞披露(CVE)数据库 - 学术论文、会议论文集(BlackHat、DEF CON、RSA大会) - 行业报告、威胁情报源 - 特定平台文档(AWS、Azure、GCP、macOS) - 监管框架与合规标准 ### 2. **多阶段处理流水线** 原始内容 → 语言检测 → 主题分类 → 指令合成 → 质量验证 → 专家审核 → 伦理过滤 → 最终数据集 ### 3. **质量保障框架** - **自动化检查**:语法、技术准确性、响应完整性 - **去重处理**:基于MinHash LSH的高级语义相似度去重 - **幻觉检测**:基于权威来源的事实校验 - **伦理合规**:攻击性内容过滤、两用技术防范 - **专家审核**:由安全专业人员完成10%的人工审核 ### 4. **主题覆盖验证** - 与行业框架(MITRE ATT&CK、NIST、ISO 27001)的全面映射 - 与当前威胁态势报告的交叉验证 - 与真实世界事件模式的比对验证 --- ## 💻 使用示例 ### 基础加载 python from datasets import load_dataset # 加载完整训练集 dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="train") # 加载指定划分数据集 val_dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="validation") # 查看第一条样本 print(f"系统提示:{dataset[0]['system']}") print(f"用户提问:{dataset[0]['user']}") print(f"助手回复:{dataset[0]['assistant']}") ### 微调配置 python from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments from trl import SFTTrainer # 监督微调训练器(Supervised Fine-Tuning Trainer) from peft import LoraConfig, get_peft_model # 模型配置 model = AutoModelForCausalLM.from_pretrained("基础模型名称", load_in_8bit=True) tokenizer = AutoTokenizer.from_pretrained("基础模型名称") # 用于高效微调的LoRA(Low-Rank Adaptation,低秩适配)配置 peft_config = LoraConfig( r=16, lora_alpha=32, lora_dropout=0.1, bias="none", task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj", "k_proj", "o_proj"] ) # 训练配置 training_args = TrainingArguments( output_dir="./cybersec-finetuned", num_train_epochs=3, per_device_train_batch_size=4, gradient_accumulation_steps=4, warmup_steps=100, logging_steps=25, save_strategy="epoch", evaluation_strategy="epoch", learning_rate=2e-4, bf16=True, gradient_checkpointing=True, ) # 初始化训练器 trainer = SFTTrainer( model=model, args=training_args, train_dataset=dataset, tokenizer=tokenizer, peft_config=peft_config, max_seq_length=4096, dataset_text_field="text", # 拼接后的文本字段 ) --- ## 🎯 专属覆盖领域 ### 🔐 包含的高级主题 1. **云原生安全** - 多云取证与事件响应 - 容器与Kubernetes安全 - 无服务器与函数即服务(FaaS)安全模式 - 云原生应用保护平台(CNAPP) 2. **AI/ML安全** - 对抗性机器学习防御 - 模型投毒检测 - AI驱动的威胁情报 - 用于威胁共享的联邦学习 3. **新兴威胁** - 5G网络安全与边缘计算 - 量子计算威胁态势 - 后量子密码学实施 - 供应链安全自动化 4. **特定平台安全** - macOS安全内核与取证 - 云服务商专属安全控制 - OT/ICS与关键基础设施防护 - 移动与IoT安全框架 --- ## ⚖️ 伦理考量 ### 负责任AI准则 - **防御导向**:所有内容均强调防护与防御,绝不涉及攻击技术 - **拒绝机制**:内置针对恶意请求的拒绝响应模板 - **两用技术防范**:经过严谨筛选,避免内容被用于有害活动 - **隐私保护**:未包含任何个人可识别信息(PII)或敏感组织数据 - **偏差缓解**:在厂商、平台与方法论间保持平衡的样本分布 ### 使用限制 - 不得用于开发攻击性安全工具 - 不得用于绕过安全控制 - 不得用于未经授权的访问或漏洞利用 - 必须遵守当地法律法规 --- ## 🚧 已知局限性 1. **语言限制**:仅支持英语(计划推出多语言版本) 2. **时效性**:知识截止日期因数据源而异(大部分数据源为2024-2025年) 3. **地域偏差**:框架与法规以西方体系为主 4. **快速迭代**:网络安全领域变化迅速,需定期更新数据集 5. **复杂度平衡**:部分主题对普通从业者而言可能过于专业 --- ## 📚 引用 bibtex @dataset{trendyol_2025_cybersec_v2, author = {{Trendyol Security Team}}, title = {Trendyol Cybersecurity Defense Instruction-Tuning Dataset v2.0}, year = {2025}, month = {7}, publisher = {Hugging Face}, version = {2.0.0}, } --- ## 🤝 贡献指南 我们欢迎网络安全社区的贡献!请确保: - ✅ 以防御型安全为主题 - ✅ 内容技术准确并附有参考来源 - ✅ 符合数据集的字段规范 - ✅ 通过质量检查 - ✅ 包含适当的文档说明 --- ## 🙏 致谢 特别感谢全球网络安全社区、安全研究人员与开源贡献者,正是他们的集体努力促成了本数据集的完成。本数据集基于防御性安全实践数十年的积累知识构建而成。 --- ## 📜 更新日志 - **v2.0.0**(2025年7月30日):大幅扩展至5.3万余条样本,覆盖200余个主题,新增特定平台相关内容 --- <div align="center"> <i>通过负责任的AI与协作式安全情报,共建更安全的数字未来。</i> </div>
提供机构:
maas
创建时间:
2025-08-01
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作