Trendyol-Cybersecurity-Instruction-Tuning-Dataset
收藏魔搭社区2026-01-06 更新2025-08-16 收录
下载链接:
https://modelscope.cn/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset
下载链接
链接失效反馈官方服务:
资源简介:
# Trendyol Cybersecurity Defense Instruction-Tuning Dataset (v2.0)
<div align="center">
<img src="https://img.shields.io/badge/rows-53,202-brightgreen" alt="Dataset Size">
<img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="License">
<img src="https://img.shields.io/badge/language-English-red" alt="Language">
<img src="https://img.shields.io/badge/version-2.0.0-orange" alt="Version">
</div>
## 🚀 TL;DR
**53,202** meticulously curated *system/user/assistant* instruction-tuning examples covering **200+ specialized cybersecurity domains**. Built by the Trendyol Security Team for training state-of-the-art defensive security AI assistants. Expanded from 21K to 53K rows with comprehensive coverage of modern security challenges including cloud-native threats, AI/ML security, quantum computing risks, and advanced incident response techniques.
---
## 📊 What's New in v2.0 (2025-07-30)
| Metric | v1.1 | v2.0 | Change |
|--------|------|------|--------|
| **Total Rows** | 21,258 | **53,202** | +150.3% |
| **Unique Topics** | 50+ | **200+** | +300% |
| **Coverage Depth** | Basic-Intermediate | **Basic-Expert** | Enhanced |
| **Specialized Domains** | Traditional Security | **+ AI/ML, Quantum, Cloud-Native, OT/ICS** | Expanded |
| **Framework Integration** | MITRE ATT&CK, NIST | **+ STIX/TAXII, Diamond Model, Zero Trust** | Comprehensive |
| **Platform Specific** | Generic | **+ macOS, Cloud Providers, Container Orchestration** | Targeted |
### 🎯 Major Additions in v2.0
- **Advanced Threat Intelligence**: 5G networks, AI-powered analysis, quantum computing threats
- **Cloud-Native Security**: Kubernetes forensics, serverless security, multi-cloud environments
- **Emerging Technologies**: Post-quantum cryptography, DNA computing security, metamaterial computing
- **Platform-Specific**: Deep macOS security analysis, cloud provider-specific forensics
- **Operational Excellence**: SOAR automation, threat hunting metrics, incident response orchestration
---
## 📋 Dataset Summary
| Property | Value |
|----------|-------|
| **Language** | English |
| **License** | Apache 2.0 |
| **Format** | Parquet (optimized columnar storage) |
| **Total Rows** | 53,202 |
| **Columns** | `system`, `user`, `assistant` |
| **Splits** | `train` (90%), `validation` (5%), `test` (5%) |
| **Average Response Length** | ~700 tokens |
| **Compression Ratio** | 0.72 |
### 📊 Topic Distribution
```
Cloud Security & DevSecOps : 18.5%
Threat Intelligence & Hunting : 16.2%
Incident Response & Forensics : 14.8%
AI/ML Security : 12.3%
Network & Protocol Security : 11.7%
Identity & Access Management : 9.4%
Emerging Technologies : 8.6%
Platform-Specific Security : 5.3%
Compliance & Governance : 3.2%
```
---
## 🏗️ Dataset Structure
### Fields Description
| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `system` | *string* | Role definition with ethical guidelines | "You are an expert cybersecurity professional..." |
| `user` | *string* | Realistic security question/scenario | "How can I detect API gateway abuse in microservices?" |
| `assistant` | *string* | Comprehensive technical response | "API gateway abuse detection requires multi-layered..." |
### Data Splits
```python
{
"train": 47,882, # 90%
"validation": 2,660, # 5%
"test": 2,660 # 5%
}
```
---
## 🔬 Dataset Creation Process
### 1. **Advanced Content Curation** (500K+ sources)
- Technical blogs, security advisories, CVE databases
- Academic papers, conference proceedings (BlackHat, DEF CON, RSA)
- Industry reports, threat intelligence feeds
- Platform-specific documentation (AWS, Azure, GCP, macOS)
- Regulatory frameworks and compliance standards
### 2. **Multi-Stage Processing Pipeline**
```
Raw Content → Language Detection → Topic Classification →
Instruction Synthesis → Quality Validation → Expert Review →
Ethical Filtering → Final Dataset
```
### 3. **Quality Assurance Framework**
- **Automated Checks**: Grammar, technical accuracy, response completeness
- **Deduplication**: Advanced MinHash LSH with semantic similarity
- **Hallucination Detection**: Fact-checking against authoritative sources
- **Ethical Compliance**: Offensive content filtering, dual-use prevention
- **Expert Validation**: 10% manual review by security professionals
### 4. **Topic Coverage Validation**
- Comprehensive mapping to industry frameworks (MITRE ATT&CK, NIST, ISO 27001)
- Cross-reference with current threat landscape report1
- Validation against real-world incident patterns
---
## 💻 Usage Examples
### Basic Loading
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="train")
# Load specific split
val_dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="validation")
# First example
print(f"System: {dataset[0]['system']}")
print(f"User: {dataset[0]['user']}")
print(f"Assistant: {dataset[0]['assistant']}")
```
### Fine-Tuning Configuration
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model
# Model configuration
model = AutoModelForCausalLM.from_pretrained("base-model-name", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("base-model-name")
# LoRA configuration for efficient fine-tuning
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)
# Training configuration
training_args = TrainingArguments(
output_dir="./cybersec-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
warmup_steps=100,
logging_steps=25,
save_strategy="epoch",
evaluation_strategy="epoch",
learning_rate=2e-4,
bf16=True,
gradient_checkpointing=True,
)
# Initialize trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
max_seq_length=4096,
dataset_text_field="text", # Concatenated field
)
```
---
## 🎯 Specialized Coverage Areas
### 🔐 Advanced Topics Included
1. **Cloud-Native Security**
- Multi-cloud forensics and incident response
- Container and Kubernetes security
- Serverless and FaaS security patterns
- Cloud-native application protection (CNAPP)
2. **AI/ML Security**
- Adversarial machine learning defense
- Model poisoning detection
- AI-powered threat intelligence
- Federated learning for threat sharing
3. **Emerging Threats**
- 5G network security and edge computing
- Quantum computing threat landscape
- Post-quantum cryptography implementation
- Supply chain security automation
4. **Platform-Specific Security**
- macOS security internals and forensics
- Cloud provider-specific security controls
- OT/ICS and critical infrastructure protection
- Mobile and IoT security frameworks
---
## ⚖️ Ethical Considerations
### Responsible AI Guidelines
- **Defensive Focus**: All content emphasizes protection and defense, never attack techniques
- **Refusal Patterns**: Built-in responses for rejecting malicious requests
- **Dual-Use Prevention**: Careful curation to avoid enabling harmful activities
- **Privacy Protection**: No PII or sensitive organizational data included
- **Bias Mitigation**: Balanced representation across vendors, platforms, and methodologies
### Usage Restrictions
- Not for developing offensive security tools
- Not for bypassing security controls
- Not for unauthorized access or exploitation
- Must comply with local laws and regulations
---
## 🚧 Known Limitations
1. **Language**: English-only (multilingual expansion planned)
2. **Temporal**: Knowledge cutoff varies by source (majority 2024-2025)
3. **Geographic Bias**: Western-centric frameworks and regulations
4. **Rapid Evolution**: Security landscape changes require regular updates
5. **Complexity Balance**: Some topics may be too advanced for general practitioners
---
## 📚 Citation
```bibtex
@dataset{trendyol_2025_cybersec_v2,
author = {{Trendyol Security Team}},
title = {Trendyol Cybersecurity Defense Instruction-Tuning Dataset v2.0},
year = {2025},
month = {7},
publisher = {Hugging Face},
version = {2.0.0},
}
```
---
## 🤝 Contributing
We welcome contributions from the security community! Please ensure:
- ✅ Defensive security focus
- ✅ Technical accuracy with references
- ✅ Follows the dataset schema
- ✅ Passes quality checks
- ✅ Includes appropriate documentation
---
## 🙏 Acknowledgments
Special thanks to the global cybersecurity community, security researchers, and open-source contributors who made this dataset possible. This work builds upon decades of collective knowledge in defensive security practices.
---
## 📜 Changelog
- **v2.0.0** (2025-07-30): Major expansion to 53K+ examples, 200+ topics, platform-specific content
---
<div align="center">
<i>Building a safer digital future through responsible AI and collaborative security intelligence.</i>
</div>
# Trendyol 网络安全防御指令微调数据集(v2.0)
<div align="center">
<img src="https://img.shields.io/badge/rows-53,202-brightgreen" alt="数据集规模">
<img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="许可证">
<img src="https://img.shields.io/badge/language-English-red" alt="语言">
<img src="https://img.shields.io/badge/version-2.0.0-orange" alt="版本">
</div>
## 🚀 速览(Too Long Didn't Read,TL;DR)
**53,202** 条经过严谨筛选与整理的*系统(system)/用户(user)/助手(assistant)*格式的指令微调样本,覆盖**200余个细分网络安全领域**。本数据集由Trendyol安全团队构建,用于训练最先进的防御型网络安全AI智能体(AI Agent)。该数据集从2.1万条扩展至5.3万条,全面覆盖现代安全挑战,包括云原生威胁、AI/ML安全、量子计算风险以及高级事件响应技术。
---
## 📊 v2.0版本更新内容(2025年7月30日)
| 指标 | v1.1 | v2.0 | 变化幅度 |
|--------|------|------|--------|
| **总样本数** | 21,258 | **53,202** | +150.3% |
| **唯一主题数** | 50+ | **200+** | +300% |
| **覆盖深度** | 基础-中级 | **基础-专家级** | 覆盖范围升级 |
| **细分领域** | 传统安全领域 | **新增AI/ML、量子计算、云原生、OT/ICS领域** | 领域边界大幅扩展 |
| **框架集成** | MITRE ATT&CK、NIST | **新增STIX/TAXII、钻石模型(Diamond Model)、零信任(Zero Trust)** | 集成覆盖更全面 |
| **特定平台覆盖** | 通用场景 | **新增macOS、云服务商、容器编排平台** | 覆盖更具针对性 |
### 🎯 v2.0版本新增核心内容
- **高级威胁情报**:5G网络、AI驱动分析、量子计算威胁
- **云原生安全**:Kubernetes取证、无服务器安全、多云环境
- **新兴技术安全**:后量子密码学、DNA计算安全、超材料计算安全
- **特定平台安全**:深度macOS安全分析、云服务商专属取证
- **运营优化**:安全编排自动化与响应(SOAR)自动化、威胁狩猎指标、事件响应编排
---
## 📋 数据集概览
| 属性 | 取值 |
|----------|-------|
| **语言** | 英语 |
| **许可证** | Apache 2.0 |
| **存储格式** | Parquet(列式优化存储) |
| **总样本数** | 53,202 |
| **字段** | `system`、`user`、`assistant` |
| **数据集划分** | 训练集(train,90%)、验证集(validation,5%)、测试集(test,5%) |
| **平均响应长度** | 约700个Token(Token) |
| **压缩比** | 0.72 |
### 📊 主题分布
云安全与DevSecOps : 18.5%
威胁情报与狩猎 : 16.2%
事件响应与取证 : 14.8%
AI/ML安全 : 12.3%
网络与协议安全 : 11.7%
身份与访问管理 : 9.4%
新兴技术安全 : 8.6%
特定平台安全 : 5.3%
合规与治理 : 3.2%
---
## 🏗️ 数据集结构
### 字段说明
| 字段 | 数据类型 | 说明 | 示例 |
|-------|------|-------------|---------|
| `system` | 字符串(string) | 带有伦理准则的角色定义 | "你是一名资深网络安全专家……" |
| `user` | 字符串(string) | 贴合实际的安全问题或场景 | "如何检测微服务中的API网关滥用行为?" |
| `assistant` | 字符串(string) | 完整的技术响应 | "API网关滥用检测需要多层级的……" |
### 数据集划分
python
{
"train": 47882, # 训练集:47882条,占比90%
"validation": 2660, # 验证集:2660条,占比5%
"test": 2660 # 测试集:2660条,占比5%
}
---
## 🔬 数据集构建流程
### 1. **高级内容筛选**(超50万个数据源)
- 技术博客、安全公告、通用漏洞披露(CVE)数据库
- 学术论文、会议论文集(BlackHat、DEF CON、RSA大会)
- 行业报告、威胁情报源
- 特定平台文档(AWS、Azure、GCP、macOS)
- 监管框架与合规标准
### 2. **多阶段处理流水线**
原始内容 → 语言检测 → 主题分类 →
指令合成 → 质量验证 → 专家审核 →
伦理过滤 → 最终数据集
### 3. **质量保障框架**
- **自动化检查**:语法、技术准确性、响应完整性
- **去重处理**:基于MinHash LSH的高级语义相似度去重
- **幻觉检测**:基于权威来源的事实校验
- **伦理合规**:攻击性内容过滤、两用技术防范
- **专家审核**:由安全专业人员完成10%的人工审核
### 4. **主题覆盖验证**
- 与行业框架(MITRE ATT&CK、NIST、ISO 27001)的全面映射
- 与当前威胁态势报告的交叉验证
- 与真实世界事件模式的比对验证
---
## 💻 使用示例
### 基础加载
python
from datasets import load_dataset
# 加载完整训练集
dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="train")
# 加载指定划分数据集
val_dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="validation")
# 查看第一条样本
print(f"系统提示:{dataset[0]['system']}")
print(f"用户提问:{dataset[0]['user']}")
print(f"助手回复:{dataset[0]['assistant']}")
### 微调配置
python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer # 监督微调训练器(Supervised Fine-Tuning Trainer)
from peft import LoraConfig, get_peft_model
# 模型配置
model = AutoModelForCausalLM.from_pretrained("基础模型名称", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("基础模型名称")
# 用于高效微调的LoRA(Low-Rank Adaptation,低秩适配)配置
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)
# 训练配置
training_args = TrainingArguments(
output_dir="./cybersec-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
warmup_steps=100,
logging_steps=25,
save_strategy="epoch",
evaluation_strategy="epoch",
learning_rate=2e-4,
bf16=True,
gradient_checkpointing=True,
)
# 初始化训练器
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
max_seq_length=4096,
dataset_text_field="text", # 拼接后的文本字段
)
---
## 🎯 专属覆盖领域
### 🔐 包含的高级主题
1. **云原生安全**
- 多云取证与事件响应
- 容器与Kubernetes安全
- 无服务器与函数即服务(FaaS)安全模式
- 云原生应用保护平台(CNAPP)
2. **AI/ML安全**
- 对抗性机器学习防御
- 模型投毒检测
- AI驱动的威胁情报
- 用于威胁共享的联邦学习
3. **新兴威胁**
- 5G网络安全与边缘计算
- 量子计算威胁态势
- 后量子密码学实施
- 供应链安全自动化
4. **特定平台安全**
- macOS安全内核与取证
- 云服务商专属安全控制
- OT/ICS与关键基础设施防护
- 移动与IoT安全框架
---
## ⚖️ 伦理考量
### 负责任AI准则
- **防御导向**:所有内容均强调防护与防御,绝不涉及攻击技术
- **拒绝机制**:内置针对恶意请求的拒绝响应模板
- **两用技术防范**:经过严谨筛选,避免内容被用于有害活动
- **隐私保护**:未包含任何个人可识别信息(PII)或敏感组织数据
- **偏差缓解**:在厂商、平台与方法论间保持平衡的样本分布
### 使用限制
- 不得用于开发攻击性安全工具
- 不得用于绕过安全控制
- 不得用于未经授权的访问或漏洞利用
- 必须遵守当地法律法规
---
## 🚧 已知局限性
1. **语言限制**:仅支持英语(计划推出多语言版本)
2. **时效性**:知识截止日期因数据源而异(大部分数据源为2024-2025年)
3. **地域偏差**:框架与法规以西方体系为主
4. **快速迭代**:网络安全领域变化迅速,需定期更新数据集
5. **复杂度平衡**:部分主题对普通从业者而言可能过于专业
---
## 📚 引用
bibtex
@dataset{trendyol_2025_cybersec_v2,
author = {{Trendyol Security Team}},
title = {Trendyol Cybersecurity Defense Instruction-Tuning Dataset v2.0},
year = {2025},
month = {7},
publisher = {Hugging Face},
version = {2.0.0},
}
---
## 🤝 贡献指南
我们欢迎网络安全社区的贡献!请确保:
- ✅ 以防御型安全为主题
- ✅ 内容技术准确并附有参考来源
- ✅ 符合数据集的字段规范
- ✅ 通过质量检查
- ✅ 包含适当的文档说明
---
## 🙏 致谢
特别感谢全球网络安全社区、安全研究人员与开源贡献者,正是他们的集体努力促成了本数据集的完成。本数据集基于防御性安全实践数十年的积累知识构建而成。
---
## 📜 更新日志
- **v2.0.0**(2025年7月30日):大幅扩展至5.3万余条样本,覆盖200余个主题,新增特定平台相关内容
---
<div align="center">
<i>通过负责任的AI与协作式安全情报,共建更安全的数字未来。</i>
</div>
提供机构:
maas
创建时间:
2025-08-01



