invinciblejha01/Trendyol-Cybersecurity-Instruction-Tuning-Dataset
收藏Hugging Face2026-04-14 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/invinciblejha01/Trendyol-Cybersecurity-Instruction-Tuning-Dataset
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- cybersecurity
- defensive-security
- instruction-tuning
- threat-intelligence
- incident-response
- security-operations
pretty_name: Trendyol Cybersecurity Defense Dataset
size_categories:
- 10K<n<100K
dataset_info:
version: 1.0.0
---
# Trendyol Cybersecurity Defense Instruction-Tuning Dataset (v2.0)
<div align="center">
<img src="https://img.shields.io/badge/rows-53,202-brightgreen" alt="Dataset Size">
<img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="License">
<img src="https://img.shields.io/badge/language-English-red" alt="Language">
<img src="https://img.shields.io/badge/version-2.0.0-orange" alt="Version">
</div>
## 🚀 TL;DR
**53,202** meticulously curated *system/user/assistant* instruction-tuning examples covering **200+ specialized cybersecurity domains**. Built by the Trendyol Security Team for training state-of-the-art defensive security AI assistants. Expanded from 21K to 53K rows with comprehensive coverage of modern security challenges including cloud-native threats, AI/ML security, quantum computing risks, and advanced incident response techniques.
---
## 📊 What's New in v2.0 (2025-07-30)
| Metric | v1.1 | v2.0 | Change |
|--------|------|------|--------|
| **Total Rows** | 21,258 | **53,202** | +150.3% |
| **Unique Topics** | 50+ | **200+** | +300% |
| **Coverage Depth** | Basic-Intermediate | **Basic-Expert** | Enhanced |
| **Specialized Domains** | Traditional Security | **+ AI/ML, Quantum, Cloud-Native, OT/ICS** | Expanded |
| **Framework Integration** | MITRE ATT&CK, NIST | **+ STIX/TAXII, Diamond Model, Zero Trust** | Comprehensive |
| **Platform Specific** | Generic | **+ macOS, Cloud Providers, Container Orchestration** | Targeted |
### 🎯 Major Additions in v2.0
- **Advanced Threat Intelligence**: 5G networks, AI-powered analysis, quantum computing threats
- **Cloud-Native Security**: Kubernetes forensics, serverless security, multi-cloud environments
- **Emerging Technologies**: Post-quantum cryptography, DNA computing security, metamaterial computing
- **Platform-Specific**: Deep macOS security analysis, cloud provider-specific forensics
- **Operational Excellence**: SOAR automation, threat hunting metrics, incident response orchestration
---
## 📋 Dataset Summary
| Property | Value |
|----------|-------|
| **Language** | English |
| **License** | Apache 2.0 |
| **Format** | Parquet (optimized columnar storage) |
| **Total Rows** | 53,202 |
| **Columns** | `system`, `user`, `assistant` |
| **Splits** | `train` (90%), `validation` (5%), `test` (5%) |
| **Average Response Length** | ~700 tokens |
| **Compression Ratio** | 0.72 |
### 📊 Topic Distribution
```
Cloud Security & DevSecOps : 18.5%
Threat Intelligence & Hunting : 16.2%
Incident Response & Forensics : 14.8%
AI/ML Security : 12.3%
Network & Protocol Security : 11.7%
Identity & Access Management : 9.4%
Emerging Technologies : 8.6%
Platform-Specific Security : 5.3%
Compliance & Governance : 3.2%
```
---
## 🏗️ Dataset Structure
### Fields Description
| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `system` | *string* | Role definition with ethical guidelines | "You are an expert cybersecurity professional..." |
| `user` | *string* | Realistic security question/scenario | "How can I detect API gateway abuse in microservices?" |
| `assistant` | *string* | Comprehensive technical response | "API gateway abuse detection requires multi-layered..." |
### Data Splits
```python
{
"train": 47,882, # 90%
"validation": 2,660, # 5%
"test": 2,660 # 5%
}
```
---
## 🔬 Dataset Creation Process
### 1. **Advanced Content Curation** (500K+ sources)
- Technical blogs, security advisories, CVE databases
- Academic papers, conference proceedings (BlackHat, DEF CON, RSA)
- Industry reports, threat intelligence feeds
- Platform-specific documentation (AWS, Azure, GCP, macOS)
- Regulatory frameworks and compliance standards
### 2. **Multi-Stage Processing Pipeline**
```
Raw Content → Language Detection → Topic Classification →
Instruction Synthesis → Quality Validation → Expert Review →
Ethical Filtering → Final Dataset
```
### 3. **Quality Assurance Framework**
- **Automated Checks**: Grammar, technical accuracy, response completeness
- **Deduplication**: Advanced MinHash LSH with semantic similarity
- **Hallucination Detection**: Fact-checking against authoritative sources
- **Ethical Compliance**: Offensive content filtering, dual-use prevention
- **Expert Validation**: 10% manual review by security professionals
### 4. **Topic Coverage Validation**
- Comprehensive mapping to industry frameworks (MITRE ATT&CK, NIST, ISO 27001)
- Cross-reference with current threat landscape report1
- Validation against real-world incident patterns
---
## 💻 Usage Examples
### Basic Loading
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="train")
# Load specific split
val_dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="validation")
# First example
print(f"System: {dataset[0]['system']}")
print(f"User: {dataset[0]['user']}")
print(f"Assistant: {dataset[0]['assistant']}")
```
### Fine-Tuning Configuration
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model
# Model configuration
model = AutoModelForCausalLM.from_pretrained("base-model-name", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("base-model-name")
# LoRA configuration for efficient fine-tuning
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)
# Training configuration
training_args = TrainingArguments(
output_dir="./cybersec-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
warmup_steps=100,
logging_steps=25,
save_strategy="epoch",
evaluation_strategy="epoch",
learning_rate=2e-4,
bf16=True,
gradient_checkpointing=True,
)
# Initialize trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
max_seq_length=4096,
dataset_text_field="text", # Concatenated field
)
```
---
## 🎯 Specialized Coverage Areas
### 🔐 Advanced Topics Included
1. **Cloud-Native Security**
- Multi-cloud forensics and incident response
- Container and Kubernetes security
- Serverless and FaaS security patterns
- Cloud-native application protection (CNAPP)
2. **AI/ML Security**
- Adversarial machine learning defense
- Model poisoning detection
- AI-powered threat intelligence
- Federated learning for threat sharing
3. **Emerging Threats**
- 5G network security and edge computing
- Quantum computing threat landscape
- Post-quantum cryptography implementation
- Supply chain security automation
4. **Platform-Specific Security**
- macOS security internals and forensics
- Cloud provider-specific security controls
- OT/ICS and critical infrastructure protection
- Mobile and IoT security frameworks
---
## ⚖️ Ethical Considerations
### Responsible AI Guidelines
- **Defensive Focus**: All content emphasizes protection and defense, never attack techniques
- **Refusal Patterns**: Built-in responses for rejecting malicious requests
- **Dual-Use Prevention**: Careful curation to avoid enabling harmful activities
- **Privacy Protection**: No PII or sensitive organizational data included
- **Bias Mitigation**: Balanced representation across vendors, platforms, and methodologies
### Usage Restrictions
- Not for developing offensive security tools
- Not for bypassing security controls
- Not for unauthorized access or exploitation
- Must comply with local laws and regulations
---
## 🚧 Known Limitations
1. **Language**: English-only (multilingual expansion planned)
2. **Temporal**: Knowledge cutoff varies by source (majority 2024-2025)
3. **Geographic Bias**: Western-centric frameworks and regulations
4. **Rapid Evolution**: Security landscape changes require regular updates
5. **Complexity Balance**: Some topics may be too advanced for general practitioners
---
## 📚 Citation
```bibtex
@dataset{trendyol_2025_cybersec_v2,
author = {{Trendyol Security Team}},
title = {Trendyol Cybersecurity Defense Instruction-Tuning Dataset v2.0},
year = {2025},
month = {7},
publisher = {Hugging Face},
version = {2.0.0},
}
```
---
## 🤝 Contributing
We welcome contributions from the security community! Please ensure:
- ✅ Defensive security focus
- ✅ Technical accuracy with references
- ✅ Follows the dataset schema
- ✅ Passes quality checks
- ✅ Includes appropriate documentation
---
## 🙏 Acknowledgments
Special thanks to the global cybersecurity community, security researchers, and open-source contributors who made this dataset possible. This work builds upon decades of collective knowledge in defensive security practices.
---
## 📜 Changelog
- **v2.0.0** (2025-07-30): Major expansion to 53K+ examples, 200+ topics, platform-specific content
---
<div align="center">
<i>Building a safer digital future through responsible AI and collaborative security intelligence.</i>
</div>
---
license: Apache 2.0许可证
task_categories:
- 文本生成
- 问答
language:
- 英语
tags:
- 网络安全
- 防御安全
- 指令微调(instruction-tuning)
- 威胁情报(threat-intelligence)
- 事件响应(incident-response)
- 安全运营(security-operations)
pretty_name: Trendyol网络安全防御数据集
size_categories:
- 10K<n<100K
dataset_info:
version: 1.0.0
---
# Trendyol网络安全防御指令微调数据集(v2.0)
<div align="center">
<img src="https://img.shields.io/badge/rows-53,202-brightgreen" alt="数据集规模">
<img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="许可证">
<img src="https://img.shields.io/badge/language-English-red" alt="语言">
<img src="https://img.shields.io/badge/version-2.0.0-orange" alt="版本">
</div>
## 🚀 核心摘要
**53,202条**精心遴选的*系统/用户/助手*格式指令微调示例,覆盖**200+细分网络安全领域**。该数据集由Trendyol安全团队打造,用于训练前沿防御安全AI智能体。数据集从21,000行扩展至53,000行,全面涵盖现代安全挑战,包括云原生威胁、AI/ML安全、量子计算风险与高级事件响应技术。
---
## 📊 v2.0版本更新说明(2025-07-30)
| 指标 | v1.1 | v2.0 | 变更幅度 |
|--------|------|------|--------|
| **总样本数** | 21,258 | **53,202** | +150.3% |
| **唯一主题数** | 50+ | **200+** | +300% |
| **覆盖深度** | 基础-中级 | **基础-专家级** | 全面升级 |
| **专业领域** | 传统安全 | **新增AI/ML、量子计算、云原生、OT/ICS** | 大幅扩展 |
| **框架集成** | MITRE ATT&CK、NIST | **新增STIX/TAXII、钻石模型(Diamond Model)、零信任(Zero Trust)** | 覆盖更全面 |
| **特定平台支持** | 通用场景 | **新增macOS、云厂商、容器编排平台** | 针对性更强 |
### 🎯 v2.0主要新增内容
- **高级威胁情报**:5G网络、AI驱动分析、量子计算威胁
- **云原生安全**:Kubernetes取证、无服务器安全、多云环境
- **新兴技术安全**:后量子密码学、DNA计算安全、超材料计算安全
- **特定平台安全**:深度macOS安全分析、云厂商专属取证
- **运营优化**:安全编排自动化与响应(SOAR)自动化、威胁狩猎指标、事件响应编排
---
## 📋 数据集概览
| 属性 | 取值 |
|----------|-------|
| **语言** | 英语 |
| **许可证** | Apache 2.0 |
| **存储格式** | Parquet(优化列式存储) |
| **总样本数** | 53,202 |
| **字段** | `system`、`user`、`assistant` |
| **数据划分** | 训练集(90%)、验证集(5%)、测试集(5%) |
| **平均响应长度** | ~700个Token(Token) |
| **压缩比** | 0.72 |
### 📊 主题分布
云安全与DevSecOps : 18.5%
威胁情报与狩猎 : 16.2%
事件响应与取证 : 14.8%
AI/ML安全 : 12.3%
网络与协议安全 : 11.7%
身份与访问管理 : 9.4%
新兴技术安全 : 8.6%
特定平台安全 : 5.3%
合规与治理 : 3.2%
---
## 🏗️ 数据集结构
### 字段说明
| 字段 | 类型 | 描述 | 示例 |
|-------|------|-------------|---------|
| `system` | *字符串* | 角色定义与伦理准则 | "You are an expert cybersecurity professional..." |
| `user` | *字符串* | 贴合实际的安全问题/场景 | "How can I detect API gateway abuse in microservices?" |
| `assistant` | *字符串* | 全面的技术响应 | "API gateway abuse detection requires multi-layered..." |
### 数据划分
python
{
"train": 47,882, # 90%
"validation": 2,660, # 5%
"test": 2,660 # 5%
}
---
## 🔬 数据集构建流程
### 1. **高级内容遴选**(50万+数据源)
- 技术博客、安全公告、CVE数据库
- 学术论文、国际安全会议论文(BlackHat、DEF CON、RSA)
- 行业报告、威胁情报源
- 特定平台官方文档(AWS、Azure、GCP、macOS)
- 监管框架与合规标准
### 2. **多阶段处理流水线**
原始内容 → 语言检测 → 主题分类 →
指令合成 → 质量验证 → 专家评审 →
伦理过滤 → 最终数据集
### 3. **质量保障框架**
- **自动化检查**:语法校验、技术准确性核查、响应完整性验证
- **去重处理**:基于语义相似度的高级MinHash LSH算法
- **幻觉检测**:针对权威来源的事实核查
- **伦理合规**:过滤攻击性内容、防范两用技术滥用
- **专家验证**:10%样本由安全专业人员手动审核
### 4. **主题覆盖验证**
- 全面映射至行业标准框架(MITRE ATT&CK、NIST、ISO 27001)
- 与当前威胁态势报告交叉验证
- 基于真实世界事件模式进行校验
---
## 💻 使用示例
### 基础加载方式
python
from datasets import load_dataset
# 加载完整训练集
dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="train")
# 加载指定划分数据集
val_dataset = load_dataset("TrendyolSecurity/cybersecurity-defense-v2", split="validation")
# 查看第一条样本
print(f"System: {dataset[0]['system']}")
print(f"User: {dataset[0]['user']}")
print(f"Assistant: {dataset[0]['assistant']}")
### 微调配置示例
python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model
# 模型配置
model = AutoModelForCausalLM.from_pretrained("base-model-name", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("base-model-name")
# 用于高效微调的LoRA配置
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)
# 训练参数配置
training_args = TrainingArguments(
output_dir="./cybersec-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
warmup_steps=100,
logging_steps=25,
save_strategy="epoch",
evaluation_strategy="epoch",
learning_rate=2e-4,
bf16=True,
gradient_checkpointing=True,
)
# 初始化训练器
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
max_seq_length=4096,
dataset_text_field="text", # 拼接后的字段
)
---
## 🎯 专业覆盖领域
### 🔐 包含的高级主题
1. **云原生安全**
- 多云取证与事件响应
- 容器与Kubernetes安全
- 无服务器与FaaS安全模式
- 云原生应用保护(CNAPP)
2. **AI/ML安全**
- 对抗机器学习防御
- 模型投毒检测
- AI驱动威胁情报
- 用于威胁共享的联邦学习
3. **新兴威胁**
- 5G网络安全与边缘计算
- 量子计算威胁态势
- 后量子密码学实现
- 供应链安全自动化
4. **特定平台安全**
- macOS安全内核与取证
- 云厂商专属安全控制
- OT/ICS与关键基础设施防护
- 移动与IoT安全框架
---
## ⚖️ 伦理考量
### 负责任AI指南
- **防御导向**:所有内容均强调防护与防御,绝不涉及攻击技术
- **拒绝机制**:内置响应逻辑以拒绝恶意请求
- **两用技术防范**:精心遴选内容,避免助力有害活动
- **隐私保护**:不包含个人可识别信息(PII)或敏感组织数据
- **偏差缓解**:在厂商、平台与方法论间保持均衡覆盖
### 使用限制
- 不得用于开发攻击性安全工具
- 不得用于绕过安全控制
- 不得用于未授权访问或利用行为
- 必须遵守当地法律法规
---
## 🚧 已知局限性
1. **语言限制**:仅支持英语(计划推出多语言版本)
2. **时效性**:知识截止日期因数据源而异(大部分内容截至2024-2025年)
3. **地域偏差**:以西方式框架与监管标准为主
4. **领域快速演变**:安全领域变化迅速,需定期更新数据集
5. **复杂度平衡**:部分主题对普通安全从业者可能过于专业
---
## 📚 引用格式
bibtex
@dataset{trendyol_2025_cybersec_v2,
author = {{Trendyol安全团队}},
title = {Trendyol网络安全防御指令微调数据集v2.0},
year = {2025},
month = {7},
publisher = {Hugging Face},
version = {2.0.0},
}
---
## 🤝 贡献指南
我们欢迎安全社区的贡献!请确保:
- ✅ 内容聚焦防御安全
- ✅ 技术准确且附带引用
- ✅ 遵循数据集字段规范
- ✅ 通过质量检查
- ✅ 包含适当文档说明
---
## 🙏 致谢
特别感谢全球网络安全社区、安全研究者与开源贡献者,本数据集基于数十年防御安全实践的集体知识构建。
---
## 📜 变更日志
- **v2.0.0**(2025-07-30):大幅扩展至53,000+样本、200+主题,新增特定平台内容
---
<div align="center">
<i>通过负责任的AI与协作安全情报,构建更安全的数字未来。</i>
</div>
提供机构:
invinciblejha01



