witfoo/precinct6-cybersecurity-100m
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/witfoo/precinct6-cybersecurity-100m
---
license: apache-2.0
task_categories:
- text-classification
- graph-ml
language:
- en
tags:
- cybersecurity
- intrusion-detection
- provenance-graphs
- MITRE-ATT&CK
- SOAR
- security-operations
- IDS
- network-security
- threat-detection
- labeled-dataset
- lead-rules
size_categories:
- 100M<n<1B
dataset_info:
- config_name: signals
splits:
- name: train
num_examples: 114074530
configs:
- config_name: signals
data_files:
- split: train
path: signals/*.parquet
- config_name: graph_nodes
data_files:
- split: train
path: graph/nodes.jsonl
- config_name: graph_edges
data_files:
- split: train
path: graph/edges.jsonl
- config_name: incidents
data_files:
- split: train
path: graph/incidents.jsonl
---
# WitFoo Precinct6 Cybersecurity Dataset (114M)
## Overview
A large-scale, labeled cybersecurity dataset derived from production Security Operations Center (SOC) data processed by [WitFoo Precinct](https://www.witfoo.com/) version 6.x. This dataset contains **114 million sanitized security events** (signal logs) and **provenance graphs** (10,442 incident graphs with 23,362 nodes and 32,732,650 edges) from real enterprise network monitoring across 5 organizations.
**Available in two sizes:**
- [`witfoo/precinct6-cybersecurity`](https://huggingface.co/datasets/witfoo/precinct6-cybersecurity) — 2M signals (smaller, faster to load)
- [`witfoo/precinct6-cybersecurity-100m`](https://huggingface.co/datasets/witfoo/precinct6-cybersecurity-100m) — **114M signals (this dataset)**
**Generate your own:** WitFoo Precinct 6.x customers can create datasets from their own data using the open-source pipeline: [`witfoo/dataset-from-precinct6`](https://github.com/witfoo/dataset-from-precinct6)
This dataset is designed to support research in:
- **Provenance graph-based intrusion detection** (KnowHow, NodLink, and similar systems)
- **AI-driven cyber defense simulation** (CybORG and MARL-based defense policy training)
- **Security alert classification** (malicious vs. suspicious vs. benign event labeling)
- **Attack lifecycle analysis** using MITRE ATT&CK framework mappings
- **Detection rule evaluation** using WitFoo's 261 lead detection rules
## Quick Start
```python
from datasets import load_dataset
# Load flat signal logs (114M rows across 58 Parquet shards)
signals = load_dataset("witfoo/precinct6-cybersecurity-100m", "signals", split="train")
# Find malicious events (from confirmed incidents)
malicious = signals.filter(lambda x: x["label_binary"] == "malicious")
# Find suspicious events (matched detection rules)
suspicious = signals.filter(lambda x: x["label_binary"] == "suspicious")
# Query by product/vendor
cisco_events = signals.filter(lambda x: x["vendor_name"] == "Cisco")
# Load provenance graph
nodes = load_dataset("witfoo/precinct6-cybersecurity-100m", "graph_nodes", split="train")
edges = load_dataset("witfoo/precinct6-cybersecurity-100m", "graph_edges", split="train")
# Load full incident graphs
incidents = load_dataset("witfoo/precinct6-cybersecurity-100m", "incidents", split="train")
```
## Label Distribution
| Label | Count | Percentage |
|-------|-------|------------|
| `benign` | 113,326,050 | 99.34% |
| `malicious` | 125,780 | 0.11% |
| `suspicious` | 622,700 | 0.55% |
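With a skew this strong, a weighted loss is one common way to train on all rows without resampling. The sketch below derives inverse-frequency class weights from the counts in the table above. The weighting formula (`total / (n_classes * count)`) is a widely used heuristic chosen here for illustration; the dataset itself does not prescribe one.

```python
# Label counts copied from the table above.
LABEL_COUNTS = {
    "benign": 113_326_050,
    "malicious": 125_780,
    "suspicious": 622_700,
}


def inverse_frequency_weights(counts):
    """Return weight = total / (n_classes * count) for each label."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(LABEL_COUNTS)
total = sum(LABEL_COUNTS.values())
for label, weight in sorted(weights.items()):
    share = 100 * LABEL_COUNTS[label] / total
    print(f"{label:<10} share={share:6.2f}%  weight={weight:8.3f}")
```

The resulting dictionary can be passed to whatever class-weight option the training framework exposes.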
## Signal Columns
| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | float | Unix epoch timestamp |
| `message_type` | string | Event classification (e.g., `firewall_action`, `account_logon`, `AssumeRole`) |
| `stream_name` | string | Source product/data stream (see Source Products below) |
| `pipeline` | string | Ingestion pipeline |
| `src_ip` | string | Source IP (sanitized) |
| `dst_ip` | string | Destination IP (sanitized) |
| `src_port` | string | Source port |
| `dst_port` | string | Destination port |
| `protocol` | string | Network protocol (6=TCP, 17=UDP, 1=ICMP) |
| `src_host` | string | Source hostname (sanitized) |
| `dst_host` | string | Destination hostname (sanitized) |
| `username` | string | Associated username (sanitized) |
| `action` | string | Event action (block, permit, logon, logoff) |
| `severity` | string | Severity level |
| `vendor_code` | string | Vendor-specific event code |
| `message_sanitized` | string | Full sanitized raw log message |
| `label_binary` | string | `malicious`, `suspicious`, or `benign` (three values, despite the column name) |
| `label_confidence` | float | Confidence score (0.0–1.0) |
| `suspicion_score` | float | WitFoo suspicion score (0.0–1.0) |
| `mo_name` | string | Modus operandi (e.g., `Data Theft`) |
| `lifecycle_stage` | string | Kill chain stage (e.g., `initial-compromise`, `complete-mission`) |
| `matched_rules` | string | JSON array of matched WitFoo lead rule descriptions |
| `set_roles` | string | JSON array of classification roles (e.g., `Exploiting Host`, `C2 Server`) |
| `product_name` | string | Security product name (e.g., `ASA Firewall`, `Falcon`) |
| `vendor_name` | string | Product vendor (e.g., `Cisco`, `Crowdstrike`) |
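Several columns store structured values as strings: `matched_rules` and `set_roles` hold JSON arrays, ports are strings, and `protocol` uses IANA protocol numbers. A minimal sketch of decoding one row follows. The sample row is invented for illustration, and treating empty or missing values as "no entries" is an assumption about how absent data is represented.

```python
import json

# IANA protocol numbers listed in the column table.
PROTOCOL_NAMES = {"6": "TCP", "17": "UDP", "1": "ICMP"}


def decode_signal(row):
    """Return a copy of `row` with string-encoded fields parsed."""
    decoded = dict(row)
    for column in ("matched_rules", "set_roles"):
        raw = row.get(column)
        # Assumption: a missing or empty value means "no entries".
        decoded[column] = json.loads(raw) if raw else []
    for column in ("src_port", "dst_port"):
        raw = row.get(column)
        decoded[column] = int(raw) if raw and raw.isdigit() else None
    proto = row.get("protocol")
    # Unknown protocol numbers are passed through unchanged.
    decoded["protocol_name"] = PROTOCOL_NAMES.get(proto, proto)
    return decoded


# Hypothetical row, invented for illustration (not taken from the dataset).
sample = {
    "src_port": "51515",
    "dst_port": "443",
    "protocol": "6",
    "matched_rules": '["ASA Deny"]',
    "set_roles": '["Exploiting Host", "Exploiting Target"]',
}
decoded = decode_signal(sample)
print(decoded["protocol_name"], decoded["dst_port"], decoded["matched_rules"])
```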
## Source Products
The dataset contains events from **158 security products** across **70+ vendors**. The complete catalog is in `reference/lead_rules_catalog.json`.
| Category | Products |
|----------|----------|
| **Firewalls** | Cisco ASA, Palo Alto PAN NGFW, Fortinet FortiGate, Check Point, Meraki, SonicWall, pfSense, Barracuda, Juniper SRX |
| **Endpoint Protection** | CrowdStrike Falcon, Symantec SEP, Carbon Black, Cylance, SentinelOne, Deep Instinct, Sophos, McAfee, ESET |
| **Network Detection** | Cisco Stealthwatch, Cisco Firepower, Suricata IDS, TippingPoint IPS, Vectra Cognito |
| **Identity & Access** | Microsoft Windows AD, Cisco ISE, Centrify, CyberArk, Duo, Okta, BeyondTrust |
| **Cloud Security** | AWS CloudTrail, AWS VPC Flow Logs, AWS GuardDuty, Azure Security, Zscaler, Netskope, Cisco Umbrella |
| **Email Security** | Proofpoint, Mimecast, FireEye EX, Barracuda ESS, Cisco IronPort, Check Point Harmony |
| **Threat Intelligence** | FireEye NX/HX/AX/CMS, Trend Micro, QRadar, Microsoft ATA, Cortex XDR |
| **Infrastructure** | VMware vCenter/NSX, Elastic Filebeat, Linux (sshd, PAM, systemd, auditd), Apache, HAProxy |
## Labeling Methodology
**Three-tier labels** derived from two sources:
- **`malicious`** (125,780): Events embedded as leads inside confirmed incidents. Extracted directly from incident lead objects with suspicion scores, modus operandi, and MITRE mappings.
- **`suspicious`** (622,700): Events matching WitFoo's 261 lead detection rules (e.g., "ASA Deny", "Windows Failed Login", "CrowdStrike Detection") but not in confirmed incidents.
- **`benign`** (113,326,050): Events not matching any detection rules and not in any incident.
The `matched_rules` column shows which detection rules matched. The `set_roles` column shows WitFoo classification roles (Exploiting Host, C2 Server, Staging Target, etc.).
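The three tiers imply a simple precedence: incident membership first, then rule matches, then benign. The sketch below expresses that precedence as a function. The function name and its two inputs are hypothetical, and the actual WitFoo pipeline may derive labels differently.

```python
def assign_label(in_confirmed_incident, matched_rules):
    """Apply the three-tier precedence: incident > rule match > benign."""
    if in_confirmed_incident:
        return "malicious"
    if matched_rules:
        return "suspicious"
    return "benign"


# An event inside a confirmed incident is malicious even if rules also matched.
print(assign_label(True, ["ASA Deny"]))
print(assign_label(False, ["ASA Deny"]))
print(assign_label(False, []))
```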
## Lead Detection Rules
All 261 rules are included in `reference/lead_rules_catalog.json`. Examples:
| Rule | Criteria | Roles |
|------|----------|-------|
| Blocked Action | Any firewall block | Exploiting Host → Exploiting Target |
| ASA Deny | cisco_asa + deny | Exploiting Host → Exploiting Target |
| Windows Failed Login | Event ID 4625 | Exploiting Target → Exploiting Host |
| CrowdStrike Detection | CrowdStrike stream | Exploiting Target → Exploiting Host |
| AWS VPC Reject | VPC flow + REJECT | Exploiting Host → Exploiting Target |
| Authentication Failure | auth_failure type | Exploiting Host → Exploiting Target |
| Audit log cleared | Event ID 1102 | Exploiting Target → Exploiting Host |
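To make the table concrete, here is a sketch of how two of these rules might be evaluated against a signal row. It does not reflect the real schema of `reference/lead_rules_catalog.json`, which is not shown here. The `LeadRule` class, the choice of `stream_name`, `action`, and `vendor_code` as match fields, and the reading of "A → B" as source role then destination role are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LeadRule:
    name: str
    matches: Callable[[dict], bool]  # predicate over one signal row
    src_role: str
    dst_role: str


# Two rules paraphrased from the table; the field choices are assumptions.
RULES = [
    LeadRule(
        name="ASA Deny",
        matches=lambda r: r.get("stream_name") == "cisco_asa"
        and r.get("action") == "deny",
        src_role="Exploiting Host",
        dst_role="Exploiting Target",
    ),
    LeadRule(
        name="Windows Failed Login",
        matches=lambda r: r.get("vendor_code") == "4625",
        src_role="Exploiting Target",
        dst_role="Exploiting Host",
    ),
]


def evaluate(row):
    """Return (rule names, sorted roles) for every rule that matches `row`."""
    hits = [rule for rule in RULES if rule.matches(row)]
    names = [rule.name for rule in hits]
    roles = sorted({role for rule in hits for role in (rule.src_role, rule.dst_role)})
    return names, roles


names, roles = evaluate({"stream_name": "cisco_asa", "action": "deny"})
print(names, roles)
```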
## Sanitization
All customer-identifying information has been removed through a comprehensive, iterative four-layer sanitization pipeline. Quality was prioritized over processing speed. The pipeline is [open source](https://github.com/witfoo/dataset-from-precinct6) under the Apache 2.0 license.
1. **Structured field sanitization + Aho-Corasick multi-pattern sweep** — Known fields are replaced with deterministic tokens (IPs to [RFC 5737](https://datatracker.ietf.org/doc/html/rfc5737) ranges, hostnames to `HOST-NNNN`, etc.). Every record is then swept using an [Aho-Corasick automaton](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) built from all 302,000+ registry entries to catch PII in unexpected contexts. Product identifiers are explicitly protected from the sweep.
2. **Format-specific log message parsing** — Eight specialized parsers handle Cisco ASA syslog, Windows Security Event XML, WinLogBeat JSON, AWS CloudTrail, Palo Alto Networks, VMware vCenter, DNS logs, and a generic fallback — sanitizing PII within structured fields like XML elements, nested JSON, and CSV columns.
3. **Machine learning residual detection** — [Microsoft Presidio](https://microsoft.github.io/presidio/) (spaCy NLP) and [BERT NER](https://huggingface.co/dslim/bert-base-NER) scan sanitized records for residual PII that pattern-based approaches miss. New discoveries trigger full re-sanitization.
4. **Large language model contextual review** — [Claude](https://www.anthropic.com/claude) reviews stratified samples for subtle PII: embedded org names, internal hostnames revealing organizational structure, employee names in file paths, AD group names, and geographic identifiers. Findings trigger re-sanitization.
**Iterative convergence:** The four layers run in cycles. PII discovered by ML/AI in one cycle is caught automatically by Layer 1 in all subsequent cycles across the complete dataset. Cycles repeat until near-zero new discoveries.
**Final PII registry:** ~302,000 unique mappings across 14 categories (IPs, hostnames, usernames, orgs, credentials, SIDs, emails, ARNs, etc.). All replacements are consistent — the same original value always maps to the same token, preserving graph topology.
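The consistency property described above (one original value, one token, everywhere) can be illustrated with a toy registry. This is a sketch and not the pipeline's code: the `Registry` class and its method names are hypothetical, the sequential allocation into RFC 5737 ranges is an assumption, and a regular-expression alternation stands in for the Aho-Corasick automaton, which does the same multi-pattern matching far more efficiently at 302,000 entries.

```python
import ipaddress
import re

# RFC 5737 documentation ranges (TEST-NET-1, TEST-NET-2, TEST-NET-3).
DOC_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]
DOC_ADDRESSES = [str(host) for net in DOC_NETWORKS for host in net.hosts()]


class Registry:
    """Maps each original value to one stable replacement token."""

    def __init__(self):
        self.mapping = {}
        self._ip_count = 0
        self._host_count = 0

    def token_for_ip(self, ip):
        if ip not in self.mapping:
            # Toy allocation: wraps around once the ranges are exhausted.
            self.mapping[ip] = DOC_ADDRESSES[self._ip_count % len(DOC_ADDRESSES)]
            self._ip_count += 1
        return self.mapping[ip]

    def token_for_host(self, host):
        if host not in self.mapping:
            self._host_count += 1
            self.mapping[host] = f"HOST-{self._host_count:04d}"
        return self.mapping[host]

    def sweep(self, text):
        """Replace every registered value found anywhere in `text`."""
        if not self.mapping:
            return text
        # Longest entries first so a longer value wins over its own prefix.
        keys = sorted(self.mapping, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(key) for key in keys))
        return pattern.sub(lambda match: self.mapping[match.group(0)], text)


# Invented values for illustration only.
reg = Registry()
reg.token_for_ip("10.1.2.3")
reg.token_for_host("dc01.corp.example")
raw = "Deny tcp src 10.1.2.3 dst dc01.corp.example (seen again: dc01.corp.example)"
clean = reg.sweep(raw)
print(clean)
```

Because the mapping is stable, two records that referred to the same host before sanitization still refer to the same token afterwards, which is what keeps graph edges connected.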
## Graph Data
| Component | Count |
|-----------|-------|
| Nodes | 23,362 |
| Edges | 32,732,650 |
| Incidents | 10,442 (Data Theft: 10,441, Phishing: 1) |
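These counts work out to roughly 1,400 edges per node on average (32,732,650 / 23,362), so collapsing parallel edges into weighted pairs, if the data contains any, may make the graph easier to handle. The sketch below counts edges per endpoint pair from JSONL lines. The key names `source` and `target` are assumptions, since the schema of `graph/edges.jsonl` is not described here, and the sample lines are invented.

```python
import json
from collections import Counter


def collapse_edges(lines, src_key="source", dst_key="target"):
    """Count parallel edges per (source, target) pair from JSONL lines."""
    weights = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        edge = json.loads(line)
        weights[(edge[src_key], edge[dst_key])] += 1
    return weights


# Hypothetical JSONL content; the real key names may differ.
sample_lines = [
    '{"source": "HOST-0001", "target": "HOST-0002"}',
    '{"source": "HOST-0001", "target": "HOST-0002"}',
    '{"source": "HOST-0003", "target": "HOST-0001"}',
]
weights = collapse_edges(sample_lines)
print(weights.most_common(1))
```

The function accepts any iterable of lines, so an open file handle can be passed directly and the edge file never has to be held in memory at once.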
## Limitations
- **Label imbalance**: 99.34% benign reflects production SOC reality. Sampling or reweighting strategies are needed for balanced training.
- **Temporal scope**: July–August 2024
- **Sanitization**: ~302,000 PII registry entries used for consistent replacement across all records
- **Shared incidents**: Same 10,442 incidents appear in both 2M and 114M datasets
- **Sanitization trade-offs**: Some log message detail reduced by PII replacement
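One simple sampling strategy for the label imbalance is to keep every non-benign row and a random fraction of benign rows. The generator below is a sketch of that idea. The function name and the synthetic rows are illustrative, and the right `keep_fraction` depends on the task. Based on the label counts, a fraction of about 0.0066 (748,480 / 113,326,050) would leave roughly as many benign rows as malicious and suspicious rows combined.

```python
import random


def downsample_benign(rows, keep_fraction, seed=0):
    """Yield every non-benign row and a random fraction of benign rows."""
    rng = random.Random(seed)
    for row in rows:
        if row["label_binary"] != "benign" or rng.random() < keep_fraction:
            yield row


# Synthetic rows for illustration only.
rows = [{"label_binary": "benign"}] * 1000 + [{"label_binary": "malicious"}] * 5
kept = list(downsample_benign(rows, keep_fraction=0.01))
n_malicious = sum(1 for row in kept if row["label_binary"] == "malicious")
print(len(kept), n_malicious)
```

Because it consumes an iterable and yields rows one at a time, the same function can wrap a dataset loaded with `streaming=True` without materializing all 114M rows.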
## Citation
```bibtex
@dataset{witfoo_precinct6_100m_2025,
title={WitFoo Precinct6 Cybersecurity Dataset (114M)},
author={WitFoo, Inc.},
year={2025},
url={https://huggingface.co/datasets/witfoo/precinct6-cybersecurity-100m},
license={Apache-2.0}
}
```
## License
[Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)