
bijaye/precinct6-cybersecurity-100m

Source: Hugging Face. Updated 2026-04-20; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/bijaye/precinct6-cybersecurity-100m
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- graph-ml
language:
- en
tags:
- cybersecurity
- intrusion-detection
- provenance-graphs
- MITRE-ATT&CK
- SOAR
- security-operations
- IDS
- network-security
- threat-detection
- labeled-dataset
- lead-rules
size_categories:
- 100M<n<1B
dataset_info:
- config_name: signals
  splits:
  - name: train
    num_examples: 114074530
configs:
- config_name: signals
  data_files:
  - split: train
    path: signals/*.parquet
- config_name: graph_nodes
  data_files:
  - split: train
    path: graph/nodes.jsonl
- config_name: graph_edges
  data_files:
  - split: train
    path: graph/edges.jsonl
- config_name: incidents
  data_files:
  - split: train
    path: graph/incidents.jsonl
---

# WitFoo Precinct6 Cybersecurity Dataset (114M)

## Overview

A large-scale, labeled cybersecurity dataset derived from production Security Operations Center (SOC) data processed by [WitFoo Precinct](https://www.witfoo.com/) version 6.x. This dataset contains **114 million sanitized security events** (signal logs) and **provenance graphs** (10,442 incident graphs with 23,362 nodes and 32,732,650 edges) from real enterprise network monitoring across 5 organizations.

**Available in two sizes:**

- [`witfoo/precinct6-cybersecurity`](https://huggingface.co/datasets/witfoo/precinct6-cybersecurity): 2M signals (smaller, faster to load)
- [`witfoo/precinct6-cybersecurity-100m`](https://huggingface.co/datasets/witfoo/precinct6-cybersecurity-100m): **114M signals (this dataset)**

**Generate your own:** WitFoo Precinct 6.x customers can create datasets from their own data using the open-source pipeline: [`witfoo/dataset-from-precinct6`](https://github.com/witfoo/dataset-from-precinct6)

This dataset is designed to support research in:

- **Provenance graph-based intrusion detection** (KnowHow, NodLink, and similar systems)
- **AI-driven cyber defense simulation** (CybORG and MARL-based defense policy training)
- **Security alert classification** (malicious vs. suspicious vs.
  benign event labeling)
- **Attack lifecycle analysis** using MITRE ATT&CK framework mappings
- **Detection rule evaluation** using WitFoo's 261 lead detection rules

## Quick Start

```python
from datasets import load_dataset

# Load flat signal logs (114M rows across 58 Parquet shards)
signals = load_dataset("witfoo/precinct6-cybersecurity-100m", "signals", split="train")

# Find malicious events (from confirmed incidents)
malicious = signals.filter(lambda x: x["label_binary"] == "malicious")

# Find suspicious events (matched detection rules)
suspicious = signals.filter(lambda x: x["label_binary"] == "suspicious")

# Query by product/vendor
cisco_events = signals.filter(lambda x: x["vendor_name"] == "Cisco")

# Load provenance graph
nodes = load_dataset("witfoo/precinct6-cybersecurity-100m", "graph_nodes", split="train")
edges = load_dataset("witfoo/precinct6-cybersecurity-100m", "graph_edges", split="train")

# Load full incident graphs
incidents = load_dataset("witfoo/precinct6-cybersecurity-100m", "incidents", split="train")
```

## Label Distribution

| Label | Count | Percentage |
|-------|-------|------------|
| `benign` | 113,326,050 | 99.34% |
| `malicious` | 125,780 | 0.11% |
| `suspicious` | 622,700 | 0.55% |

## Signal Columns

| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | float | Unix epoch timestamp |
| `message_type` | string | Event classification (e.g., `firewall_action`, `account_logon`, `AssumeRole`) |
| `stream_name` | string | Source product/data stream (see Source Products below) |
| `pipeline` | string | Ingestion pipeline |
| `src_ip` | string | Source IP (sanitized) |
| `dst_ip` | string | Destination IP (sanitized) |
| `src_port` | string | Source port |
| `dst_port` | string | Destination port |
| `protocol` | string | Network protocol (6=TCP, 17=UDP, 1=ICMP) |
| `src_host` | string | Source hostname (sanitized) |
| `dst_host` | string | Destination hostname (sanitized) |
| `username` | string | Associated username (sanitized) |
| `action` | string | Event action (block, permit, logon, logoff) |
| `severity` | string | Severity level |
| `vendor_code` | string | Vendor-specific event code |
| `message_sanitized` | string | Full sanitized raw log message |
| `label_binary` | string | `malicious`, `suspicious`, or `benign` |
| `label_confidence` | float | Confidence score (0.0–1.0) |
| `suspicion_score` | float | WitFoo suspicion score (0.0–1.0) |
| `mo_name` | string | Modus operandi (e.g., `Data Theft`) |
| `lifecycle_stage` | string | Kill chain stage (e.g., `initial-compromise`, `complete-mission`) |
| `matched_rules` | string | JSON array of matched WitFoo lead rule descriptions |
| `set_roles` | string | JSON array of classification roles (e.g., `Exploiting Host`, `C2 Server`) |
| `product_name` | string | Security product name (e.g., `ASA Firewall`, `Falcon`) |
| `vendor_name` | string | Product vendor (e.g., `Cisco`, `Crowdstrike`) |

## Source Products

The dataset contains events from **158 security products** across **70+ vendors**. Complete catalog in `reference/lead_rules_catalog.json`.
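To check which products dominate a slice of the data, the `vendor_name` and `product_name` columns can be tallied over any iterable of signal rows. The sketch below is only an illustration and is not part of the WitFoo pipeline: `tally_products` and the sample rows are hypothetical, and only the two column names come from the Signal Columns table.

```python
from collections import Counter
from itertools import islice


def tally_products(rows, limit=None):
    """Count (vendor_name, product_name) pairs over an iterable of signal rows.

    `rows` is any iterable of dicts; `limit` caps how many rows are read
    (None reads everything).
    """
    counts = Counter()
    for row in islice(rows, limit):
        # Treat missing or empty values as "unknown" rather than dropping the row.
        vendor = row.get("vendor_name") or "unknown"
        product = row.get("product_name") or "unknown"
        counts[(vendor, product)] += 1
    return counts


# Made-up rows standing in for real signal records.
sample_rows = [
    {"vendor_name": "Cisco", "product_name": "ASA Firewall"},
    {"vendor_name": "Cisco", "product_name": "ASA Firewall"},
    {"vendor_name": "Crowdstrike", "product_name": "Falcon"},
    {"vendor_name": None, "product_name": ""},
]

product_counts = tally_products(sample_rows)
for (vendor, product), n in product_counts.most_common():
    print(f"{vendor} / {product}: {n}")
```

Because the function accepts any iterable, it could also be fed a dataset opened with `load_dataset(..., streaming=True)` and a `limit`, which would avoid downloading every shard before counting.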

| Category | Products |
|----------|----------|
| **Firewalls** | Cisco ASA, Palo Alto PAN NGFW, Fortinet FortiGate, Checkpoint, Meraki, SonicWall, pfSense, Barracuda, Juniper SRX |
| **Endpoint Protection** | CrowdStrike Falcon, Symantec SEP, Carbon Black, Cylance, SentinelOne, Deep Instinct, Sophos, McAfee, ESET |
| **Network Detection** | Cisco Stealthwatch, Cisco Firepower, Suricata IDS, TippingPoint IPS, Vectra Cognito |
| **Identity & Access** | Microsoft Windows AD, Cisco ISE, Centrify, CyberArk, Duo, Okta, Beyond Trust |
| **Cloud Security** | AWS CloudTrail, AWS VPC Flow Logs, AWS GuardDuty, Azure Security, Zscaler, Netskope, Cisco Umbrella |
| **Email Security** | ProofPoint, Mimecast, FireEye EX, Barracuda ESS, Cisco IronPort, Checkpoint Harmony |
| **Threat Intelligence** | FireEye NX/HX/AX/CMS, Trend Micro, QRadar, Microsoft ATA, Cortex XDR |
| **Infrastructure** | VMware vCenter/NSX, Elastic Filebeat, Linux (sshd, PAM, systemd, auditd), Apache, HAProxy |

## Labeling Methodology

**Three-tier labels** derived from two sources:

- **`malicious`** (125,780): Events embedded as leads inside confirmed incidents. Extracted directly from incident lead objects with suspicion scores, modus operandi, and MITRE mappings.
- **`suspicious`** (622,700): Events matching WitFoo's 261 lead detection rules (e.g., "ASA Deny", "Windows Failed Login", "CrowdStrike Detection") but not in confirmed incidents.
- **`benign`** (113,326,050): Events not matching any detection rules and not in any incident.

The `matched_rules` column shows which detection rules matched. The `set_roles` column shows WitFoo classification roles (Exploiting Host, C2 Server, Staging Target, etc.).

## Lead Detection Rules

261 rules included in `reference/lead_rules_catalog.json`.
Examples:

| Rule | Criteria | Roles |
|------|----------|-------|
| Blocked Action | Any firewall block | Exploiting Host → Exploiting Target |
| ASA Deny | cisco_asa + deny | Exploiting Host → Exploiting Target |
| Windows Failed Login | Event ID 4625 | Exploiting Target → Exploiting Host |
| CrowdStrike Detection | CrowdStrike stream | Exploiting Target → Exploiting Host |
| AWS VPC Reject | VPC flow + REJECT | Exploiting Host → Exploiting Target |
| Authentication Failure | auth_failure type | Exploiting Host → Exploiting Target |
| Audit log cleared | Event ID 1102 | Exploiting Target → Exploiting Host |

## Sanitization

All customer-identifying information has been removed through a comprehensive, iterative four-layer sanitization pipeline. Quality was prioritized over processing speed. The pipeline is [open source](https://github.com/witfoo/dataset-from-precinct6) under the Apache 2.0 license.

1. **Structured field sanitization + Aho-Corasick multi-pattern sweep**: Known fields are replaced with deterministic tokens (IPs to [RFC 5737](https://datatracker.ietf.org/doc/html/rfc5737) ranges, hostnames to `HOST-NNNN`, etc.). Every record is then swept using an [Aho-Corasick automaton](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) built from all 302,000+ registry entries to catch PII in unexpected contexts. Product identifiers are explicitly protected from the sweep.
2. **Format-specific log message parsing**: Eight specialized parsers handle Cisco ASA syslog, Windows Security Event XML, WinLogBeat JSON, AWS CloudTrail, Palo Alto Networks, VMware vCenter, DNS logs, and a generic fallback, sanitizing PII within structured fields like XML elements, nested JSON, and CSV columns.
3. **Machine learning residual detection**: [Microsoft Presidio](https://microsoft.github.io/presidio/) (spaCy NLP) and [BERT NER](https://huggingface.co/dslim/bert-base-NER) scan sanitized records for residual PII that pattern-based approaches miss.
   New discoveries trigger full re-sanitization.
4. **Large language model contextual review**: [Claude](https://www.anthropic.com/claude) reviews stratified samples for subtle PII: embedded org names, internal hostnames revealing organizational structure, employee names in file paths, AD group names, and geographic identifiers. Findings trigger re-sanitization.

**Iterative convergence:** The four layers run in cycles. PII discovered by ML/AI in one cycle is caught automatically by Layer 1 in all subsequent cycles across the complete dataset. Cycles repeat until near-zero new discoveries.

**Final PII registry:** ~302,000 unique mappings across 14 categories (IPs, hostnames, usernames, orgs, credentials, SIDs, emails, ARNs, etc.). All replacements are consistent: the same original value always maps to the same token, preserving graph topology.

## Graph Data

| Component | Count |
|-----------|-------|
| Nodes | 23,362 |
| Edges | 32,732,650 |
| Incidents | 10,442 (Data Theft: 10,441, Phishing: 1) |

## Limitations

- **Label imbalance**: 99.34% of events are benign (see Label Distribution), which reflects production SOC reality. Sampling strategies are needed for balanced training.
- **Temporal scope**: July–August 2024
- **Sanitization**: ~302,000 PII registry entries used for consistent replacement across all records
- **Shared incidents**: The same 10,442 incidents appear in both the 2M and 114M datasets
- **Sanitization trade-offs**: Some log message detail is reduced by PII replacement

## Citation

```bibtex
@dataset{witfoo_precinct6_100m_2025,
  title={WitFoo Precinct6 Cybersecurity Dataset (114M)},
  author={WitFoo, Inc.},
  year={2025},
  url={https://huggingface.co/datasets/witfoo/precinct6-cybersecurity-100m},
  license={Apache-2.0}
}
```

## License

[Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
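
The label imbalance noted under Limitations can be made concrete with the counts from the Label Distribution table. The sketch below computes inverse-frequency class weights (the common `total / (n_classes * count)` heuristic) and the fraction of benign rows to keep for a target benign-to-non-benign ratio. Neither is a WitFoo recommendation: `class_weights`, `benign_keep_probability`, and the 10:1 target are hypothetical choices made for illustration.

```python
# Label counts copied from the Label Distribution table.
LABEL_COUNTS = {
    "benign": 113_326_050,
    "malicious": 125_780,
    "suspicious": 622_700,
}


def class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def benign_keep_probability(counts, benign_per_other=10.0):
    """Fraction of benign rows to keep so that roughly `benign_per_other`
    benign rows remain for every non-benign row (assumed target ratio)."""
    others = sum(n for label, n in counts.items() if label != "benign")
    return min(1.0, benign_per_other * others / counts["benign"])


weights = class_weights(LABEL_COUNTS)
keep_p = benign_keep_probability(LABEL_COUNTS)

for label, w in weights.items():
    print(f"{label}: weight {w:.3f}")
print(f"keep about {keep_p:.2%} of benign rows for a 10:1 ratio")
```

The keep-probability could drive a row-level filter (for example, keeping a benign row when `random.random() < keep_p`), while the weights could be passed to a weighted loss. Downsampling shrinks the training set; weighting keeps every row at the cost of training on the full 114M events.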
Provider:
bijaye