
nist-cybersecurity-training

ModelScope community. Updated 2026-04-28; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/nist-cybersecurity-training
Description:
# NIST Cybersecurity Training Dataset v1.1

**The largest open-source NIST cybersecurity training dataset for fine-tuning LLMs**

## Version 1.1 Highlights

**What's New in v1.1**:

- ✅ Added **CSWP (Cybersecurity White Papers)** series - 23 new documents
- ✅ Fixed **6,150 broken DOI links** via format normalization
- ✅ Removed **202 malformed DOIs** (double URL prefixes)
- ✅ Validated and fixed **124,946 total links**
- ✅ Cataloged **72,698 broken links** for future recovery
- ✅ **0 broken link markers** remaining in training data
- ✅ Increased from **523,706** to **530,912 examples** (+7,206)

## Dataset Overview

This dataset contains structured training data extracted from **596 NIST publications** including:

- **FIPS** (Federal Information Processing Standards)
- **SP** (Special Publications) - 800 & 1800 series
- **IR** (Interagency/Internal Reports)
- **CSWP** (Cybersecurity White Papers) ✨ NEW in v1.1

### Statistics (v1.1)

- **Total Examples**: 530,912
- **Training Split**: 424,729 examples (80%)
- **Validation Split**: 106,183 examples (20%)
- **Documents Processed**: 596 NIST publications
- **Working DOI Links**: 22,252
- **Working External URLs**: 39,228
- **Average Content Length**: 539 characters
- **Median Content Length**: 298 characters

### Example Distribution

| Type | Count | Description |
|------|-------|-------------|
| Sections | 263,252 | Document sections with contextual content |
| Semantic Chunks | 136,320 | Semantically coherent text chunks |
| Controls | 88,126 | Security control descriptions (SP 800-53) |
| Definitions | 43,214 | Technical term definitions |

## What's Included

### Core Training Data

- **train.jsonl** - 424,729 examples for training
- **valid.jsonl** - 106,183 examples for validation

Each example contains:

```json
{
  "messages": [
    {"role": "system", "content": "You are a cybersecurity expert..."},
    {"role": "user", "content": "What is Zero Trust Architecture?"},
    {"role": "assistant", "content": "According to NIST SP 800-207..."}
  ],
  "metadata": {
    "source": "NIST SP 800-207",
    "type": "section",
    "chunk_id": 0
  }
}
```

### Vector Embeddings (Optional)

- **train_embeddings.parquet** - 1536-dim embeddings for all training examples
- **valid_embeddings.parquet** - 1536-dim embeddings for validation
- **train_index.faiss** - FAISS index for similarity search
- **valid_index.faiss** - FAISS index for validation set

Generated using OpenAI `text-embedding-3-small` for RAG applications.

## v1.1 Quality Improvements

### Link Validation & Cleanup

**Broken Links Cataloged** (for future recovery):

- **Broken DOIs**: 10,822 (814 unique)
- **Broken URLs**: 61,876 (6,837 unique)
- **Total Examples Affected**: 33,105 (6.2% of dataset)

All broken links have been:

1. ✅ Cataloged with context for future recovery
2. ✅ Removed from training data (no `[BROKEN-DOI:]` or `[BROKEN-URL:]` markers)
3. ✅ Documented in `broken_links_catalog.json` (available in source repo)

**Link Fixes Applied**:

- Fixed 6,150 DOI format variations (NIST.SP ↔ NIST-SP)
- Removed 202 malformed DOI double prefixes
- Validated 124,946 total links
- 0 malformed DOIs remaining

### New Document Series: CSWP

Added 23 Cybersecurity White Papers including:

- NIST Cybersecurity Framework (CSF) 2.0
- Planning for Zero Trust Architecture
- Post-Quantum Cryptography guidance
- IoT Cybersecurity Labeling criteria
- Privacy Framework v1.0
- Cyber Supply Chain Risk Management case studies

## Use Cases

1. **Fine-tune LLMs** for NIST cybersecurity expertise
2. **RAG applications** with validated embeddings
3. **Chatbots** for compliance and security guidance
4. **Question answering** about NIST standards
5. **Automated compliance** checking tools

## Data Format

**JSONL Chat Format** (compatible with OpenAI, Anthropic, MLX):

```python
import json

with open('train.jsonl', 'r') as f:
    for line in f:
        example = json.loads(line)
        messages = example['messages']
        metadata = example['metadata']
        # Use for fine-tuning
```

## Training Example

**Fine-tuning with MLX (Apple Silicon)**:

```bash
python -m mlx_lm.lora \
  --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit \
  --train \
  --data . \
  --iters 1000 \
  --batch-size 4 \
  --adapter-path nist-lora
```

**Training with Transformers**:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("ethanolivertroy/nist-cybersecurity-training")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
# Fine-tune with your preferred trainer
```

## Known Limitations (v1.1)

1. **Broken Links**: 72,698 links could not be validated (cataloged for future recovery)
2. **Document Coverage**: NIST continuously publishes new documents
3. **Link Freshness**: External references may become outdated over time
4. **Chunking**: Some long documents may have context boundaries in chunks

See `broken_links_catalog.json` in the source repository for detailed broken link tracking.
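
The DOI clean-up listed under Link Fixes Applied (format variations such as NIST.SP versus NIST-SP, and doubled URL prefixes) might look roughly like the sketch below. It is an illustration only, not the dataset's actual pipeline. The function names, the choice of the dotted `NIST.SP` form as the canonical one, and the list of series tokens are all assumptions.

```python
import re

# Assumption: hyphenated series tokens such as "NIST-SP" are the variant to fix,
# and the dotted form ("NIST.SP") is the canonical one.
_HYPHEN_SERIES = re.compile(r"NIST-(SP|IR|FIPS|CSWP)\b", re.IGNORECASE)


def has_double_prefix(link: str) -> bool:
    """Return True when the doi.org resolver prefix appears more than once."""
    return link.lower().count("doi.org/") > 1


def normalize_nist_doi(link: str) -> str:
    """Rewrite 'NIST-SP' style series tokens to the dotted 'NIST.SP' form."""
    return _HYPHEN_SERIES.sub(
        lambda m: f"NIST.{m.group(1).upper()}", link.strip()
    )
```

A link flagged by `has_double_prefix` could be dropped or repaired separately. The changelog says such DOIs were removed, so this sketch only detects them.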
## Changelog

### v1.1 (2025-10-21)

- Added CSWP (Cybersecurity White Papers) series: 23 documents
- Fixed 6,150 broken DOI links via format normalization
- Removed 202 malformed DOIs (double URL prefixes)
- Cataloged 72,698 broken links for future recovery
- Increased dataset from 523,706 to 530,912 examples (+7,206)
- Improved link validation: 124,946 total links processed
- Clean dataset: 0 broken link markers remaining

### v1.0 (2025-10-15)

- Initial release with 523,706 training examples
- 568 NIST documents extracted (FIPS, SP, IR series)
- Published largest NIST cybersecurity dataset on Hugging Face

## Source Code

Full pipeline and scripts: [GitHub Repository](https://github.com/ethanolivertroy/nist-tuned-model)

## Citation

```bibtex
@misc{nist-cybersecurity-training-v1.1,
  title={NIST Cybersecurity Training Dataset},
  author={Troy, Ethan Oliver},
  year={2025},
  version={1.1},
  url={https://huggingface.co/datasets/ethanolivertroy/nist-cybersecurity-training}
}
```

## License

CC0 1.0 Universal (Public Domain)

All NIST publications are in the public domain.

## Acknowledgments

- NIST Computer Security Resource Center (CSRC)
- Docling PDF extraction framework
- MLX framework for Apple Silicon training
- OpenAI for embedding generation API

---

- **Last Updated**: 2025-10-21
- **Dataset Version**: 1.1
- **Total Examples**: 530,912
- **Documents**: 596 NIST publications
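
To sanity-check a downloaded copy against the Example Distribution table, a small tally over the `metadata.type` field may help. This is a minimal sketch, assuming every line is a JSON object shaped like the example record in the README. `count_example_types` is a hypothetical helper, and `"section"` is the only `type` value the README actually shows.

```python
import json
from collections import Counter
from typing import Iterable


def count_example_types(lines: Iterable[str]) -> Counter:
    """Count examples per metadata 'type' in an iterable of JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        example = json.loads(line)
        # Records without a type are grouped under "unknown" (an assumption).
        counts[example.get("metadata", {}).get("type", "unknown")] += 1
    return counts


# In-memory sample standing in for a few lines of train.jsonl.
sample = [
    '{"messages": [], "metadata": {"source": "NIST SP 800-207", "type": "section", "chunk_id": 0}}',
    '{"messages": [], "metadata": {"source": "NIST SP 800-207", "type": "section", "chunk_id": 1}}',
    '{"messages": [], "metadata": {"source": "NIST SP 800-207"}}',
]
sample_counts = count_example_types(sample)
```

Passing an open file handle (for instance `open("train.jsonl", encoding="utf-8")`) in place of `sample` works the same way. The four counts in the table sum to 530,912, so they appear to describe both splits combined. A tally of `train.jsonl` alone would therefore be expected to come out lower.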

Provider:
maas
Created:
2025-10-16