nist-cybersecurity-training

Name: nist-cybersecurity-training
Creator: maas
Published: 2026-04-28 16:52:29
License: No description available

ModelScope Community · Updated 2026-04-28 · Indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/AI-ModelScope/nist-cybersecurity-training

Resource description:

# NIST Cybersecurity Training Dataset v1.1

**The largest open-source NIST cybersecurity training dataset for fine-tuning LLMs**

## Version 1.1 Highlights

**What's New in v1.1**:

- ✅ Added **CSWP (Cybersecurity White Papers)** series - 23 new documents
- ✅ Fixed **6,150 broken DOI links** via format normalization
- ✅ Removed **202 malformed DOIs** (double URL prefixes)
- ✅ Validated and fixed **124,946 total links**
- ✅ Cataloged **72,698 broken links** for future recovery
- ✅ **0 broken link markers** remaining in training data
- ✅ Increased from **523,706** to **530,912 examples** (+7,206)

## Dataset Overview

This dataset contains structured training data extracted from **596 NIST publications** including:

- **FIPS** (Federal Information Processing Standards)
- **SP** (Special Publications) - 800 & 1800 series
- **IR** (Interagency/Internal Reports)
- **CSWP** (Cybersecurity White Papers) ✨ NEW in v1.1

### Statistics (v1.1)

- **Total Examples**: 530,912
- **Training Split**: 424,729 examples (80%)
- **Validation Split**: 106,183 examples (20%)
- **Documents Processed**: 596 NIST publications
- **Working DOI Links**: 22,252
- **Working External URLs**: 39,228
- **Average Content Length**: 539 characters
- **Median Content Length**: 298 characters

### Example Distribution

| Type | Count | Description |
|------|-------|-------------|
| Sections | 263,252 | Document sections with contextual content |
| Semantic Chunks | 136,320 | Semantically coherent text chunks |
| Controls | 88,126 | Security control descriptions (SP 800-53) |
| Definitions | 43,214 | Technical term definitions |

## What's Included

### Core Training Data

- **train.jsonl** - 424,729 examples for training
- **valid.jsonl** - 106,183 examples for validation

Each example contains:

```json
{
  "messages": [
    {"role": "system", "content": "You are a cybersecurity expert..."},
    {"role": "user", "content": "What is Zero Trust Architecture?"},
    {"role": "assistant", "content": "According to NIST SP 800-207..."}
  ],
  "metadata": {
    "source": "NIST SP 800-207",
    "type": "section",
    "chunk_id": 0
  }
}
```

### Vector Embeddings (Optional)

- **train_embeddings.parquet** - 1536-dim embeddings for all training examples
- **valid_embeddings.parquet** - 1536-dim embeddings for validation
- **train_index.faiss** - FAISS index for similarity search
- **valid_index.faiss** - FAISS index for validation set

Generated using OpenAI `text-embedding-3-small` for RAG applications.

## v1.1 Quality Improvements

### Link Validation & Cleanup

**Broken Links Cataloged** (for future recovery):

- **Broken DOIs**: 10,822 (814 unique)
- **Broken URLs**: 61,876 (6,837 unique)
- **Total Examples Affected**: 33,105 (6.2% of dataset)

All broken links have been:

1. ✅ Cataloged with context for future recovery
2. ✅ Removed from training data (no `[BROKEN-DOI:]` or `[BROKEN-URL:]` markers)
3. ✅ Documented in `broken_links_catalog.json` (available in source repo)

**Link Fixes Applied**:

- Fixed 6,150 DOI format variations (NIST.SP ↔ NIST-SP)
- Removed 202 malformed DOI double prefixes
- Validated 124,946 total links
- 0 malformed DOIs remaining

### New Document Series: CSWP

Added 23 Cybersecurity White Papers including:

- NIST Cybersecurity Framework (CSF) 2.0
- Planning for Zero Trust Architecture
- Post-Quantum Cryptography guidance
- IoT Cybersecurity Labeling criteria
- Privacy Framework v1.0
- Cyber Supply Chain Risk Management case studies

## Use Cases

1. **Fine-tune LLMs** for NIST cybersecurity expertise
2. **RAG applications** with validated embeddings
3. **Chatbots** for compliance and security guidance
4. **Question answering** about NIST standards
5. **Automated compliance** checking tools

## Data Format

**JSONL Chat Format** (compatible with OpenAI, Anthropic, MLX):

```python
import json

with open('train.jsonl', 'r') as f:
    for line in f:
        example = json.loads(line)
        messages = example['messages']
        metadata = example['metadata']
        # Use for fine-tuning
```

## Training Example

**Fine-tuning with MLX (Apple Silicon)**:

```bash
python -m mlx_lm.lora \
  --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit \
  --train \
  --data . \
  --iters 1000 \
  --batch-size 4 \
  --adapter-path nist-lora
```

**Training with Transformers**:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("ethanolivertroy/nist-cybersecurity-training")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
# Fine-tune with your preferred trainer
```

## Known Limitations (v1.1)

1. **Broken Links**: 72,698 links could not be validated (cataloged for future recovery)
2. **Document Coverage**: NIST continuously publishes new documents
3. **Link Freshness**: External references may become outdated over time
4. **Chunking**: Some long documents may have context boundaries in chunks

See `broken_links_catalog.json` in the source repository for detailed broken link tracking.
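
Before fine-tuning, it can help to confirm that each JSONL record matches the chat format shown under "Data Format" and to tally records by `metadata.type`. The following is a minimal sketch, not part of the dataset's own tooling. The helper names `check_example` and `summarize` are hypothetical, and the assumption that every record carries `system`, `user`, and `assistant` messages is inferred only from the single sample record in this README.

```python
import json
from collections import Counter

# Roles seen in the README's sample record; assumed (not confirmed) for all records.
EXPECTED_ROLES = ("system", "user", "assistant")


def check_example(example):
    """Return a list of problems found in one parsed record (empty if none)."""
    if not isinstance(example, dict):
        return ["record is not a JSON object"]
    problems = []
    messages = example.get("messages")
    if (
        not isinstance(messages, list)
        or not messages
        or not all(isinstance(m, dict) for m in messages)
    ):
        problems.append("'messages' must be a non-empty list of objects")
    else:
        roles = [m.get("role") for m in messages]
        for role in EXPECTED_ROLES:
            if role not in roles:
                problems.append(f"no '{role}' message")
        if any(not isinstance(m.get("content"), str) for m in messages):
            problems.append("non-string 'content'")
    if not isinstance(example.get("metadata"), dict):
        problems.append("missing 'metadata' object")
    return problems


def summarize(lines):
    """Count well-formed records per metadata 'type'; collect (line_no, problems)."""
    type_counts = Counter()
    bad = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            bad.append((line_no, [f"invalid JSON: {exc.msg}"]))
            continue
        problems = check_example(example)
        if problems:
            bad.append((line_no, problems))
        else:
            type_counts[example["metadata"].get("type", "unknown")] += 1
    return type_counts, bad


if __name__ == "__main__":
    # In-memory demo; for the real file, pass an open handle to train.jsonl instead.
    record = {
        "messages": [
            {"role": "system", "content": "You are a cybersecurity expert..."},
            {"role": "user", "content": "What is Zero Trust Architecture?"},
            {"role": "assistant", "content": "According to NIST SP 800-207..."},
        ],
        "metadata": {"source": "NIST SP 800-207", "type": "section", "chunk_id": 0},
    }
    counts, bad = summarize([json.dumps(record), "{not json"])
    print(dict(counts), bad)
```

To run it over the real split, `summarize(open("train.jsonl", encoding="utf-8"))` streams the file line by line, so the full 424,729-example file never has to be held in memory. The per-type tallies can then be compared against the "Example Distribution" table, keeping in mind that the exact `type` strings other than `"section"` are not documented here.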
## Changelog

### v1.1 (2025-10-21)

- Added CSWP (Cybersecurity White Papers) series: 23 documents
- Fixed 6,150 broken DOI links via format normalization
- Removed 202 malformed DOIs (double URL prefixes)
- Cataloged 72,698 broken links for future recovery
- Increased dataset from 523,706 to 530,912 examples (+7,206)
- Improved link validation: 124,946 total links processed
- Clean dataset: 0 broken link markers remaining

### v1.0 (2025-10-15)

- Initial release with 523,706 training examples
- 568 NIST documents extracted (FIPS, SP, IR series)
- Published largest NIST cybersecurity dataset on Hugging Face

## Source Code

Full pipeline and scripts: [GitHub Repository](https://github.com/ethanolivertroy/nist-tuned-model)

## Citation

```bibtex
@misc{nist-cybersecurity-training-v1.1,
  title={NIST Cybersecurity Training Dataset},
  author={Troy, Ethan Oliver},
  year={2025},
  version={1.1},
  url={https://huggingface.co/datasets/ethanolivertroy/nist-cybersecurity-training}
}
```

## License

CC0 1.0 Universal (Public Domain)

All NIST publications are in the public domain.

## Acknowledgments

- NIST Computer Security Resource Center (CSRC)
- Docling PDF extraction framework
- MLX framework for Apple Silicon training
- OpenAI for embedding generation API

---

**Last Updated**: 2025-10-21
**Dataset Version**: 1.1
**Total Examples**: 530,912
**Documents**: 596 NIST publications
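
The v1.1 changelog mentions two DOI cleanups: normalizing the `NIST.SP` / `NIST-SP` separator and dropping DOIs with a doubled URL prefix. The sketch below shows one way such a pass might look. It is an illustration under stated assumptions, not the pipeline from the source repository. The function names, the regular expression, the `resolves` callback, and the sample DOI strings are all hypothetical.

```python
import re

# Matches a NIST series token joined by '.' or '-', e.g. "NIST.SP" or "NIST-SP".
# The list of series names is taken from the README; the pattern itself is an assumption.
SERIES_SEP = re.compile(r"\bNIST[.-](SP|IR|FIPS|CSWP)\b")


def has_double_prefix(url):
    """True when the DOI resolver host appears more than once in the URL."""
    return url.count("doi.org/") > 1


def separator_variants(url):
    """Return the URL rewritten with '.' and with '-' as the series separator."""
    dotted = SERIES_SEP.sub(r"NIST.\1", url)
    dashed = SERIES_SEP.sub(r"NIST-\1", url)
    return [dotted, dashed] if dotted != dashed else [dotted]


def repair(url, resolves):
    """Return the first separator variant accepted by `resolves`, else None.

    `resolves` is a caller-supplied callable (hypothetical) that reports whether
    a URL is reachable; no network access happens inside this function.
    """
    if has_double_prefix(url):
        return None  # the README reports these were removed rather than rewritten
    for candidate in separator_variants(url):
        if resolves(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Illustrative stand-in for a real reachability check.
    known_good = {"https://doi.org/10.6028/NIST.SP.800-53r5"}
    print(repair("https://doi.org/10.6028/NIST-SP.800-53r5", known_good.__contains__))
```

Passing the reachability check in as a callable keeps the rewriting logic testable offline; a real run would presumably plug in an HTTP check and record unresolved links in something like `broken_links_catalog.json`.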


Provider:

maas

Created:

2025-10-16
