nist-cybersecurity-training
ModelScope community · Updated 2026-04-28 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/AI-ModelScope/nist-cybersecurity-training
Resource description:
# NIST Cybersecurity Training Dataset v1.1
**The largest open-source NIST cybersecurity training dataset for fine-tuning LLMs**
## Version 1.1 Highlights
**What's New in v1.1**:
- ✅ Added **CSWP (Cybersecurity White Papers)** series - 23 new documents
- ✅ Fixed **6,150 broken DOI links** via format normalization
- ✅ Removed **202 malformed DOIs** (double URL prefixes)
- ✅ Validated **124,946 total links**, fixing those that could be repaired
- ✅ Cataloged **72,698 broken links** for future recovery
- ✅ **0 broken link markers** remaining in training data
- ✅ Increased from **523,706** to **530,912 examples** (+7,206)
## Dataset Overview
This dataset contains structured training data extracted from **596 NIST publications** including:
- **FIPS** (Federal Information Processing Standards)
- **SP** (Special Publications) - 800 & 1800 series
- **IR** (Interagency/Internal Reports)
- **CSWP** (Cybersecurity White Papers) ✨ NEW in v1.1
### Statistics (v1.1)
- **Total Examples**: 530,912
- **Training Split**: 424,729 examples (80%)
- **Validation Split**: 106,183 examples (20%)
- **Documents Processed**: 596 NIST publications
- **Working DOI Links**: 22,252
- **Working External URLs**: 39,228
- **Average Content Length**: 539 characters
- **Median Content Length**: 298 characters
### Example Distribution
| Type | Count | Description |
|------|-------|-------------|
| Sections | 263,252 | Document sections with contextual content |
| Semantic Chunks | 136,320 | Semantically coherent text chunks |
| Controls | 88,126 | Security control descriptions (SP 800-53) |
| Definitions | 43,214 | Technical term definitions |
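As a quick consistency check, the figures above can be recomputed from the table and the split sizes. This is only an arithmetic sketch over the numbers quoted in this card; the variable names are illustrative.

```python
# Counts copied from the Example Distribution table above.
counts = {
    "Sections": 263_252,
    "Semantic Chunks": 136_320,
    "Controls": 88_126,
    "Definitions": 43_214,
}

total = sum(counts.values())

# Share of each example type, as a percentage rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Split sizes copied from the Statistics list above.
train_size, valid_size = 424_729, 106_183
train_fraction = train_size / (train_size + valid_size)

print(total, shares, round(train_fraction, 3))
```

The four type counts add up to the stated total of 530,912, and the training split works out to 80% after rounding.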
## What's Included
### Core Training Data
- **train.jsonl** - 424,729 examples for training
- **valid.jsonl** - 106,183 examples for validation
Each example contains:
```json
{
  "messages": [
    {"role": "system", "content": "You are a cybersecurity expert..."},
    {"role": "user", "content": "What is Zero Trust Architecture?"},
    {"role": "assistant", "content": "According to NIST SP 800-207..."}
  ],
  "metadata": {
    "source": "NIST SP 800-207",
    "type": "section",
    "chunk_id": 0
  }
}
```
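A minimal sketch of checking that a record follows the structure shown above. The function name `validate_example` and the set of allowed roles are assumptions made for illustration; they are not part of the dataset's own tooling.

```python
# Assumption: only these three roles occur, as in the example record above.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = []

    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"messages[{i}] must be an object")
                continue
            if msg.get("role") not in ALLOWED_ROLES:
                problems.append(f"messages[{i}] has an unexpected role")
            if not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}] content must be a string")

    metadata = example.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("metadata must be an object")
    else:
        # Keys taken from the example record above.
        for key in ("source", "type", "chunk_id"):
            if key not in metadata:
                problems.append(f"metadata is missing {key}")

    return problems


record = {
    "messages": [
        {"role": "system", "content": "You are a cybersecurity expert..."},
        {"role": "user", "content": "What is Zero Trust Architecture?"},
        {"role": "assistant", "content": "According to NIST SP 800-207..."},
    ],
    "metadata": {"source": "NIST SP 800-207", "type": "section", "chunk_id": 0},
}
print(validate_example(record))
```

Running a check like this over a sample of lines before training can catch truncated downloads or records that were edited by hand.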
### Vector Embeddings (Optional)
- **train_embeddings.parquet** - 1536-dim embeddings for all training examples
- **valid_embeddings.parquet** - 1536-dim embeddings for validation
- **train_index.faiss** - FAISS index for similarity search
- **valid_index.faiss** - FAISS index for validation set
Generated using OpenAI `text-embedding-3-small` for RAG applications.
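The shipped FAISS indexes handle nearest-neighbour lookup at scale. The sketch below shows only the underlying idea in plain Python, using toy 3-dimensional vectors in place of the real 1536-dimensional embeddings. The card does not state which distance metric the indexes use, so cosine similarity is an assumption here, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors standing in for rows of train_embeddings.parquet.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```

In a RAG pipeline the returned indices would be used to look up the matching training examples and pass their text to the model as context.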
## v1.1 Quality Improvements
### Link Validation & Cleanup
**Broken Links Cataloged** (for future recovery):
- **Broken DOIs**: 10,822 (814 unique)
- **Broken URLs**: 61,876 (6,837 unique)
- **Total Examples Affected**: 33,105 (6.2% of dataset)
All broken links have been:
1. ✅ Cataloged with context for future recovery
2. ✅ Removed from training data (no `[BROKEN-DOI:]` or `[BROKEN-URL:]` markers)
3. ✅ Documented in `broken_links_catalog.json` (available in source repo)
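The claim that no markers remain can be spot-checked with a short scan. This sketch assumes markers take the form `[BROKEN-DOI:...]` or `[BROKEN-URL:...]` with the link text after the colon; the exact marker layout is not documented in this card, and `count_markers` is a hypothetical helper.

```python
import json
import re

# Assumption: a marker is a bracketed tag with arbitrary text after the colon.
MARKER = re.compile(r"\[BROKEN-(?:DOI|URL):[^\]]*\]")


def count_markers(jsonl_lines):
    """Count marker occurrences across all message contents."""
    total = 0
    for line in jsonl_lines:
        example = json.loads(line)
        for message in example["messages"]:
            total += len(MARKER.findall(message["content"]))
    return total


# Made-up sample lines; the first one contains a marker on purpose.
sample = [
    json.dumps({"messages": [{"role": "assistant",
                              "content": "See [BROKEN-URL:http://example.invalid] for details."}]}),
    json.dumps({"messages": [{"role": "assistant", "content": "No links here."}]}),
]
marker_count = count_markers(sample)
print(marker_count)
```

For the real data, passing an open file handle for `train.jsonl` or `valid.jsonl` to `count_markers` should return 0 if the cleanup described above holds.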
**Link Fixes Applied**:
- Fixed 6,150 DOI format variations (NIST.SP ↔ NIST-SP)
- Removed 202 malformed DOI double prefixes
- Validated 124,946 total links
- 0 malformed DOIs remaining
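The two fixes above can be sketched roughly as follows. This is not the project's actual pipeline code: it assumes that the dotted form `NIST.SP` is the canonical one, that `https://doi.org/` is the prefix that was doubled, and the malformed inputs are made up for illustration.

```python
DOI_PREFIX = "https://doi.org/"


def has_double_prefix(url: str) -> bool:
    """True if the resolver prefix appears twice at the start of the URL."""
    return url.startswith(DOI_PREFIX + DOI_PREFIX)


def normalize_sp_doi(url: str) -> str:
    """Rewrite the hyphenated series form to the dotted one (assumed canonical)."""
    return url.replace("NIST-SP", "NIST.SP")


# Illustrative inputs: one hyphenated variant and one double-prefixed link.
links = [
    "https://doi.org/10.6028/NIST-SP.800-207",
    "https://doi.org/https://doi.org/10.6028/NIST.SP.800-53r5",
]

# Drop double-prefixed links, then normalize the rest.
cleaned = [normalize_sp_doi(u) for u in links if not has_double_prefix(u)]
print(cleaned)
```

Dropping the double-prefixed link mirrors the "removed" wording above; a pipeline could instead strip the extra prefix and keep the link.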
### New Document Series: CSWP
Added 23 Cybersecurity White Papers including:
- NIST Cybersecurity Framework (CSF) 2.0
- Planning for Zero Trust Architecture
- Post-Quantum Cryptography guidance
- IoT Cybersecurity Labeling criteria
- Privacy Framework v1.0
- Cyber Supply Chain Risk Management case studies
## Use Cases
1. **Fine-tune LLMs** for NIST cybersecurity expertise
2. **RAG applications** with validated embeddings
3. **Chatbots** for compliance and security guidance
4. **Question answering** about NIST standards
5. **Automated compliance** checking tools
## Data Format
**JSONL Chat Format** (compatible with OpenAI, Anthropic, MLX):
```python
import json

# Stream the file: each line holds one JSON object
with open('train.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        example = json.loads(line)
        messages = example['messages']
        metadata = example['metadata']
        # Use for fine-tuning
```
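Building on the loading loop above, a small sketch of tallying examples by `metadata.type`, which is useful for sampling or filtering a subset. Only the value `"section"` appears in this card; the value `"control"` in the sample is a guess at how other types might be labeled, and `count_by_type` is a hypothetical helper.

```python
import json
from collections import Counter


def count_by_type(jsonl_lines):
    """Tally examples by their metadata type, skipping blank lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)["metadata"]["type"]] += 1
    return counts


# Made-up lines in the same shape as the dataset records.
sample_lines = [
    '{"messages": [], "metadata": {"source": "NIST SP 800-207", "type": "section", "chunk_id": 0}}',
    '{"messages": [], "metadata": {"source": "NIST SP 800-207", "type": "section", "chunk_id": 1}}',
    "",
    '{"messages": [], "metadata": {"source": "NIST SP 800-53", "type": "control", "chunk_id": 0}}',
]
type_counts = count_by_type(sample_lines)
print(type_counts)
```

Because the function accepts any iterable of lines, an open file handle for `train.jsonl` can be passed in directly without loading the whole file into memory.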
## Training Example
**Fine-tuning with MLX (Apple Silicon)**:
```bash
python -m mlx_lm.lora \
--model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit \
--train \
--data . \
--iters 1000 \
--batch-size 4 \
--adapter-path nist-lora
```
**Training with Transformers**:
```python
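from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("ethanolivertroy/nist-cybersecurity-training")
# Load the base model and its matching tokenizer
model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Fine-tune with your preferred trainer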
```
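Most supervised fine-tuning setups need each chat separated into the prompt and the reply the model should learn to produce. The sketch below shows that step in plain Python so it runs without downloading a model; `split_prompt_and_target` is a hypothetical helper. With a Transformers tokenizer, the prompt part would typically then be rendered with `tokenizer.apply_chat_template`.

```python
def split_prompt_and_target(messages):
    """Split a chat into (prompt messages, final assistant reply)."""
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the last message to come from the assistant")
    return messages[:-1], messages[-1]["content"]


# Messages copied from the example record in this card.
example_messages = [
    {"role": "system", "content": "You are a cybersecurity expert..."},
    {"role": "user", "content": "What is Zero Trust Architecture?"},
    {"role": "assistant", "content": "According to NIST SP 800-207..."},
]
prompt, target = split_prompt_and_target(example_messages)
print(len(prompt), target)
```

Keeping the target separate makes it straightforward to mask the prompt tokens in the loss, if the chosen trainer supports that.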
## Known Limitations (v1.1)
1. **Broken Links**: 72,698 links could not be validated (cataloged for future recovery)
2. **Document Coverage**: NIST continuously publishes new documents, so publications released after this version are not included
3. **Link Freshness**: External references may become outdated over time
4. **Chunking**: Chunks from long documents may cut across context boundaries, so an individual example can lack its surrounding context
See `broken_links_catalog.json` in the source repository for detailed broken link tracking.
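For the chunking limitation, one possible mitigation is to regroup examples by `source` and order them by `chunk_id`, so neighbouring chunks can be read together. This sketch assumes `chunk_id` increases in document order within a source, which this card does not confirm. `group_chunks` and the sample texts are illustrative.

```python
from collections import defaultdict


def group_chunks(examples):
    """Map each source to its assistant texts, ordered by chunk_id."""
    by_source = defaultdict(list)
    for example in examples:
        meta = example["metadata"]
        answer = example["messages"][-1]["content"]
        by_source[meta["source"]].append((meta["chunk_id"], answer))
    return {
        source: [text for _, text in sorted(chunks, key=lambda c: c[0])]
        for source, chunks in by_source.items()
    }


# Toy records with placeholder texts, deliberately out of order.
toy_examples = [
    {"messages": [{"role": "assistant", "content": "second chunk"}],
     "metadata": {"source": "NIST SP 800-207", "type": "section", "chunk_id": 1}},
    {"messages": [{"role": "assistant", "content": "first chunk"}],
     "metadata": {"source": "NIST SP 800-207", "type": "section", "chunk_id": 0}},
]
grouped = group_chunks(toy_examples)
print(grouped)
```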
## Changelog
### v1.1 (2025-10-21)
- Added CSWP (Cybersecurity White Papers) series: 23 documents
- Fixed 6,150 broken DOI links via format normalization
- Removed 202 malformed DOIs (double URL prefixes)
- Cataloged 72,698 broken links for future recovery
- Increased dataset from 523,706 to 530,912 examples (+7,206)
- Improved link validation: 124,946 total links processed
- Clean dataset: 0 broken link markers remaining
### v1.0 (2025-10-15)
- Initial release with 523,706 training examples
- 568 NIST documents extracted (FIPS, SP, IR series)
- Published largest NIST cybersecurity dataset on Hugging Face
## Source Code
Full pipeline and scripts: [GitHub Repository](https://github.com/ethanolivertroy/nist-tuned-model)
## Citation
```bibtex
@misc{nist-cybersecurity-training-v1.1,
title={NIST Cybersecurity Training Dataset},
author={Troy, Ethan Oliver},
year={2025},
version={1.1},
url={https://huggingface.co/datasets/ethanolivertroy/nist-cybersecurity-training}
}
```
## License
CC0 1.0 Universal (Public Domain)
All NIST publications are in the public domain.
## Acknowledgments
- NIST Computer Security Resource Center (CSRC)
- Docling PDF extraction framework
- MLX framework for Apple Silicon training
- OpenAI for embedding generation API
---
**Last Updated**: 2025-10-21
**Dataset Version**: 1.1
**Total Examples**: 530,912
**Documents**: 596 NIST publications
Provider: maas
Created: 2025-10-16



