nist-publications-raw
Source: ModelScope community · Updated 2025-12-04 · Indexed 2025-12-06
Download link: https://modelscope.cn/datasets/AI-ModelScope/nist-publications-raw
# NIST Publications - Raw PDFs
**596 NIST cybersecurity publications in original PDF format** - Complete source data for the [nist-cybersecurity-training](https://huggingface.co/datasets/ethanolivertroy/nist-cybersecurity-training) dataset.
## Dataset Description
This dataset contains the **raw, unprocessed PDF files** downloaded from the NIST Computer Security Resource Center (CSRC). These are the exact source documents used to create the NIST cybersecurity training dataset and fine-tune the HackIDLE-NIST-Coder model.
### Contents
- **596 PDF documents** (2.0 GB total)
- **metadata.json** with titles, URLs, and download information
- Covers documents published through October 2025
### Document Series
| Series | Count | Description |
|--------|-------|-------------|
| **FIPS** | ~25 | Federal Information Processing Standards (cryptography) |
| **SP 800** | ~350 | Special Publications - Security guidelines |
| **SP 1800** | ~30 | Special Publications - Practice guides |
| **IR** | ~160 | Interagency/Internal Reports |
| **CSWP** | ~31 | Cybersecurity White Papers |
### Key Documents Included
- **SP 800-53 Rev. 5** - Security and Privacy Controls
- **SP 800-63-4** - Digital Identity Guidelines (July 2025 release)
- **NIST CSF 2.0** - Cybersecurity Framework (CSWP series)
- **SP 800-207** - Zero Trust Architecture
- **SP 800-37 Rev. 2** - Risk Management Framework
- **FIPS 140-3** - Cryptographic Module Validation
- **SP 800-161 Rev. 1** - Supply Chain Risk Management
- **SP 800-171 Rev. 3** - Protecting Controlled Unclassified Information
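
When working with the files it can help to map a document identifier to one of the series in the table above. The following is a minimal sketch, assuming identifiers are written the way they appear in this list (for example `SP 800-53 Rev. 5` or `FIPS 140-3`); the `classify_series` helper and its patterns are illustrative and not part of the dataset.

```python
import re

# One pattern per series from the "Document Series" table.
# "NIST CSF" is mapped to CSWP because the list above places CSF 2.0 in that series.
SERIES_PATTERNS = [
    ("FIPS", re.compile(r"^FIPS\s+\d+")),
    ("SP 1800", re.compile(r"^SP\s+1800-\d+")),
    ("SP 800", re.compile(r"^SP\s+800-\d+")),
    ("IR", re.compile(r"^(NIST\s*)?IR\s+\d+")),
    ("CSWP", re.compile(r"^(CSWP\b|NIST CSF\b)")),
]


def classify_series(identifier: str) -> str:
    """Return the series name for an identifier, or "unknown" if none matches."""
    text = identifier.strip()
    for series, pattern in SERIES_PATTERNS:
        if pattern.search(text):
            return series
    return "unknown"


for doc in ["SP 800-53 Rev. 5", "FIPS 140-3", "NIST CSF 2.0", "SP 800-207"]:
    print(doc, "->", classify_series(doc))
```

Titles in `metadata.json` may not start with the identifier, so this would need adapting before it is applied to the real metadata.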
## Data Pipeline
This raw dataset is part of a complete data processing pipeline:
```
1. Raw PDFs (this dataset)
↓
2. Extraction (Docling + MarkItDown)
↓
3. Training Data Preparation
↓
4. Training Dataset: ethanolivertroy/nist-cybersecurity-training
↓
5. Fine-tuned Models: HackIDLE-NIST-Coder v1.1
```
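
Step 2 of the pipeline names Docling and MarkItDown but does not show how they are invoked. The sketch below shows one way an extraction pass over the PDFs could look, assuming MarkItDown's `MarkItDown().convert(path).text_content` interface. The helper names `markitdown_extract` and `extract_all` are hypothetical and are not taken from the source repository, which may structure this step differently.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def markitdown_extract(pdf_path: Path) -> str:
    """Convert one PDF to Markdown text with MarkItDown (imported lazily)."""
    from markitdown import MarkItDown  # third-party package, assumed installed

    return MarkItDown().convert(str(pdf_path)).text_content


def extract_all(
    pdf_paths: Iterable[Path], extract: Callable[[Path], str]
) -> Dict[str, str]:
    """Run an extractor over each PDF and collect text keyed by file name.

    Files whose extraction raises are reported and skipped, so one bad PDF
    does not abort the whole run.
    """
    texts: Dict[str, str] = {}
    for path in pdf_paths:
        try:
            texts[path.name] = extract(path)
        except Exception as exc:
            print(f"skipped {path.name}: {exc}")
    return texts
```

A call such as `extract_all(Path("data/raw").glob("*.pdf"), markitdown_extract)` would process a local copy of the PDFs. Passing the extractor as an argument keeps the loop independent of the tool, so a Docling-based function could be swapped in.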
## Usage
### Download All PDFs
```python
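import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the whole dataset repo; returns the local snapshot directory
local_dir = Path(snapshot_download(
    repo_id="ethanolivertroy/nist-publications-raw",
    repo_type="dataset",
))

# Access metadata (assumed to sit at the repo root)
with open(local_dir / "metadata.json", "r", encoding="utf-8") as f:
    metadata = json.load(f)

# List all PDFs, searching subfolders as well
pdf_files = sorted(local_dir.rglob("*.pdf"))
print(f"Downloaded {len(pdf_files)} PDFs")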
```
### Download Specific Document
```python
from huggingface_hub import hf_hub_download
# Download a specific PDF
pdf_path = hf_hub_download(
    repo_id="ethanolivertroy/nist-publications-raw",
    filename="Digital Identity Guidelines_ Authentication and Authenticator Management.pdf",
    repo_type="dataset",
)
```
## Reproduction
To reproduce this dataset (download fresh copies from NIST):
```bash
# Clone the source repository
git clone https://github.com/ethanolivertroy/nist-tuned-model
# Run the scraper
cd nist-tuned-model
python src/download_all_nist.py --output data/raw
```
The scraper downloads documents with these filters:
- Series: FIPS, SP, IR, CSWP
- Status: Final (published)
- From: NIST Computer Security Resource Center
## Metadata
The `metadata.json` file contains structured information for each PDF:
```json
{
  "title": "Document title",
  "detail_url": "https://csrc.nist.gov/pubs/...",
  "local_path": "data/raw/filename.pdf",
  "downloaded": true
}
```
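
The record layout above can be used to build a lookup from PDF file name to title. This is a minimal sketch, assuming `metadata.json` holds a JSON array of records shaped like the example (the source shows only a single record, so the top-level structure is an assumption); `load_records` and `downloaded_files` are hypothetical helper names.

```python
import json
from pathlib import PurePosixPath
from typing import Dict, List


def load_records(metadata_path: str) -> List[dict]:
    """Load metadata records; a single object is wrapped in a list."""
    with open(metadata_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]


def downloaded_files(records: List[dict]) -> Dict[str, str]:
    """Map PDF file name -> title for records marked as downloaded."""
    return {
        PurePosixPath(rec["local_path"]).name: rec["title"]
        for rec in records
        if rec.get("downloaded")
    }
```

For example, `downloaded_files(load_records("metadata.json"))` returns a dictionary that can be checked against the PDF names actually present on disk.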
## Related Resources
- **Training Dataset**: [ethanolivertroy/nist-cybersecurity-training](https://huggingface.co/datasets/ethanolivertroy/nist-cybersecurity-training) - 530,912 processed examples
- **MLX Model**: [ethanolivertroy/HackIDLE-NIST-Coder-v1.1-MLX-4bit](https://huggingface.co/ethanolivertroy/HackIDLE-NIST-Coder-v1.1-MLX-4bit)
- **GGUF Model**: [ethanolivertroy/HackIDLE-NIST-Coder-v1.1-GGUF](https://huggingface.co/ethanolivertroy/HackIDLE-NIST-Coder-v1.1-GGUF)
- **Ollama**: [etgohome/hackidle-nist-coder](https://ollama.com/etgohome/hackidle-nist-coder)
## License
**CC0 1.0 Universal (Public Domain)**
All NIST publications are in the public domain and not subject to copyright in the United States. This dataset is released under CC0 for maximum reusability.
## Citation
```bibtex
@misc{nist-publications-raw,
  title={NIST Cybersecurity Publications - Raw PDFs},
  author={Troy, Ethan Oliver},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/ethanolivertroy/nist-publications-raw}}
}
```
## Acknowledgments
- **NIST Computer Security Resource Center** - Original source of all documents
- **NIST Cybersecurity Framework Team** - Framework development
- **NIST Privacy Framework Team** - Privacy guidance
## Dataset Statistics
- **Total Size**: 2.0 GB
- **Document Count**: 596
- **Date Range**: 1977-2025 (primarily 2010-2025)
- **Format**: PDF (various versions)
- **Languages**: English
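
Because the files span several PDF versions, it may be useful to check which version each one declares. A PDF file begins with a header of the form `%PDF-x.y`, so the version can be read without a PDF library. This is a sketch under that assumption; `pdf_version` is a hypothetical helper, and files with a malformed header simply return `None`.

```python
import re
from typing import Optional

# Matches the header marker, e.g. b"%PDF-1.7"
_PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_version(path: str) -> Optional[str]:
    """Return the version declared in the PDF header, or None if absent."""
    with open(path, "rb") as f:
        head = f.read(1024)  # the header is expected near the start of the file
    match = _PDF_HEADER.search(head)
    return match.group(1).decode("ascii") if match else None
```

Note that a document's catalog can override the header version, so this reports only what the header declares.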
## Version History
- **v1.1** (October 2025): Added CSWP series (28 documents), SP 800-63-4
- **v1.0** (Initial release): 568 documents (FIPS, SP 800/1800, IR)
---
**Maintainer**: Ethan Oliver Troy
**Contact**: Via GitHub or HuggingFace profile
**Last Updated**: October 2025
**Provider**: maas
**Created**: 2025-10-23



