
nist-publications-raw

ModelScope Community. Updated 2025-12-04. Indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/nist-publications-raw
Resource description:
# NIST Publications - Raw PDFs

**596 NIST cybersecurity publications in original PDF format** - Complete source data for the [nist-cybersecurity-training](https://huggingface.co/datasets/ethanolivertroy/nist-cybersecurity-training) dataset.

## Dataset Description

This dataset contains the **raw, unprocessed PDF files** downloaded from the NIST Computer Security Resource Center (CSRC). These are the exact source documents used to create the NIST cybersecurity training dataset and fine-tune the HackIDLE-NIST-Coder model.

### Contents

- **596 PDF documents** (2.0 GB total)
- **metadata.json** with titles, URLs, and download information
- Covers documents published through October 2025

### Document Series

| Series | Count | Description |
|--------|-------|-------------|
| **FIPS** | ~25 | Federal Information Processing Standards (cryptography) |
| **SP 800** | ~350 | Special Publications - Security guidelines |
| **SP 1800** | ~30 | Special Publications - Practice guides |
| **IR** | ~160 | Interagency/Internal Reports |
| **CSWP** | ~31 | Cybersecurity White Papers |

### Key Documents Included

- **SP 800-53 Rev. 5** - Security and Privacy Controls
- **SP 800-63-4** - Digital Identity Guidelines (July 2025 release)
- **NIST CSF 2.0** - Cybersecurity Framework (CSWP series)
- **SP 800-207** - Zero Trust Architecture
- **SP 800-37 Rev. 2** - Risk Management Framework
- **FIPS 140-3** - Cryptographic Module Validation
- **SP 800-161 Rev. 1** - Supply Chain Risk Management
- **SP 800-171 Rev. 3** - Protecting Controlled Unclassified Information

## Data Pipeline

This raw dataset is the first stage of a complete data processing pipeline:

```
1. Raw PDFs (this dataset)
   ↓
2. Extraction (Docling + MarkItDown)
   ↓
3. Training Data Preparation
   ↓
4. Training Dataset: ethanolivertroy/nist-cybersecurity-training
   ↓
5. Fine-tuned Models: HackIDLE-NIST-Coder v1.1
```

## Usage

### Download All PDFs

```python
import json
import os

from huggingface_hub import snapshot_download

# Download every file in the dataset repository; returns the local directory.
# (load_dataset does not place files in the working directory, so the
# repository snapshot is fetched directly instead.)
local_dir = snapshot_download(
    repo_id="ethanolivertroy/nist-publications-raw",
    repo_type="dataset",
)

# Access metadata
with open(os.path.join(local_dir, "metadata.json"), "r", encoding="utf-8") as f:
    metadata = json.load(f)

# List all PDFs (assumes the PDFs sit at the repository root)
pdf_files = [name for name in os.listdir(local_dir) if name.endswith(".pdf")]
print(f"Downloaded {len(pdf_files)} PDFs")
```

### Download Specific Document

```python
from huggingface_hub import hf_hub_download

# Download a specific PDF; returns the path of the cached local copy
pdf_path = hf_hub_download(
    repo_id="ethanolivertroy/nist-publications-raw",
    filename="Digital Identity Guidelines_ Authentication and Authenticator Management.pdf",
    repo_type="dataset",
)
```

## Reproduction

To reproduce this dataset (download fresh copies from NIST):

```bash
# Clone the source repository
git clone https://github.com/ethanolivertroy/nist-tuned-model

# Run the scraper
cd nist-tuned-model
python src/download_all_nist.py --output data/raw
```

The scraper downloads documents with these filters:

- Series: FIPS, SP, IR, CSWP
- Status: Final (published)
- From: NIST Computer Security Resource Center

## Metadata

The `metadata.json` file contains structured information for each PDF:

```json
{
  "title": "Document title",
  "detail_url": "https://csrc.nist.gov/pubs/...",
  "local_path": "data/raw/filename.pdf",
  "downloaded": true
}
```

## Related Resources

- **Training Dataset**: [ethanolivertroy/nist-cybersecurity-training](https://huggingface.co/datasets/ethanolivertroy/nist-cybersecurity-training) - 530,912 processed examples
- **MLX Model**: [ethanolivertroy/HackIDLE-NIST-Coder-v1.1-MLX-4bit](https://huggingface.co/ethanolivertroy/HackIDLE-NIST-Coder-v1.1-MLX-4bit)
- **GGUF Model**: [ethanolivertroy/HackIDLE-NIST-Coder-v1.1-GGUF](https://huggingface.co/ethanolivertroy/HackIDLE-NIST-Coder-v1.1-GGUF)
- **Ollama**: [etgohome/hackidle-nist-coder](https://ollama.com/etgohome/hackidle-nist-coder)

## License

**CC0 1.0 Universal (Public Domain)**

All NIST publications are in the public domain and not subject to copyright in the United States. This dataset is released under CC0 for maximum reusability.

## Citation

```bibtex
@misc{nist-publications-raw,
  title={NIST Cybersecurity Publications - Raw PDFs},
  author={Troy, Ethan Oliver},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/ethanolivertroy/nist-publications-raw}}
}
```

## Acknowledgments

- **NIST Computer Security Resource Center** - Original source of all documents
- **NIST Cybersecurity Framework Team** - Framework development
- **NIST Privacy Framework Team** - Privacy guidance

## Dataset Statistics

- **Total Size**: 2.0 GB
- **Document Count**: 596
- **Date Range**: 1977-2025 (primarily 2010-2025)
- **Format**: PDF (various versions)
- **Languages**: English

## Version History

- **v1.1** (October 2025): Added CSWP series (28 documents), SP 800-63-4
- **v1.0** (Initial release): 568 documents (FIPS, SP 800/1800, IR)

---

**Maintainer**: Ethan Oliver Troy

**Contact**: Via GitHub or HuggingFace profile

**Last Updated**: October 2025
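
As a rough illustration of how the `metadata.json` records described under Metadata might be consumed, the sketch below splits records into downloaded file names and titles that were not downloaded. It assumes the file holds a JSON list of objects in the documented shape; the top-level structure is not stated in this README, and both the helper name `summarize_metadata` and the sample records are hypothetical.

```python
import json
from pathlib import PurePosixPath


def summarize_metadata(records):
    """Split metadata records into downloaded file names and missing titles."""
    downloaded, missing = [], []
    for record in records:
        if record.get("downloaded"):
            # Keep only the file name from paths such as "data/raw/filename.pdf".
            downloaded.append(PurePosixPath(record["local_path"]).name)
        else:
            missing.append(record.get("title", "<untitled>"))
    return downloaded, missing


# Hypothetical records in the documented shape (not real entries from the dataset).
sample_json = """
[
  {"title": "Example Document A", "detail_url": "https://csrc.nist.gov/pubs/...",
   "local_path": "data/raw/example-a.pdf", "downloaded": true},
  {"title": "Example Document B", "detail_url": "https://csrc.nist.gov/pubs/...",
   "local_path": "data/raw/example-b.pdf", "downloaded": false}
]
"""

downloaded, missing = summarize_metadata(json.loads(sample_json))
print(f"{len(downloaded)} downloaded, {len(missing)} missing")
```

Taking only the file name from `local_path` matters because the recorded path reflects the scraper's output directory (`data/raw/`), which need not match where the files end up after download.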

Provider:
maas
Created:
2025-10-23