cmmc-training-core
ModelScope community | Updated: 2025-12-05 | Indexed: 2025-12-06
Download link:
https://modelscope.cn/datasets/AI-ModelScope/cmmc-training-core
# CMMC Training Dataset - Core Variant
## Dataset Description
This is the **Core variant** of the CMMC (Cybersecurity Maturity Model Certification) training dataset, containing 1,244 high-quality training examples derived from the most essential NIST cybersecurity publications for CMMC compliance.
### Dataset Characteristics
- **Total Examples**: 1,244 (995 train / 249 validation)
- **Source Documents**: 14 foundational NIST publications
- **CMMC Levels Covered**: Level 1, Level 2, Level 3
- **CMMC Domains**: All 17 domains
- **Format**: JSONL with chat-formatted messages
- **Embeddings**: 1536-dimensional vectors (OpenAI text-embedding-3-small)
- **License**: Public Domain (NIST documents are US Government works)
## What Makes This "Core"?
The Core variant focuses on the **essential foundation** documents that form the basis of CMMC:
### Foundation Documents (14 total)
**Primary CMMC Requirements:**
- **NIST SP 800-171 Rev 3**: Protecting Controlled Unclassified Information (CMMC Level 2)
- **NIST SP 800-171A Rev 3**: Assessment Procedures for SP 800-171
- **NIST SP 800-171B**: Protecting CUI in Nonfederal Systems and Organizations
- **NIST SP 800-172 Rev 3**: Enhanced Security for CUI (CMMC Level 3)
- **NIST SP 800-172A**: Assessment Procedures for SP 800-172
**Master Control Catalog:**
- **NIST SP 800-53 Rev 5**: Security and Privacy Controls
**Supplementary Guidance:**
- **NIST SP 800-37 Rev 2**: Risk Management Framework
This core set provides comprehensive coverage of CMMC requirements across all 17 domains and 3 maturity levels.
## CMMC Level Distribution
```
Level 3 (Expert):   579 examples (46.5%)
All Levels:         561 examples (45.1%)
Level 2 (Advanced): 104 examples (8.4%)
```
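A distribution like the one above could be recomputed from the `cmmc_level` field in each record's `metadata`. This is a minimal sketch, assuming every record carries that field; the helper name `level_distribution` and the sample values (including the `"all"` label) are illustrative and not taken from the dataset.

```python
from collections import Counter


def level_distribution(records):
    """Return {level: (count, percent)} for a list of dataset records."""
    counts = Counter(r["metadata"]["cmmc_level"] for r in records)
    total = sum(counts.values())
    return {
        level: (n, round(100 * n / total, 1))
        for level, n in counts.most_common()
    }


# Tiny hand-made sample; the "all" label is a hypothetical placeholder.
sample = [
    {"metadata": {"cmmc_level": "3"}},
    {"metadata": {"cmmc_level": "3"}},
    {"metadata": {"cmmc_level": "all"}},
    {"metadata": {"cmmc_level": "2"}},
]
dist = level_distribution(sample)
for level, (n, pct) in dist.items():
    print(f"Level {level}: {n} examples ({pct}%)")
```

Passing the parsed contents of `train.jsonl` and the validation file instead of `sample` would give the per-split counts.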
## CMMC Domain Coverage
All 17 CMMC domains are represented (567 examples each):
- Access Control (AC)
- Awareness and Training (AT)
- Audit and Accountability (AU)
- Configuration Management (CM)
- Identification and Authentication (IA)
- Incident Response (IR)
- Maintenance (MA)
- Media Protection (MP)
- Personnel Security (PS)
- Physical Protection (PE)
- Risk Assessment (RA)
- Security Assessment (CA)
- System and Communications Protection (SC)
- System and Information Integrity (SI)
- System and Services Acquisition (SA)
- Planning (PL)
- Supply Chain Risk Management (SR)
**Note**: Domain counts represent the number of examples tagged with each domain. Since examples can be tagged with multiple domains, the sum of domain counts (9,639) exceeds the total number of examples (1,244).
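The totals in the note can be checked with a little arithmetic, which also shows that each example carries between seven and eight domain tags on average. The variable names below are illustrative.

```python
num_domains = 17
examples_per_domain = 567
total_examples = 1244

# Sum of per-domain counts, as quoted in the note
total_domain_tags = num_domains * examples_per_domain

# Average number of domain tags attached to one example
avg_domains_per_example = total_domain_tags / total_examples

print(total_domain_tags)                   # 9639
print(round(avg_domains_per_example, 2))   # 7.75
```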
## Dataset Structure
### JSONL Training Files
Each example follows the chat format:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a cybersecurity expert specializing in CMMC (Cybersecurity Maturity Model Certification) and NIST frameworks..."
    },
    {
      "role": "user",
      "content": "What is the purpose of CMMC Level 2 requirement 3.1.1?"
    },
    {
      "role": "assistant",
      "content": "According to NIST SP 800-171 R3, control 3.1.1 (Access Control) requires..."
    }
  ],
  "metadata": {
    "source": "NIST SP 800-171 R3",
    "cmmc_level": "2",
    "cmmc_domain": "Access Control",
    "cmmc_practice_id": "AC.L2-3.1.1",
    "nist_control": "3.1.1",
    "type": "cmmc_requirement"
  }
}
```
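Before fine-tuning, it can help to confirm that each line matches this layout. The sketch below assumes every record has exactly one system, one user, and one assistant message in that order, as in the sample; the dataset may contain other shapes, and `check_record` is a hypothetical helper rather than part of the dataset tooling.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def check_record(line):
    """Parse one JSONL line and verify it matches the layout shown above."""
    record = json.loads(line)
    roles = [m["role"] for m in record["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected role sequence: {roles}")
    if not all(isinstance(m["content"], str) for m in record["messages"]):
        raise ValueError("message content must be a string")
    if not isinstance(record.get("metadata"), dict):
        raise ValueError("missing metadata object")
    return record


# Hand-written line in the same shape as the sample record
line = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a cybersecurity expert..."},
        {"role": "user", "content": "What is the purpose of requirement 3.1.1?"},
        {"role": "assistant", "content": "Control 3.1.1 requires..."},
    ],
    "metadata": {"source": "NIST SP 800-171 R3", "cmmc_level": "2"},
})
record = check_record(line)
print(record["metadata"]["source"])
```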
### Vector Embeddings
Pre-computed embeddings using OpenAI's `text-embedding-3-small` model:
- **Format**: Parquet files with 1536-dimensional vectors
- **Files**: `embeddings_train.parquet`, `embeddings_valid.parquet`
- **Size**: 15.3 MB total (12.1 MB train + 3.2 MB validation)
- **Cost**: $0.01 (330,784 tokens processed)
### FAISS Indexes
Ready-to-use vector similarity search indexes:
- **L2 distance indexes**: `faiss_train_l2.index`, `faiss_valid_l2.index`
- **Cosine similarity indexes**: `faiss_train_cosine.index`, `faiss_valid_cosine.index`
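The two index families rank neighbors by different measures, but for unit-length vectors they agree, because squared L2 distance equals `2 - 2 * cosine`. The pure-Python sketch below illustrates that relationship. Whether the shipped cosine indexes store normalized vectors is an assumption worth verifying, and the function names are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([2.0, 0.5, 1.0])

cos = cosine_similarity(a, b)
dist_sq = squared_l2(a, b)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(math.isclose(dist_sq, 2 - 2 * cos))
```

In practice this means the L2 and cosine indexes should return the same neighbor ordering whenever both the stored vectors and the query are normalized.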
## Q&A Generation Strategies
Examples were generated using 5 complementary strategies:
1. **Section-based Q&A**: Questions from document sections
2. **Control-based Q&A**: NIST control requirements (3.1.1 format)
3. **CMMC-specific Q&A**: Level-focused questions (L1/L2/L3)
4. **Domain-specific Q&A**: Questions per CMMC domain
5. **Semantic chunking**: General content with context preservation
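Control-based and level-focused examples carry identifiers such as `AC.L2-3.1.1` in the `cmmc_practice_id` metadata field. The sketch below shows how such an identifier could be split into its parts, assuming the `<domain>.L<level>-<control>` pattern seen in the sample record. The regular expression and the `parse_practice_id` helper are hypothetical, and identifiers in any other format would raise `ValueError`.

```python
import re

# Two-letter domain, level 1-3, then a dotted control number such as 3.1.1
PRACTICE_ID = re.compile(
    r"^(?P<domain>[A-Z]{2})\.L(?P<level>[1-3])-(?P<control>\d+(?:\.\d+)+)$"
)


def parse_practice_id(practice_id):
    """Split an ID like 'AC.L2-3.1.1' into domain, level, and control number."""
    match = PRACTICE_ID.match(practice_id)
    if match is None:
        raise ValueError(f"unrecognized practice ID: {practice_id!r}")
    return {
        "domain": match["domain"],
        "level": int(match["level"]),
        "control": match["control"],
    }


parsed = parse_practice_id("AC.L2-3.1.1")
print(parsed)  # {'domain': 'AC', 'level': 2, 'control': '3.1.1'}
```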
## Use Cases
This Core dataset is ideal for:
- **Fine-tuning LLMs** on CMMC compliance requirements
- **Building CMMC chatbots** for compliance guidance
- **RAG systems** for CMMC documentation
- **Semantic search** across CMMC controls
- **Training materials** for CMMC assessors
- **Compliance automation** tools
## Dataset Statistics
```
Source Documents:       14
Total Examples:         1,244
Training Examples:      995 (80%)
Validation Examples:    249 (20%)
Avg Example Length:     ~266 tokens
Total Tokens Embedded:  330,784
Embedding Cost:         $0.01 USD
```
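These figures are mutually consistent, as a quick calculation shows. The per-token price below is an assumed list price for `text-embedding-3-small` and is not stated in this README, so check current pricing before relying on it.

```python
total_examples = 1244
train_examples = 995
valid_examples = 249
total_tokens = 330_784

# Assumed price in USD per 1M tokens; not taken from this README
price_per_million_tokens = 0.02

train_share = train_examples / total_examples
avg_tokens = total_tokens / total_examples
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens

print(f"train share: {train_share:.1%}")
print(f"avg tokens per example: {avg_tokens:.0f}")
print(f"embedding cost: ${cost_usd:.4f}")
```

Under that assumed price the cost comes to well under one cent, which rounds to the $0.01 quoted above.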
## Quick Start
### Load JSONL Data
```python
import json

# Load training data: one JSON object per line, skipping blank lines
with open('train.jsonl', 'r', encoding='utf-8') as f:
    train_data = [json.loads(line) for line in f if line.strip()]

# Example: access the first training example
print(train_data[0]['messages'])
print(train_data[0]['metadata'])
```
### Load Embeddings
```python
import numpy as np
import pandas as pd

# Load embeddings
df = pd.read_parquet('embeddings_train.parquet')

# Stack the per-row vectors into one float32 array (the dtype FAISS expects)
embeddings = np.vstack(df['embedding'].values).astype('float32')
texts = df['text'].tolist()
print(f"Embeddings shape: {embeddings.shape}")  # (995, 1536)
```
### Use FAISS Index
```python
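import faiss
import numpy as np

# Load FAISS index
index = faiss.read_index('faiss_train_cosine.index')

# Query vector: 1536-dim, produced by the same embedding model. A stored training
# embedding (`embeddings` from the previous snippet) stands in for a real query.
query = np.array(embeddings[:1], dtype='float32')  # copy with shape (1, 1536)
# Assumption: the cosine index holds unit-length vectors, so normalize the query too
faiss.normalize_L2(query)

k = 5  # number of results
distances, indices = index.search(query, k)

# Get similar texts (`texts` also comes from the previous snippet)
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {texts[idx][:100]}...")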
```
## Related Datasets
This is part of a family of 3 CMMC datasets:
- **Core** (this dataset): 14 docs, 1.2K examples - Essential CMMC foundation
- **Balanced**: 71 docs, 2.8K examples - Domain-balanced coverage
- **Comprehensive**: 381 docs, 11.3K examples - Complete NIST CMMC library
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{cmmc_core_2025,
title={CMMC Training Dataset - Core Variant},
author={Troy, Ethan Oliver},
year={2025},
publisher={HuggingFace},
note={Derived from NIST Special Publications (Public Domain)}
}
```
## License
**Public Domain** - This dataset is derived from NIST Special Publications, which are works of the US Government and not subject to copyright protection in the United States.
## Acknowledgments
This dataset is built from publications by the National Institute of Standards and Technology (NIST), Computer Security Resource Center.
## Dataset Version
- **Version**: 1.0
- **Created**: 2025
- **Source**: NIST CSRC Publications
- **Processing**: Docling + custom CMMC-aware data preparation
## Contact
For questions or issues, please open an issue on the GitHub repository.
Provider: maas
Created: 2025-10-29



