davenporten/nrc-regulatory-embeddings
Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/davenporten/nrc-regulatory-embeddings
Resource description:
---
license: mit
task_categories:
- text-retrieval
- question-answering
language:
- en
tags:
- nuclear
- NRC
- regulatory
- RAG
- embeddings
- licensing
pretty_name: NRC Regulatory Embeddings
size_categories:
- 10K<n<100K
---
# NRC Regulatory Embeddings
37,734 embedded text chunks from NRC nuclear regulatory documents, ready for use in RAG pipelines.
Built for the [nrc-licensing-rag](https://github.com/Davenporten/nrc-licensing-rag) project, an AI system for analyzing nuclear Combined License Applications (COLAs).
## Contents
| Source | Documents |
|--------|-----------|
| NUREG-0800 (Standard Review Plan) chapters 1-19 | 2,436 sections |
| 10 CFR Parts 20, 50, 51, 52, 72, 73, 100 | ~504 sections |
| Regulatory Guide Division 1 (1.1-1.262) | 242 guides |
| Regulatory Guide Division 4 (4.1-4.28) | 27 guides |
| **Total chunks** | **37,734** |
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique chunk ID |
| `text` | string | Document chunk text |
| `embedding` | list[float] | 1536-dim OpenAI `text-embedding-3-small` vector |
| `source` | string | Source identifier (e.g. `nureg_0800`, `10cfr50`, `reg_guide`) |
| `document_type` | string | `srp`, `cfr`, or `reg_guide` |
| `document_id` | string | Document identifier |
| `title` | string | Section or guide title |
| `section_id` | string | NRC section number |
| `chapter` | string | Chapter (for SRP documents) |
| `chunk_index` | int | Position of chunk within source document |
| `source_url` | string | NRC.gov URL where available |
| `guide_id` | string | Regulatory Guide number (e.g. `1.1`) |
| `cfr_part` | string | CFR part number |
| `division` | string | Regulatory Guide division |
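To illustrate how `document_id` and `chunk_index` relate, here is a minimal sketch that regroups chunks into per-document text. It operates on plain dictionaries shaped like dataset rows. The `reassemble` helper and the sample rows are hypothetical, and the sketch assumes that `chunk_index` orders chunks within a `document_id` and that chunks do not overlap, neither of which this card confirms.

```python
from collections import defaultdict


def reassemble(rows):
    """Group rows by document_id and join their text in chunk_index order."""
    by_doc = defaultdict(list)
    for row in rows:
        by_doc[row["document_id"]].append(row)
    return {
        doc_id: "\n".join(
            r["text"] for r in sorted(chunks, key=lambda r: r["chunk_index"])
        )
        for doc_id, chunks in by_doc.items()
    }


# Hypothetical rows for illustration only, not actual dataset content.
rows = [
    {"document_id": "doc-a", "chunk_index": 1, "text": "second chunk"},
    {"document_id": "doc-b", "chunk_index": 0, "text": "only chunk"},
    {"document_id": "doc-a", "chunk_index": 0, "text": "first chunk"},
]
docs = reassemble(rows)
```

With the real data, the same helper should accept any iterable of row dictionaries, for example the records of a DataFrame restricted to the `document_id`, `chunk_index`, and `text` columns.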
## Usage
```python
from datasets import load_dataset
ds = load_dataset("davenporten/nrc-regulatory-embeddings")
df = ds["train"].to_pandas()
```
Or load directly with pandas:
```python
import pandas as pd
df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")
```
### Load into ChromaDB
```python
import chromadb
import pandas as pd
df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")
client = chromadb.HttpClient(host="localhost", port=8000)
col = client.get_or_create_collection("regulations")
batch_size = 500
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i+batch_size]
    col.add(
        ids=batch["id"].tolist(),
        documents=batch["text"].tolist(),
        embeddings=batch["embedding"].tolist(),
        metadatas=batch.drop(columns=["id", "text", "embedding"]).to_dict("records"),
    )
```
```
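One possible pitfall when loading metadata: columns such as `guide_id`, `cfr_part`, and `chapter` apply to only one document type, so they may be empty for other rows, and some ChromaDB versions may reject `None` metadata values. Both points are assumptions rather than something this card states. If that happens, a small helper like the hypothetical `clean_metadata` below could strip empty values from each record before it is passed to `col.add`.

```python
import math


def clean_metadata(record):
    """Return a copy of record without None or NaN values."""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned[key] = value
    return cleaned


# Hypothetical record for illustration; real rows may differ.
record = {
    "source": "10cfr50",
    "document_type": "cfr",
    "cfr_part": "50",
    "guide_id": None,
    "chapter": float("nan"),
    "chunk_index": 3,
}
cleaned = clean_metadata(record)
```

In the loading loop, this would mean building the metadata list as `[clean_metadata(r) for r in batch.drop(columns=["id", "text", "embedding"]).to_dict("records")]`.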
## Embeddings
Generated with OpenAI `text-embedding-3-small` (1536 dimensions). To search these vectors without re-embedding the documents, embed your queries with the same model.
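As a rough sketch of what retrieval over the `embedding` column involves, the example below ranks rows by cosine similarity to a query vector. It uses toy 3-dimensional vectors in place of real 1536-dimensional embeddings. The `cosine_similarity` and `top_k` helpers and the sample rows are hypothetical, and in practice the query vector would need to come from the same embedding model as the stored vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=2):
    """Return the ids of the k rows most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, row["embedding"]), row["id"]) for row in rows
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [row_id for _, row_id in scored[:k]]


# Toy vectors for illustration only, not actual dataset content.
rows = [
    {"id": "chunk-1", "embedding": [1.0, 0.0, 0.0]},
    {"id": "chunk-2", "embedding": [0.0, 1.0, 0.0]},
    {"id": "chunk-3", "embedding": [0.9, 0.1, 0.0]},
]
query_vec = [1.0, 0.05, 0.0]
result = top_k(query_vec, rows, k=2)
```

A brute-force scan like this is mainly useful for spot checks; for all 37,734 chunks, a vector store such as the ChromaDB setup above is likely the more practical choice.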
## License
MIT. The documents are sourced from publicly available NRC publications.
Provider:
davenporten



