davenporten/nrc-regulatory-embeddings

Name: davenporten/nrc-regulatory-embeddings
Creator: davenporten
Published: 2026-04-08 02:34:13
License: No description available

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/davenporten/nrc-regulatory-embeddings

Resource description:

---
license: mit
task_categories:
- text-retrieval
- question-answering
language:
- en
tags:
- nuclear
- NRC
- regulatory
- RAG
- embeddings
- licensing
pretty_name: NRC Regulatory Embeddings
size_categories:
- 10K<n<100K
---

# NRC Regulatory Embeddings

37,734 embedded text chunks from NRC nuclear regulatory documents, ready for use in RAG pipelines.

Built for the [nrc-licensing-rag](https://github.com/Davenporten/nrc-licensing-rag) project, an AI system for analyzing nuclear Combined License Applications (COLAs).

## Contents

| Source | Documents |
|--------|-----------|
| NUREG-0800 (Standard Review Plan) chapters 1-19 | 2,436 sections |
| 10 CFR Parts 20, 50, 51, 52, 72, 73, 100 | ~504 sections |
| Regulatory Guide Division 1 (1.1-1.262) | 242 guides |
| Regulatory Guide Division 4 (4.1-4.28) | 27 guides |
| **Total chunks** | **37,734** |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique chunk ID |
| `text` | string | Document chunk text |
| `embedding` | list[float] | 1536-dim OpenAI `text-embedding-3-small` vector |
| `source` | string | Source identifier (e.g. `nureg_0800`, `10cfr50`, `reg_guide`) |
| `document_type` | string | `srp`, `cfr`, or `reg_guide` |
| `document_id` | string | Document identifier |
| `title` | string | Section or guide title |
| `section_id` | string | NRC section number |
| `chapter` | string | Chapter (for SRP documents) |
| `chunk_index` | int | Position of chunk within source document |
| `source_url` | string | NRC.gov URL where available |
| `guide_id` | string | Regulatory Guide number (e.g. `1.1`) |
| `cfr_part` | string | CFR part number |
| `division` | string | Regulatory Guide division |

## Usage

```python
from datasets import load_dataset

# Download the dataset and convert the train split to a DataFrame
ds = load_dataset("davenporten/nrc-regulatory-embeddings")
df = ds["train"].to_pandas()
```

Or load directly with pandas:

```python
import pandas as pd

# Read the Parquet file straight from the Hugging Face Hub
df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")
```

### Load into ChromaDB

```python
import chromadb
import pandas as pd

df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")

# Connect to a ChromaDB server running on localhost:8000
client = chromadb.HttpClient(host="localhost", port=8000)
col = client.get_or_create_collection("regulations")

# Insert in batches; the precomputed embeddings are passed through as-is
batch_size = 500
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i+batch_size]
    col.add(
        ids=batch["id"].tolist(),
        documents=batch["text"].tolist(),
        embeddings=batch["embedding"].tolist(),
        # Every remaining column becomes per-chunk metadata
        metadatas=batch.drop(columns=["id", "text", "embedding"]).to_dict("records"),
    )
```

## Embeddings

Generated with OpenAI `text-embedding-3-small` (1536 dimensions). To query without re-embedding the documents, embed your queries with the same model.

## License

MIT. The documents are sourced from publicly available NRC publications.
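A note on the ChromaDB loading example above: columns such as `chapter`, `guide_id`, `cfr_part`, and `division` each apply to only one document type, so they may be empty for other rows (an assumption, not verified against the data). Some ChromaDB versions reject metadata values that are `None` or NaN. If `col.add` fails for that reason, one possible workaround is to drop those keys before inserting. The helper `clean_metadata` and the sample rows below are hypothetical and only illustrate the idea:

```python
import math


def clean_metadata(record):
    """Return a copy of `record` without None or NaN values (hypothetical helper)."""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            continue  # missing string value
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN, pandas' usual marker for missing data
        cleaned[key] = value
    return cleaned


# Made-up rows shaped like df.drop(columns=[...]).to_dict("records")
records = [
    {"document_type": "srp", "chapter": "3", "guide_id": None, "cfr_part": float("nan")},
    {"document_type": "reg_guide", "chapter": None, "guide_id": "1.1", "division": "1"},
]
metadatas = [clean_metadata(r) for r in records]
```

In the loading loop, the same helper could wrap each record of the `to_dict("records")` result before it is passed as `metadatas`.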

Provider:

davenporten
