
davenporten/nrc-regulatory-embeddings

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/davenporten/nrc-regulatory-embeddings
Resource description:
---
license: mit
task_categories:
- text-retrieval
- question-answering
language:
- en
tags:
- nuclear
- NRC
- regulatory
- RAG
- embeddings
- licensing
pretty_name: NRC Regulatory Embeddings
size_categories:
- 10K<n<100K
---

# NRC Regulatory Embeddings

37,734 embedded chunks of NRC nuclear regulatory documents, ready for use in RAG pipelines.

Built for the [nrc-licensing-rag](https://github.com/Davenporten/nrc-licensing-rag) project, an AI system for analyzing nuclear Combined License Applications (COLAs).

## Contents

| Source | Documents |
|--------|-----------|
| NUREG-0800 (Standard Review Plan) chapters 1-19 | 2,436 sections |
| 10 CFR Parts 20, 50, 51, 52, 72, 73, 100 | ~504 sections |
| Regulatory Guide Division 1 (1.1-1.262) | 242 guides |
| Regulatory Guide Division 4 (4.1-4.28) | 27 guides |
| **Total chunks** | **37,734** |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique chunk ID |
| `text` | string | Document chunk text |
| `embedding` | list[float] | 1536-dim OpenAI `text-embedding-3-small` vector |
| `source` | string | Source identifier (e.g. `nureg_0800`, `10cfr50`, `reg_guide`) |
| `document_type` | string | `srp`, `cfr`, or `reg_guide` |
| `document_id` | string | Document identifier |
| `title` | string | Section or guide title |
| `section_id` | string | NRC section number |
| `chapter` | string | Chapter (for SRP documents) |
| `chunk_index` | int | Position of chunk within source document |
| `source_url` | string | NRC.gov URL where available |
| `guide_id` | string | Regulatory Guide number (e.g. `1.1`) |
| `cfr_part` | string | CFR part number |
| `division` | string | Regulatory Guide division |

## Usage

```python
from datasets import load_dataset

# Download the dataset and convert the train split to a DataFrame
ds = load_dataset("davenporten/nrc-regulatory-embeddings")
df = ds["train"].to_pandas()
```

Or load directly with pandas:

```python
import pandas as pd

# Read the parquet file straight from the Hugging Face Hub
df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")
```

### Load into ChromaDB

```python
import chromadb
import pandas as pd

df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")

# Connect to a ChromaDB server running on localhost:8000
client = chromadb.HttpClient(host="localhost", port=8000)
col = client.get_or_create_collection("regulations")

# Insert in batches; every column other than id, text, and embedding becomes metadata
batch_size = 500
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i+batch_size]
    col.add(
        ids=batch["id"].tolist(),
        documents=batch["text"].tolist(),
        embeddings=batch["embedding"].tolist(),
        metadatas=batch.drop(columns=["id", "text", "embedding"]).to_dict("records"),
    )
```

## Embeddings

Generated with OpenAI `text-embedding-3-small` (1536 dimensions). To query the dataset without re-embedding the documents, embed your queries with the same model.

## License

MIT. The documents are sourced from publicly available NRC publications.
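
## Query sketch

The Embeddings note implies that a query vector from `text-embedding-3-small` can be compared directly against the stored `embedding` column. The following is a minimal sketch of that comparison using cosine similarity with an optional `document_type` filter. It is not taken from the nrc-licensing-rag project: the function names `cosine_similarity` and `top_k` are hypothetical, and the 3-dimensional toy vectors stand in for real 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=3, document_type=None):
    """Return (score, id) pairs for the k rows most similar to query_vec.

    rows is an iterable of dicts with 'id', 'embedding', and 'document_type'
    keys, mirroring the dataset schema. If document_type is given, only rows
    of that type are scored.
    """
    candidates = [
        r for r in rows
        if document_type is None or r["document_type"] == document_type
    ]
    scored = [(cosine_similarity(query_vec, r["embedding"]), r["id"]) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy rows (hypothetical values); real rows would come from df.to_dict("records")
rows = [
    {"id": "a", "embedding": [1.0, 0.0, 0.0], "document_type": "srp"},
    {"id": "b", "embedding": [0.0, 1.0, 0.0], "document_type": "cfr"},
    {"id": "c", "embedding": [0.9, 0.1, 0.0], "document_type": "reg_guide"},
]

# In practice this vector would be produced by text-embedding-3-small
query = [1.0, 0.0, 0.0]
results = top_k(query, rows, k=2)
cfr_only = top_k(query, rows, k=1, document_type="cfr")
```

Pure Python keeps the sketch dependency-free; on the full 37,734 rows a vectorized NumPy computation or a vector store such as ChromaDB would likely be a better fit.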
Provider:
davenporten