theodi/ndl-core-rag-index
Source: Hugging Face · Updated: 2026-01-24 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/theodi/ndl-core-rag-index
Resource description:
---
language:
- en
- cy
tags:
- Rag
pretty_name: NDL Core RAG Index
size_categories:
- 100M<n<1B
---
# NDL Core RAG Index
This dataset contains a FAISS index and associated chunk metadata to support
retrieval-augmented generation (RAG) use cases on [ndl-core-corpus](https://huggingface.co/datasets/theodi/ndl-core-corpus).
---
## Overview
- Model: sentence-transformers/all-MiniLM-L6-v2
- Dimension: 384
- Normalisation: L2 (embeddings are scaled to unit length)
- Similarity: cosine, computed as the inner product of the L2-normalised vectors
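The relationship between the normalisation and similarity settings above can be checked with a few lines of plain Python. This is only an illustrative sketch with toy 2-dimensional vectors rather than real 384-dimensional embeddings; the helper names are hypothetical and are not part of this dataset or of any library.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Textbook cosine similarity, computed from the raw vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


# Toy vectors standing in for embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product of the unit-length vectors matches cosine of the raw vectors.
score = inner_product(l2_normalise(a), l2_normalise(b))
```

Because the stored vectors are unit length, an inner-product search returns the same ranking as a cosine-similarity search, which is why the two terms appear together in the configuration.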
### Chunking
- Strategy: recursive character-based chunking
- Chunk size: 800 characters
- Overlap: 100 characters
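To make the size and overlap parameters concrete, the sketch below uses a plain fixed-size sliding window. This is a deliberate simplification: recursive character-based chunking also prefers to break on separators such as paragraph or line boundaries, which is not reproduced here. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into fixed-size windows that share `overlap` characters.

    Simplified sliding window: it ignores separator preferences and only
    illustrates how chunk size and overlap interact.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # 700 characters with the documented settings
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
        start += step
    return chunks
```

With the documented settings each new chunk starts 700 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.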
### Index–Metadata Alignment
The FAISS index (`index.faiss`) and the chunk metadata file
(`data/ndl_core_rag_index.parquet`) are **strictly index-aligned**.
This means:
- The *n-th* embedding vector in `index.faiss` corresponds exactly to
the *n-th* row in `data/ndl_core_rag_index.parquet`
- Retrieved FAISS indices can be used directly to look up chunk text,
source identifiers, and metadata in the parquet file
This guarantees deterministic and reliable mapping from similarity search
results back to their original source records.
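A minimal sketch of how the alignment might be used in practice is shown below. It assumes `index.faiss` and `data/ndl_core_rag_index.parquet` have already been downloaded to the working directory under those relative paths, and that `faiss`, `pandas`, and `sentence-transformers` are installed. The function names `align_hits` and `search` are hypothetical, and no parquet column names are assumed: each hit simply carries the whole metadata row.

```python
from typing import Any, Callable, Dict, List, Sequence


def align_hits(ids: Sequence[int], scores: Sequence[float],
               get_row: Callable[[int], Any]) -> List[Dict[str, Any]]:
    """Map FAISS result positions to metadata rows.

    FAISS pads the result with -1 when fewer than k neighbours exist,
    so negative ids are skipped.
    """
    hits = []
    for row_id, score in zip(ids, scores):
        if row_id < 0:
            continue
        hits.append({"row": int(row_id), "score": float(score),
                     "record": get_row(int(row_id))})
    return hits


def search(query: str, k: int = 5,
           index_path: str = "index.faiss",
           parquet_path: str = "data/ndl_core_rag_index.parquet"):
    """Embed a query, search the FAISS index, and return aligned rows."""
    # Heavy dependencies are imported lazily so align_hits stays usable alone.
    import faiss
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    index = faiss.read_index(index_path)
    meta = pd.read_parquet(parquet_path)
    if index.ntotal != len(meta):
        raise ValueError("index and metadata are not the same length")

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # Normalise the query so inner product corresponds to cosine similarity.
    vec = model.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(vec, k)
    return align_hits(ids[0], scores[0], lambda i: meta.iloc[i].to_dict())
```

Calling `search("flood risk in Wales")` (an arbitrary example query) would return up to five dictionaries, each holding the row position, the similarity score, and the corresponding parquet row. Loading the whole parquet file into memory may not be practical for an index of this size, so a production lookup would probably read only the rows it needs.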
---
## LanceDB Search Index
A LanceDB-based search index is also included, so that NDL Core datasets can be found by topic and then downloaded. It uses the same `all-MiniLM-L6-v2` model. Each embedding vector is computed from the concatenation of a record's title, its description, and the first 500 characters of its text. The LanceDB index stores the full records, so search results return complete records.
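The embedding input described above could be assembled roughly as follows. This is a sketch under stated assumptions: the dictionary keys `title`, `description`, and `text`, the newline separator, and the choice to skip empty fields are all guesses for illustration, since the exact field names and joining rule are not documented here. `build_embedding_text` is a hypothetical helper.

```python
def build_embedding_text(record, text_prefix_chars=500, separator="\n"):
    """Concatenate title, description, and a prefix of the text.

    Key names, separator, and the skipping of empty fields are assumptions.
    """
    parts = [
        record.get("title") or "",
        record.get("description") or "",
        (record.get("text") or "")[:text_prefix_chars],
    ]
    # Drop empty fields so the separator is not repeated needlessly.
    return separator.join(part for part in parts if part)
```

The resulting string would then be passed to the same `all-MiniLM-L6-v2` model to produce the vector stored alongside the full record.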
## Source data
Chunks reference records in the source dataset:
[ndl-core-corpus](https://huggingface.co/datasets/theodi/ndl-core-corpus)
See `rag_config.json` for full, machine-readable configuration.
---
## Example Application
This index is used in a live retrieval-augmented chat application:
🔗 **NDL Core RAG Chat**
https://huggingface.co/spaces/theodi/ndl-core-rag-chat
The application demonstrates:
- Semantic retrieval over UK public sector data
- Deterministic citation of source records
- End-to-end RAG using the published FAISS index and metadata
---
Provider:
theodi



