PaczkiLives/daemon-wiki-faiss
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/PaczkiLives/daemon-wiki-faiss
Resource description:
---
license: mit
task_categories:
- feature-extraction
- text-retrieval
language:
- en
pretty_name: Daemon Wiki FAISS Index
size_categories:
- 10M<n<100M
tags:
- faiss
- wikipedia
- embeddings
- retrieval
---
# Daemon Wiki FAISS Index
Pre-built FAISS IVFPQ index and metadata for the [Daemon](https://github.com/lukehalleran/Daemon) conversational RAG system.
## Contents
| File | Size | Description |
|------|------|-------------|
| `vector_index_ivf.faiss` | ~2.2 GB | FAISS IVFPQ index (48 subquantizers x 8 bits, ~32x compression) |
| `metadata.parquet` | ~12 GB | Row-group metadata (titles, text, timestamps) for zero-copy lookup |
**Coverage:** ~41 million vectors from 6.5M+ English Wikipedia articles, embedded with `sentence-transformers/all-MiniLM-L6-v2` (384-dim).
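The compression figure in the table can be checked with a little arithmetic. The sketch below uses only numbers quoted on this card; the vector count is the rounded "~41 million", so the totals are approximate.

```python
# Back-of-the-envelope check of the index size, using figures from this card.
DIM = 384                  # all-MiniLM-L6-v2 embedding width
SUBQUANTIZERS = 48         # PQ48
BITS_PER_CODE = 8
NUM_VECTORS = 41_000_000   # "~41 million", rounded

raw_bytes = DIM * 4                             # float32 storage per vector
pq_bytes = SUBQUANTIZERS * BITS_PER_CODE // 8   # PQ code size per vector
compression = raw_bytes / pq_bytes
dims_per_subvector = DIM // SUBQUANTIZERS       # dimensions each subquantizer covers

codes_gb = NUM_VECTORS * pq_bytes / 1e9
raw_gb = NUM_VECTORS * raw_bytes / 1e9

print(f"{raw_bytes} B raw vs {pq_bytes} B coded per vector: {compression:.0f}x")
print(f"PQ codes: ~{codes_gb:.2f} GB; uncompressed float32: ~{raw_gb:.0f} GB")
```

The PQ codes alone account for most of the ~2.2 GB file; the remainder is presumably vector IDs and IVF bookkeeping, though the card does not break this down.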
## Usage
### Download
```bash
pip install huggingface_hub
# Download both files into a local directory
huggingface-cli download PaczkiLives/daemon-wiki-faiss \
  --repo-type dataset \
  --local-dir ~/daemon-wiki-data/wiki_data
```
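The same download can be done from Python. This is a minimal sketch using `snapshot_download` from `huggingface_hub`; the helper name `download_index` is hypothetical and not part of Daemon.

```python
from pathlib import Path

REPO_ID = "PaczkiLives/daemon-wiki-faiss"


def download_index(local_dir: str = "~/daemon-wiki-data/wiki_data") -> Path:
    """Fetch the dataset files into local_dir and return the resolved directory.

    Hypothetical helper for illustration; not part of the Daemon codebase.
    """
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir).expanduser()
    snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=target)
    return target
```

Calling `download_index()` with no arguments targets the same directory as the CLI command above and needs roughly 14.5 GB of free disk space.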
### Point Daemon at the data
Set `WIKI_DATA_ROOT` to the **parent** directory of `wiki_data/`:
```bash
# If you downloaded to ~/daemon-wiki-data/wiki_data/
export WIKI_DATA_ROOT=~/daemon-wiki-data
# Then launch Daemon
python main.py
```
Or set individual paths directly:
```bash
export FAISS_INDEX_PATH=/path/to/wiki_data/vector_index_ivf.faiss
export FAISS_META_PATH=/path/to/wiki_data/metadata.parquet
```
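To make the relationship between the two configuration styles concrete, here is a sketch of how the paths could be resolved. The function `resolve_wiki_paths` and the precedence it applies (individual paths override `WIKI_DATA_ROOT`) are assumptions for illustration; Daemon's actual lookup logic may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Tuple


def resolve_wiki_paths(env: Mapping[str, str] = os.environ) -> Tuple[Path, Path]:
    """Return (index_path, metadata_path).

    Assumed precedence: explicit FAISS_* paths win, otherwise fall back to
    $WIKI_DATA_ROOT/wiki_data/<file>. This is a guess, not Daemon's code.
    """
    # WIKI_DATA_ROOT is the parent of wiki_data/, so the subdirectory is appended here.
    data_dir = Path(env.get("WIKI_DATA_ROOT", ".")).expanduser() / "wiki_data"
    index_path = env.get("FAISS_INDEX_PATH") or str(data_dir / "vector_index_ivf.faiss")
    meta_path = env.get("FAISS_META_PATH") or str(data_dir / "metadata.parquet")
    return Path(index_path).expanduser(), Path(meta_path).expanduser()
```

With only `WIKI_DATA_ROOT=/data` set, this yields `/data/wiki_data/vector_index_ivf.faiss` and `/data/wiki_data/metadata.parquet`, which shows why the variable must point at the parent directory rather than at `wiki_data/` itself.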
### Runtime requirements
- **RAM:** ~2.6 GB (2.2 GB FAISS index + 0.4 GB embedding model). Metadata is read on demand via zero-copy Parquet row-group access, so no DataFrame is ever loaded into memory.
- **Disk:** ~14.5 GB for both files.
- **CPU:** Works on CPU. No GPU required.
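For use outside Daemon, the sketch below shows one way to query the index and fetch a single metadata row without loading the whole Parquet file. It rests on assumptions the card does not confirm: that FAISS vector IDs equal global row positions in `metadata.parquet`, that queries need no normalization beyond what the model produces, and that `nprobe=16` is a reasonable setting. The function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def locate_row(row_group_sizes: Sequence[int], row_id: int) -> Tuple[int, int]:
    """Map a global row number to (row_group_index, offset_within_group)."""
    ends = list(accumulate(row_group_sizes))  # cumulative end offset of each group
    if not ends or not 0 <= row_id < ends[-1]:
        raise IndexError(f"row {row_id} out of range")
    group = bisect_right(ends, row_id)
    start = ends[group - 1] if group else 0
    return group, row_id - start


def fetch_metadata(parquet_path: str, row_id: int) -> dict:
    """Read only the row group containing row_id (assumes ID == row position)."""
    import pyarrow.parquet as pq  # lazy import: optional dependency

    pf = pq.ParquetFile(parquet_path)
    sizes = [pf.metadata.row_group(i).num_rows
             for i in range(pf.metadata.num_row_groups)]
    group, offset = locate_row(sizes, row_id)
    table = pf.read_row_group(group)
    return table.slice(offset, 1).to_pylist()[0]


def search(index_path: str, query: str, k: int = 5,
           nprobe: int = 16) -> List[Tuple[int, float]]:
    """Embed the query and return (vector_id, distance) pairs."""
    import faiss  # lazy imports: heavy optional dependencies
    from sentence_transformers import SentenceTransformer

    index = faiss.read_index(index_path)
    faiss.extract_index_ivf(index).nprobe = nprobe  # illustrative value
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vec = model.encode([query], convert_to_numpy=True).astype("float32")
    distances, ids = index.search(vec, k)
    return list(zip(ids[0].tolist(), distances[0].tolist()))
```

Reading one row group per hit keeps memory bounded by the size of a single row group rather than the full ~12 GB file, which is consistent with the RAM figure above.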
## How it was built
```bash
# From the Daemon repo:
python scripts/build_faiss_index.py
```
The build pipeline:
1. Downloads the latest English Wikipedia dump (~22 GB compressed)
2. Parses XML, extracts article text
3. Chunks articles at ~512 tokens with header-aware splitting
4. Embeds chunks with `all-MiniLM-L6-v2` (384 dimensions)
5. Trains an IVF4096,PQ48 index on a sample, then adds all vectors
6. Writes metadata to a partitioned parquet file for zero-copy reads
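Steps 3 and 5 can be sketched as follows. This is not the contents of `scripts/build_faiss_index.py`: the chunker counts whitespace-separated words as a stand-in for tokens, treats wikitext-style `== Heading ==` lines as chunk boundaries, and the index uses FAISS's default L2 metric, all of which are assumptions.

```python
import re
from typing import Iterator, List

# Wikitext-style section header such as "== History ==" (assumed input format).
HEADER = re.compile(r"^={2,}\s*(.+?)\s*={2,}$")


def chunk_article(text: str, max_tokens: int = 512) -> Iterator[str]:
    """Yield chunks of at most max_tokens words that never span a header."""
    words: List[str] = []
    for line in text.splitlines():
        if HEADER.match(line.strip()):
            if words:  # a header closes the chunk in progress
                yield " ".join(words)
                words = []
            continue
        for word in line.split():
            words.append(word)
            if len(words) == max_tokens:
                yield " ".join(words)
                words = []
    if words:
        yield " ".join(words)


def build_index(vectors, train_sample):
    """Train IVF4096,PQ48 on a sample, then add every vector.

    Both arguments are float32 NumPy arrays of shape (n, 384).
    """
    import faiss  # lazy import: optional dependency

    index = faiss.index_factory(vectors.shape[1], "IVF4096,PQ48")
    index.train(train_sample)  # learns coarse centroids and PQ codebooks
    index.add(vectors)
    return index
```

Training on a sample rather than the full set keeps the clustering step tractable at this scale; the sample size the actual pipeline uses is not stated here.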
## License
MIT, the same license as the Daemon project.
Provider:
PaczkiLives



