
PaczkiLives/daemon-wiki-faiss

Source: Hugging Face · Updated 2026-04-03 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/PaczkiLives/daemon-wiki-faiss
Description:
---
license: mit
task_categories:
- feature-extraction
- text-retrieval
language:
- en
pretty_name: Daemon Wiki FAISS Index
size_categories:
- 10M<n<100M
tags:
- faiss
- wikipedia
- embeddings
- retrieval
---

# Daemon Wiki FAISS Index

Pre-built FAISS IVFPQ index and metadata for the [Daemon](https://github.com/lukehalleran/Daemon) conversational RAG system.

## Contents

| File | Size | Description |
|------|------|-------------|
| `vector_index_ivf.faiss` | ~2.2 GB | FAISS IVFPQ index (48 subquantizers x 8 bits, ~32x compression) |
| `metadata.parquet` | ~12 GB | Row-group metadata (titles, text, timestamps) for zero-copy lookup |

**Coverage:** ~41 million vectors from 6.5M+ English Wikipedia articles, embedded with `sentence-transformers/all-MiniLM-L6-v2` (384-dim).

## Usage

### Download

```bash
pip install huggingface_hub

# Download both files into a local directory
huggingface-cli download PaczkiLives/daemon-wiki-faiss \
  --repo-type dataset \
  --local-dir ~/daemon-wiki-data/wiki_data
```

### Point Daemon at the data

Set `WIKI_DATA_ROOT` to the **parent** directory of `wiki_data/`:

```bash
# If you downloaded to ~/daemon-wiki-data/wiki_data/
export WIKI_DATA_ROOT=~/daemon-wiki-data

# Then launch Daemon
python main.py
```

Or set individual paths directly:

```bash
export FAISS_INDEX_PATH=/path/to/wiki_data/vector_index_ivf.faiss
export FAISS_META_PATH=/path/to/wiki_data/metadata.parquet
```

### Runtime requirements

- **RAM:** ~2.6 GB (2.2 GB FAISS index + 0.4 GB embedding model). Metadata is read on-demand via zero-copy parquet row-group access, so no DataFrame is loaded into memory.
- **Disk:** ~14.5 GB for both files.
- **CPU:** Works on CPU. No GPU required.

## How it was built

```bash
# From the Daemon repo:
python scripts/build_faiss_index.py
```

The build pipeline:

1. Downloads the latest English Wikipedia dump (~22 GB compressed)
2. Parses XML, extracts article text
3. Chunks articles at ~512 tokens with header-aware splitting
4. Embeds chunks with `all-MiniLM-L6-v2` (384 dimensions)
5. Trains an IVF4096,PQ48 index on a sample, then adds all vectors
6. Writes metadata to a partitioned parquet file for zero-copy reads

## License

MIT, same as the Daemon project.
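
As a rough sanity check on the sizes quoted under Contents and Runtime requirements, the arithmetic can be sketched in a few lines of Python. The vector count, dimensionality, and PQ settings come from this card. Two things are assumptions made for illustration: that uncompressed vectors would be stored as float32, and that the index keeps one 64-bit id per vector. Small fixed costs such as the coarse centroids and PQ codebooks are ignored.

```python
# Back-of-the-envelope check of the index size and compression ratio.
# Inputs marked "card" are quoted from this dataset card; the rest are
# labeled assumptions, not confirmed details of the Daemon build.

DIM = 384                 # card: all-MiniLM-L6-v2 embedding width
SUBQUANTIZERS = 48        # card: PQ48
BITS_PER_CODE = 8         # card: 8 bits per subquantizer
N_VECTORS = 41_000_000    # card: "~41 million vectors"
FLOAT32_BYTES = 4         # assumption: raw vectors would be float32
ID_BYTES = 8              # assumption: one 64-bit id stored per vector

# Bytes per vector before and after product quantization.
raw_bytes_per_vector = DIM * FLOAT32_BYTES                  # 1536
pq_bytes_per_vector = SUBQUANTIZERS * BITS_PER_CODE // 8    # 48
compression_ratio = raw_bytes_per_vector / pq_bytes_per_vector

# Approximate totals in decimal gigabytes.
raw_gb = N_VECTORS * raw_bytes_per_vector / 1e9
codes_gb = N_VECTORS * pq_bytes_per_vector / 1e9
ids_gb = N_VECTORS * ID_BYTES / 1e9
index_gb = codes_gb + ids_gb

print(f"compression ratio: {compression_ratio:.0f}x")
print(f"uncompressed vectors: {raw_gb:.1f} GB")
print(f"PQ codes: {codes_gb:.2f} GB, ids: {ids_gb:.2f} GB")
print(f"estimated index size: {index_gb:.2f} GB")
```

Under these assumptions the estimate lands close to the ~2.2 GB file size listed above, and the 1536-to-48-byte reduction matches the quoted ~32x compression.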
Provider:
PaczkiLives