PaczkiLives/daemon-wiki-faiss

Name: PaczkiLives/daemon-wiki-faiss
Creator: PaczkiLives
Published: 2026-04-03 16:06:33
License: MIT

Source: Hugging Face. Updated: 2026-04-03. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/PaczkiLives/daemon-wiki-faiss

Resource description:

---
license: mit
task_categories:
- feature-extraction
- text-retrieval
language:
- en
pretty_name: Daemon Wiki FAISS Index
size_categories:
- 10M<n<100M
tags:
- faiss
- wikipedia
- embeddings
- retrieval
---

# Daemon Wiki FAISS Index

Pre-built FAISS IVFPQ index and metadata for the [Daemon](https://github.com/lukehalleran/Daemon) conversational RAG system.

## Contents

| File | Size | Description |
|------|------|-------------|
| `vector_index_ivf.faiss` | ~2.2 GB | FAISS IVFPQ index (48 subquantizers x 8 bits, ~32x compression) |
| `metadata.parquet` | ~12 GB | Row-group metadata (titles, text, timestamps) for zero-copy lookup |

**Coverage:** ~41 million vectors from 6.5M+ English Wikipedia articles, embedded with `sentence-transformers/all-MiniLM-L6-v2` (384-dim).

## Usage

### Download

```bash
pip install huggingface_hub

# Download both files into a local directory
huggingface-cli download PaczkiLives/daemon-wiki-faiss \
  --repo-type dataset \
  --local-dir ~/daemon-wiki-data/wiki_data
```

### Point Daemon at the data

Set `WIKI_DATA_ROOT` to the **parent** directory of `wiki_data/`:

```bash
# If you downloaded to ~/daemon-wiki-data/wiki_data/
export WIKI_DATA_ROOT=~/daemon-wiki-data

# Then launch Daemon
python main.py
```

Or set individual paths directly:

```bash
export FAISS_INDEX_PATH=/path/to/wiki_data/vector_index_ivf.faiss
export FAISS_META_PATH=/path/to/wiki_data/metadata.parquet
```

### Runtime requirements

- **RAM:** ~2.6 GB (2.2 GB FAISS index + 0.4 GB embedding model). Metadata is read on-demand via zero-copy parquet row-group access, so no DataFrame is loaded into memory.
- **Disk:** ~14.5 GB for both files.
- **CPU:** Works on CPU. No GPU required.

## How it was built

```bash
# From the Daemon repo:
python scripts/build_faiss_index.py
```

The build pipeline:

1. Downloads the latest English Wikipedia dump (~22 GB compressed)
2. Parses XML, extracts article text
3. Chunks articles at ~512 tokens with header-aware splitting
4. Embeds chunks with `all-MiniLM-L6-v2` (384 dimensions)
5. Trains an IVF4096,PQ48 index on a sample, then adds all vectors
6. Writes metadata to a partitioned parquet file for zero-copy reads

## License

MIT, same as the Daemon project.
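
## Appendix: checking the size figures

The "~32x compression" and "~2.2 GB" figures in the Contents table can be sanity-checked with a little arithmetic. The sketch below uses the numbers stated in this card (384 dimensions, 48 subquantizers, 8 bits each, ~41 million vectors). It assumes that uncompressed vectors would be stored as `float32` and that the index keeps one 64-bit ID per vector. Both are assumptions for illustration, not details confirmed by the card.

```python
# Figures taken from the dataset card.
DIM = 384                  # all-MiniLM-L6-v2 embedding width
SUBQUANTIZERS = 48         # the "PQ48" part of IVF4096,PQ48
BITS_PER_CODE = 8
NUM_VECTORS = 41_000_000   # "~41 million", so this is approximate

# Assumption: an uncompressed vector is float32 (4 bytes per component).
raw_bytes_per_vector = DIM * 4

# Each subquantizer emits one 8-bit code, so a vector shrinks to 48 bytes.
pq_bytes_per_vector = SUBQUANTIZERS * BITS_PER_CODE // 8
compression_ratio = raw_bytes_per_vector / pq_bytes_per_vector

# Assumption: one 64-bit ID (8 bytes) is stored alongside each PQ code.
ID_BYTES = 8
approx_index_gb = NUM_VECTORS * (pq_bytes_per_vector + ID_BYTES) / 1e9

print(f"raw: {raw_bytes_per_vector} B/vector, PQ: {pq_bytes_per_vector} B/vector")
print(f"compression: {compression_ratio:.0f}x")
print(f"approximate index payload: {approx_index_gb:.2f} GB")
```

Under these assumptions the ratio is 1536 / 48 = 32, and the payload comes to roughly 2.3 GB, which is in the same range as the listed file size given that the vector count is itself approximate.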
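
## Appendix: how the path variables relate

The two configuration styles in the Usage section point at the same pair of files. The sketch below shows one way the lookup could work. `resolve_wiki_paths` is a hypothetical helper, not a function from the Daemon codebase, and the rule that an explicit `FAISS_INDEX_PATH` or `FAISS_META_PATH` takes precedence over `WIKI_DATA_ROOT` is an assumption.

```python
import os
from pathlib import Path


def resolve_wiki_paths(env=None):
    """Hypothetical helper: return (index_path, metadata_path) from env vars."""
    env = os.environ if env is None else env
    root = env.get("WIKI_DATA_ROOT")
    # WIKI_DATA_ROOT is the parent of wiki_data/, per the card.
    default_dir = Path(root).expanduser() / "wiki_data" if root else None

    def pick(var, filename):
        # Assumption: an explicit per-file variable overrides the root.
        explicit = env.get(var)
        if explicit:
            return Path(explicit).expanduser()
        if default_dir is not None:
            return default_dir / filename
        raise KeyError(f"set {var} or WIKI_DATA_ROOT")

    return (
        pick("FAISS_INDEX_PATH", "vector_index_ivf.faiss"),
        pick("FAISS_META_PATH", "metadata.parquet"),
    )


# Example with an in-memory mapping instead of the real environment.
index_path, meta_path = resolve_wiki_paths({"WIKI_DATA_ROOT": "/data/daemon"})
print(index_path)
print(meta_path)
```

Passing a plain dictionary keeps the example independent of the real environment. Calling `resolve_wiki_paths()` with no argument reads `os.environ` instead.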
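
## Appendix: locating a row group

The card says metadata is read per row group rather than loaded as a whole. One way to support that kind of access is to translate a global row number into a row-group index and an offset inside that group. The sketch below shows that mapping with the standard library only. It assumes that a vector ID returned by the index equals the row position in `metadata.parquet`, which the card does not state. `locate_row` and the sample sizes are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_row(global_row, row_group_sizes):
    """Map a global row number to (row_group_index, offset_within_group)."""
    total = sum(row_group_sizes)
    if not 0 <= global_row < total:
        raise IndexError(f"row {global_row} is outside 0..{total - 1}")
    # ends[i] is the exclusive end row of group i, e.g. [1000, 2000, 2500].
    ends = list(accumulate(row_group_sizes))
    # The first group whose end is strictly greater than the row holds it.
    group = bisect_right(ends, global_row)
    start = ends[group] - row_group_sizes[group]
    return group, global_row - start


# Hypothetical row-group sizes, for illustration only.
sizes = [1000, 1000, 500]
for row in (0, 999, 1000, 2499):
    print(row, locate_row(row, sizes))
```

With `pyarrow`, the per-group sizes could be read from `pyarrow.parquet.ParquetFile(path).metadata.row_group(i).num_rows`, and a single group fetched with `ParquetFile.read_row_group(i)`, so only the group containing the requested row is read.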

Provided by:

PaczkiLives
