
Floressek/wiki-1m-qdrant-snapshot

Hugging Face | Updated 2025-11-15 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Floressek/wiki-1m-qdrant-snapshot
Resource description:
---
dataset_name: wiki-1m-qdrant-snapshot
pretty_name: Wikipedia 1M — GTE Multilingual Embeddings (Qdrant Snapshot)
license: cc-by-sa-4.0
language:
- pl
tags:
- wikipedia
- embeddings
- qdrant
- vector-database
- rag
- gte
- multilingual
- 768d
size: 7GB
size_categories:
- 1M<n<10M
---

# Wikipedia 1M: Embedding Snapshot (Qdrant, 768D, GTE Multilingual Base)

This dataset contains a **7GB Qdrant snapshot** with **1,000,000 Polish Wikipedia passages**, embedded using:

- **Model:** `Alibaba-NLP/gte-multilingual-base`
- **Embedding dimension:** `768`
- **Distance metric:** `cosine`
- **Index type:** HNSW (`M=32`, `ef_construct=256`, on-disk enabled)
- **Chunking strategy:** *semantic*, max chunk size 512, overlap 128
- **Payloads:** include passage text + metadata

The snapshot can be restored directly using the Qdrant client.

Because the content originates from **Wikipedia**, the dataset is distributed under **CC-BY-SA-4.0**, in accordance with the original CC-BY-SA-3.0 license.

---

## Dataset Details

### Source

The dataset consists of the **first 1M processed Wikipedia passages**, chunked and embedded via the RAGx pipeline:

- **Chunker model:** `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- **Chunker strategy:** semantic segmentation, section-aware
- **Embedding model:** `Alibaba-NLP/gte-multilingual-base`
- **Query prefix:** `query:`
- **Passage prefix:** `passage:`
- **Max seq length:** 512

### Qdrant Configuration

- **Collection name:** `ragx_documents_1M_main_sample`
- **Vectors:** 1,000,000
- **Dimensionality:** 768
- **Distance:** cosine
- **Index:** HNSW, on-disk enabled
- **Search EF:** 256

### Contents

The snapshot contains:

- `vectors` (1M embeddings)
- `payloads` (raw chunk text, document metadata)
- Qdrant index structure (HNSW, WAL, snapshot)

---

## Uses

### Direct Use

- Retrieval-Augmented Generation (RAG)
- Hybrid retrievers with cross-encoders (e.g., Jina Reranker)
- Multi-hop retrieval
- Semantic search benchmarking
- Qdrant index bootstrapping
- Testing LLM-based chain-of-verification systems (CoVe)

### Out-of-Scope Use

- Reconstructing original Wikipedia pages from embeddings
- Tasks requiring full textual content without attribution

---

## Loading the Snapshot

### Python

```python
from huggingface_hub import hf_hub_download
from qdrant_client import QdrantClient

# Download the snapshot file from the Hugging Face dataset repo.
path = hf_hub_download(
    repo_id="floressek/wiki-1m-qdrant-snapshot",
    filename="wiki_1m_qdrant.snapshot",
    repo_type="dataset",
)

# Assumption: a Qdrant server runs at this URL and can read `path`
# (same machine or a shared volume).
client = QdrantClient(url="http://localhost:6333")
client.recover_snapshot(
    collection_name="ragx_documents_1M_main_sample",
    location=f"file://{path}",
)
```

### Qdrant CLI

```
./qdrant --snapshot wiki_1m_qdrant.snapshot:ragx_documents_1M_main_sample
```

---

## Licensing

Wikipedia text is licensed under **CC-BY-SA 3.0**, and therefore all derivative works, including embeddings, must follow a share-alike license.

This dataset is released under **CC-BY-SA 4.0**.

---

## Citation

If you use this dataset, please cite:

**Wikipedia:**

```
Wikipedia contributors. (2024). Wikipedia, The Free Encyclopedia.
```

**GTE Multilingual Base:**

```
Alibaba-NLP/gte-multilingual-base
```

**Qdrant:**

```
Qdrant: Scalable Vector Search Engine. https://qdrant.tech
```

---

## Dataset Card Contact

Maintainer: **Floressek**

Questions / issues: open an Issue in this repo.
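
As an illustration of the retrieval use case, here is a minimal sketch of querying the restored collection with the `query:` prefix and the search EF of 256 listed in the card. Several details are assumptions rather than facts from the card: a Qdrant server at `http://localhost:6333` that already holds `ragx_documents_1M_main_sample`, loading the model through `sentence-transformers` with `trust_remote_code=True`, and a single space after the prefix colon. The helper names `with_prefix` and `search` are hypothetical.

```python
QUERY_PREFIX = "query: "      # assumption: one space after the colon
PASSAGE_PREFIX = "passage: "  # assumption: one space after the colon
COLLECTION = "ragx_documents_1M_main_sample"
SEARCH_EF = 256


def with_prefix(text: str, prefix: str) -> str:
    """Strip surrounding whitespace and prepend the prefix if it is missing."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text


def search(question: str, top_k: int = 5, url: str = "http://localhost:6333"):
    """Embed a question and return (score, payload) pairs from Qdrant."""
    # Imported lazily so the helper above works without these packages.
    from qdrant_client import QdrantClient, models
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "Alibaba-NLP/gte-multilingual-base", trust_remote_code=True
    )
    # Normalized vectors match the cosine distance of the collection.
    vector = model.encode(
        with_prefix(question, QUERY_PREFIX), normalize_embeddings=True
    )

    client = QdrantClient(url=url)  # assumption: server is already running
    response = client.query_points(
        collection_name=COLLECTION,
        query=vector.tolist(),
        limit=top_k,
        with_payload=True,
        search_params=models.SearchParams(hnsw_ef=SEARCH_EF),
    )
    return [(point.score, point.payload) for point in response.points]
```

Calling `search(...)` with a Polish question would return up to five `(score, payload)` pairs. The payload field names depend on the RAGx pipeline and are not documented in the card, so inspect one result before relying on specific keys.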
Provider:
Floressek