Floressek/wiki-1m-qdrant-snapshot
Source: Hugging Face. Updated 2025-11-15. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Floressek/wiki-1m-qdrant-snapshot
Resource description:
---
dataset_name: wiki-1m-qdrant-snapshot
pretty_name: Wikipedia 1M — GTE Multilingual Embeddings (Qdrant Snapshot)
license: cc-by-sa-4.0
language:
- pl
tags:
- wikipedia
- embeddings
- qdrant
- vector-database
- rag
- gte
- multilingual
- 768d
size: 7GB
size_categories:
- 1M<n<10M
---
# Wikipedia 1M — Embedding Snapshot (Qdrant, 768D, GTE Multilingual Base)
This dataset contains a **7GB Qdrant snapshot** with **1,000,000 Polish Wikipedia passages**, embedded using:
- **Model:** `Alibaba-NLP/gte-multilingual-base`
- **Embedding dimension:** `768`
- **Distance metric:** `cosine`
- **Index type:** HNSW (`M=32`, `ef_construct=256`, on-disk enabled)
- **Chunking strategy:** *semantic*, max chunk size 512, overlap 128
- **Payloads:** include passage text + metadata
The snapshot can be restored directly using the Qdrant client.
Because the content originates from **Wikipedia**, the dataset is distributed under
**CC-BY-SA-4.0**, in accordance with the original CC-BY-SA-3.0 license.
---
## Dataset Details
### Source
The dataset consists of the **first 1M processed Wikipedia passages**, chunked and embedded via the RAGx pipeline:
- **Chunker model:** `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- **Chunker strategy:** semantic segmentation, section-aware
- **Embedding model:** `Alibaba-NLP/gte-multilingual-base`
- **Query prefix:** `query:`
- **Passage prefix:** `passage:`
- **Max seq length:** 512
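The prefixes listed above can be applied with a small helper before encoding. This is a minimal sketch, not the RAGx pipeline's own code: the function names are hypothetical, and the single space after the colon is an assumption, since the card lists only `query:` and `passage:`. The `embed` helper additionally assumes the `sentence-transformers` package is installed and loads the model with `trust_remote_code=True`.

```python
QUERY_PREFIX = "query:"      # from the card
PASSAGE_PREFIX = "passage:"  # from the card


def with_prefix(text: str, prefix: str) -> str:
    """Prepend a prefix to a text; the separating space is an assumption."""
    return f"{prefix} {text.strip()}"


def embed(texts: list[str], prefix: str):
    """Encode prefixed texts with the embedding model named in the card."""
    # Imported lazily so with_prefix() works without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "Alibaba-NLP/gte-multilingual-base", trust_remote_code=True
    )
    model.max_seq_length = 512  # "Max seq length" from the card
    # Normalized vectors make the dot product equal to cosine similarity.
    return model.encode(
        [with_prefix(t, prefix) for t in texts], normalize_embeddings=True
    )
```

Queries sent to the restored collection would use `QUERY_PREFIX`, so that they match how the stored passages were presumably encoded with `PASSAGE_PREFIX`.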
### Qdrant Configuration
- **Collection name:** `ragx_documents_1M_main_sample`
- **Vectors:** 1,000,000
- **Dimensionality:** 768
- **Distance:** cosine
- **Index:** HNSW, on-disk enabled
- **Search EF:** 256
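For readers who want an empty collection with matching settings (for example, to index additional passages), the configuration above could be expressed as a request body for Qdrant's `PUT /collections/{name}` REST endpoint. This is a sketch under assumptions: the card does not say whether "on-disk" applies to the HNSW graph, the vectors, or both, so the flag placement below is a guess. Restoring the snapshot itself does not require this step, because a snapshot carries its own collection configuration.

```python
import json

COLLECTION = "ragx_documents_1M_main_sample"  # collection name from the card


def collection_config() -> dict:
    """Body for PUT /collections/{name}, mirroring the settings listed above."""
    return {
        "vectors": {"size": 768, "distance": "Cosine"},
        "hnsw_config": {
            "m": 32,
            "ef_construct": 256,
            "on_disk": True,  # assumption: "on-disk" refers to the HNSW graph
        },
    }


print(json.dumps(collection_config(), indent=2))
```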
### Contents
The snapshot contains:
- `vectors` (1M embeddings)
- `payloads` (raw chunk text, document metadata)
- Qdrant index structure (HNSW, WAL, snapshot)
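As a rough sanity check on the 7GB figure, the raw vector data alone can be estimated from the counts above. The sketch assumes uncompressed 32-bit floats, which the card does not state; the remainder of the snapshot would then be payload text, the HNSW graph, and other storage files.

```python
NUM_VECTORS = 1_000_000  # from the card
DIM = 768                # from the card
BYTES_PER_VALUE = 4      # assumption: float32 storage, no quantization


def raw_vector_bytes(n: int, dim: int, bytes_per_value: int) -> int:
    """Size of the dense vectors alone, excluding payloads and index."""
    return n * dim * bytes_per_value


raw = raw_vector_bytes(NUM_VECTORS, DIM, BYTES_PER_VALUE)
print(f"{raw / 1e9:.2f} GB")  # 3.07 GB
```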
---
## Uses
### Direct Use
- Retrieval-Augmented Generation (RAG)
- Hybrid retrievers with cross-encoders (e.g., Jina Reranker)
- Multi-hop retrieval
- Semantic search benchmarking
- Qdrant index bootstrapping
- Testing LLM-based chain-of-verification systems (CoVe)
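A retrieval call against the restored collection might look like the following sketch, which uses only the standard library and Qdrant's REST search endpoint. The server address `http://localhost:6333` and the function names are assumptions; the query vector is expected to be a 768-dimensional embedding of a prefixed query produced by the same model. Newer Qdrant releases also offer a `/points/query` endpoint, which is not shown here.

```python
import json
import urllib.request

COLLECTION = "ragx_documents_1M_main_sample"  # collection name from the card


def build_search_request(vector, limit=5, base_url="http://localhost:6333"):
    """Build (but do not send) a POST /collections/{name}/points/search request."""
    body = {
        "vector": list(vector),
        "limit": limit,
        "with_payload": True,        # return passage text and metadata
        "params": {"hnsw_ef": 256},  # "Search EF" from the card
    }
    return urllib.request.Request(
        url=f"{base_url}/collections/{COLLECTION}/points/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(vector, limit=5):
    """Send the request and return the list of scored points."""
    with urllib.request.urlopen(build_search_request(vector, limit)) as resp:
        return json.loads(resp.read())["result"]
```

Each returned hit carries an `id`, a `score`, and a `payload`; the payload field names are not documented in this card, so inspect one hit before relying on specific keys.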
### Out-of-Scope Use
- Reconstructing original Wikipedia pages from embeddings
- Tasks requiring full textual content without attribution
---
## Loading the Snapshot
### Python
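```python
from huggingface_hub import hf_hub_download
from qdrant_client import QdrantClient

# Download the snapshot file from the Hugging Face Hub (about 7 GB).
path = hf_hub_download(
    repo_id="floressek/wiki-1m-qdrant-snapshot",
    filename="wiki_1m_qdrant.snapshot",
    repo_type="dataset",
)

# Restore into a running Qdrant server (assumed here at localhost:6333).
# The file:// location is read by the server, so the file must be visible to it.
client = QdrantClient(url="http://localhost:6333")
client.recover_snapshot(
    collection_name="ragx_documents_1M_main_sample",
    location=f"file://{path}",
)
```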
### Qdrant CLI
```
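# Recover at server start-up; the argument format is <snapshot path>:<collection name>
qdrant --snapshot wiki_1m_qdrant.snapshot:ragx_documents_1M_main_sample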
```
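After restoring, it can be worth confirming that the collection matches the figures in this card. The sketch below checks the response of Qdrant's `GET /collections/{name}` endpoint; the server address and function names are assumptions, and the check assumes a single unnamed vector (with named vectors, the `vectors` entry would instead be keyed by name).

```python
import json
import urllib.request

COLLECTION = "ragx_documents_1M_main_sample"  # collection name from the card


def check_collection_info(info: dict) -> list[str]:
    """Return mismatches between a collection-info response and the card."""
    result = info["result"]
    vectors = result["config"]["params"]["vectors"]
    problems = []
    if result.get("points_count") != 1_000_000:
        problems.append(f"points_count={result.get('points_count')}")
    if vectors.get("size") != 768:
        problems.append(f"size={vectors.get('size')}")
    if vectors.get("distance") != "Cosine":
        problems.append(f"distance={vectors.get('distance')}")
    return problems


def fetch_collection_info(base_url="http://localhost:6333") -> dict:
    """Fetch collection info from a running Qdrant server."""
    with urllib.request.urlopen(f"{base_url}/collections/{COLLECTION}") as resp:
        return json.loads(resp.read())
```

An empty list from `check_collection_info(fetch_collection_info())` suggests the restore produced the expected collection.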
---
## Licensing
Wikipedia text is licensed under **CC-BY-SA 3.0**, so derivative works, including these embeddings, must be released under a share-alike license.
This dataset is released under **CC-BY-SA 4.0**.
---
## Citation
If you use this dataset, please cite:
**Wikipedia:**
```
Wikipedia contributors. (2024). Wikipedia, The Free Encyclopedia.
```
**GTE Multilingual Base:**
```
Alibaba-NLP/gte-multilingual-base
```
**Qdrant:**
```
Qdrant: Scalable Vector Search Engine. https://qdrant.tech
```
---
## Dataset Card Contact
Maintainer: **Floressek**
Questions / issues: open an Issue in this repo.
Provided by:
Floressek



