anon-neurips-ed-2026/polymorphic-sybil-benchmark-data
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/anon-neurips-ed-2026/polymorphic-sybil-benchmark-data
Description:
---
license: cc-by-sa-4.0
language:
- en
task_categories:
- question-answering
tags:
- retrieval-augmented-generation
- rag-benchmark
- adversarial-robustness
- retrieval-poisoning
size_categories:
- 10M<n<100M
source_datasets:
- natural_questions
- hotpot_qa
- trivia_qa
pretty_name: "Polymorphic Sybil Benchmark — Retrieval Indexes"
viewer: false
---
# Polymorphic Sybil Benchmark: Retrieval Indexes
Companion retrieval artifacts for the **Polymorphic Sybil Retrieval Poisoning Benchmark** (NeurIPS 2026 D&B submission, under review). This repository hosts the large pre-built indexes that are impractical to distribute via the code repository.
**Code, manifests, and evaluator** live in a separate repository:
[anonymous.4open.science/r/polymorphic-sybil-benchmark-code-8ED0](https://anonymous.4open.science/r/polymorphic-sybil-benchmark-code-8ED0/)
## Contents
| Artifact | Size | Format | Use |
|---|---:|---|---|
| `bm25/` | ~11 GB | Lucene index | BM25 top-1000 → ColBERTv2 rerank → top-10 |
| `e5_faiss/` | ~81 GB | FAISS IndexFlatIP (dim 1024) | E5-large-v2 top-200 → cross-encoder rerank → top-10 |
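
The `e5_faiss/` size can be sanity-checked with a little arithmetic. The sketch below assumes the index holds one 1024-dimensional vector per corpus passage and nothing else; a FAISS flat index stores vectors as uncompressed float32. The constant names are illustrative and not taken from the benchmark code.

```python
# Back-of-the-envelope size of a flat inner-product index.
NUM_PASSAGES = 21_015_324   # Wikipedia DPR 100-word passages (from this card)
DIM = 1024                  # E5-large-v2 embedding dimension (from this card)
BYTES_PER_FLOAT32 = 4       # flat FAISS indexes store raw float32 vectors

raw_bytes = NUM_PASSAGES * DIM * BYTES_PER_FLOAT32
size_gb = raw_bytes / 1e9       # decimal gigabytes
size_gib = raw_bytes / 2**30    # binary gibibytes

print(f"{raw_bytes:,} bytes = {size_gb:.1f} GB = {size_gib:.1f} GiB")
```

The raw vectors alone come to roughly 80 GiB, which is in line with the ~81 GB listed in the table. Loading the index fully into memory would need a comparable amount of RAM.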
ColBERTv2 is downloaded at runtime from the official Hugging Face checkpoint (`colbert-ir/colbertv2.0`), so this repository does not include a pre-built ColBERTv2 index.
The Wikipedia DPR 100-word corpus (21,015,324 passages) is stored inside the BM25 Lucene index; passage text can be retrieved through Pyserini's `LuceneSearcher.doc()` interface.
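As a rough illustration of reading passage text back out of the Lucene index, the sketch below wraps `LuceneSearcher.doc()` in a small helper. It assumes that the downloaded `bm25/` directory can be opened directly as a Lucene index and that raw passage text was stored at indexing time; the helper names `open_searcher` and `passage_raw` are hypothetical and not part of the benchmark code.

```python
from typing import Optional


def open_searcher(index_dir: str):
    """Open a Lucene index directory with Pyserini.

    Pyserini is imported lazily so the helper below can be used and
    tested without a Java/Pyserini installation.
    """
    from pyserini.search.lucene import LuceneSearcher
    return LuceneSearcher(index_dir)


def passage_raw(searcher, docid: str) -> Optional[str]:
    """Return the stored raw content for docid, or None if it is absent.

    LuceneSearcher.doc() returns None for an unknown docid, so that case
    is handled explicitly instead of raising AttributeError.
    """
    doc = searcher.doc(docid)
    return None if doc is None else doc.raw()


# Example usage (assumes `local` is the snapshot path from the download step
# and that the index lives directly under bm25/):
#   searcher = open_searcher(f"{local}/bm25")
#   for hit in searcher.search("who wrote hamlet", k=10):
#       print(hit.docid, hit.score, passage_raw(searcher, hit.docid))
```

The exact shape of the raw content (plain text or a JSON record) depends on how the index was built, so it is worth inspecting one document before writing a parser.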
## Quick start
```python
from huggingface_hub import snapshot_download

# Download only the BM25 index (~11 GB)
local = snapshot_download(
    repo_id="anon-neurips-ed-2026/polymorphic-sybil-benchmark-data",
    repo_type="dataset",
    allow_patterns=["bm25/*"],
)
print(f"BM25 index at: {local}/bm25/")
```
To download everything (~92 GB):
```python
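from huggingface_hub import snapshot_download

# Without allow_patterns, every file in the repository is downloaded
local = snapshot_download(
    repo_id="anon-neurips-ed-2026/polymorphic-sybil-benchmark-data",
    repo_type="dataset",
)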
```
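Given the sizes involved, it may be worth checking free disk space before starting a download. The following is a minimal sketch using only the standard library; the sizes are the approximate figures from the Contents table, and the 10% safety margin and the helper name `has_room` are assumptions for illustration.

```python
import shutil

# Approximate artifact sizes in GB, taken from the Contents table.
SIZES_GB = {"bm25": 11, "e5_faiss": 81}


def has_room(artifacts, path=".", margin=1.1):
    """Return True if `path` has enough free space for the given artifacts.

    `margin` pads the estimate (10% by default, an arbitrary choice) to
    leave headroom for temporary files created during the download.
    """
    needed_gb = sum(SIZES_GB[name] for name in artifacts) * margin
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    for selection in (["bm25"], ["bm25", "e5_faiss"]):
        print(selection, "fits" if has_room(selection) else "does not fit")
```

Note that `snapshot_download` stores files in the Hugging Face cache directory by default, so `path` should point at the filesystem that holds that cache (or at the `local_dir` passed to the download call).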
## Source data
The corpus is the **Wikipedia DPR 100-word split** (Karpukhin et al., 2020) — 21,015,324 passages. The benchmark builds on four open-domain QA datasets:
| Dataset | Split | Pool size |
|---|---|---:|
| Natural Questions (NQ-open) | validation | 3,610 |
| HotpotQA | distractor dev | 7,405 |
| TriviaQA (unfiltered.nocontext) | validation | 11,313 |
| 2WikiMultiHopQA | dev | 12,576 |
## License
All artifacts are released under **CC BY-SA 4.0**, inherited from Wikipedia (CC BY-SA 3.0→4.0), NQ (CC BY-SA 3.0), and HotpotQA (CC BY-SA 4.0). Anyone redistributing these artifacts must comply with the upstream licenses.
## Citation
```bibtex
@inproceedings{anon2026polymorphic,
title = {Polymorphic Sybil Retrieval Poisoning Benchmark:
A Failure-Mode-Aware Evaluation Framework for Grounded QA},
author = {Anonymous},
booktitle = {Advances in Neural Information Processing Systems (Datasets and Benchmarks Track)},
year = {2026},
note = {Under review}
}
```
Provider:
anon-neurips-ed-2026



