anon-neurips-ed-2026/polymorphic-sybil-benchmark-data

Name: anon-neurips-ed-2026/polymorphic-sybil-benchmark-data
Creator: anon-neurips-ed-2026
Published: 2026-04-28 07:44:17
License: CC BY-SA 4.0 (as stated in the dataset card)

Source: Hugging Face. Updated 2026-04-28. Indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/anon-neurips-ed-2026/polymorphic-sybil-benchmark-data

Description:

---
license: cc-by-sa-4.0
language:
- en
task_categories:
- question-answering
tags:
- retrieval-augmented-generation
- rag-benchmark
- adversarial-robustness
- retrieval-poisoning
size_categories:
- 10M<n<100M
source_datasets:
- natural_questions
- hotpot_qa
- trivia_qa
pretty_name: "Polymorphic Sybil Benchmark — Retrieval Indexes"
viewer: false
---

# Polymorphic Sybil Benchmark: Retrieval Indexes

Companion retrieval artifacts for the **Polymorphic Sybil Retrieval Poisoning Benchmark** (NeurIPS 2026 D&B submission, under review). This repository hosts the large pre-built indexes that are impractical to distribute via the code repository.

**Code, manifests, and evaluator** live in a separate repository: [anonymous.4open.science/r/polymorphic-sybil-benchmark-code-8ED0](https://anonymous.4open.science/r/polymorphic-sybil-benchmark-code-8ED0/)

## Contents

| Artifact | Size | Format | Use |
|---|---:|---|---|
| `bm25/` | ~11 GB | Lucene index | BM25 top-1000 → ColBERTv2 rerank → top-10 |
| `e5_faiss/` | ~81 GB | FAISS IndexFlatIP (dim 1024) | E5-large-v2 top-200 → cross-encoder rerank → top-10 |

ColBERTv2 is downloaded at runtime from the official HuggingFace checkpoint (`colbert-ir/colbertv2.0`) and does not require a pre-built index in this repository.

The Wikipedia DPR 100-word corpus (21,015,324 passages) is embedded within the BM25 Lucene index; passage text is retrievable via Pyserini's `LuceneSearcher.doc()` interface.
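
As a rough sanity check on the `e5_faiss/` size listed in the Contents table, the sketch below estimates the raw vector payload of a flat FAISS index. It assumes the index holds exactly one 1024-dimensional float32 vector per passage of the 21,015,324-passage corpus and nothing else; the dataset card does not confirm this, and the helper name `flat_index_bytes` is made up for illustration.

```python
# Back-of-the-envelope size of a flat (uncompressed) FAISS inner-product index.
NUM_PASSAGES = 21_015_324   # passage count stated for the Wikipedia DPR corpus
EMBEDDING_DIM = 1024        # dimension stated for the E5 index
BYTES_PER_FLOAT32 = 4       # assumption: vectors are stored as float32


def flat_index_bytes(num_vectors: int, dim: int,
                     bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw vector payload in bytes, ignoring file headers and ID mappings."""
    return num_vectors * dim * bytes_per_value


index_bytes = flat_index_bytes(NUM_PASSAGES, EMBEDDING_DIM)
index_gb = index_bytes / 10**9    # decimal gigabytes
index_gib = index_bytes / 2**30   # binary gibibytes

print(f"{index_bytes:,} bytes = {index_gb:.2f} GB = {index_gib:.2f} GiB")
```

Under these assumptions the payload is about 86 GB in decimal units, or about 80 GiB, which is in the same range as the listed ~81 GB. The exact on-disk figure depends on the unit used and on any auxiliary files in the directory, so leave some headroom when provisioning storage.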

## Quick start

```python
from huggingface_hub import snapshot_download

# Download only the BM25 index (~11 GB)
local = snapshot_download(
    repo_id="anon-neurips-ed-2026/polymorphic-sybil-benchmark-data",
    repo_type="dataset",
    allow_patterns=["bm25/*"],
)
print(f"BM25 index at: {local}/bm25/")
```

To download everything (~92 GB):

```python
from huggingface_hub import snapshot_download

# Omitting allow_patterns fetches every file in the repository
snapshot_download(
    repo_id="anon-neurips-ed-2026/polymorphic-sybil-benchmark-data",
    repo_type="dataset",
)
```

## Source data

The corpus is the **Wikipedia DPR 100-word split** (Karpukhin et al., 2020), which contains 21,015,324 passages. The benchmark builds on four open-domain QA datasets:

| Dataset | Split | Pool size |
|---|---|---:|
| Natural Questions (NQ-open) | validation | 3,610 |
| HotpotQA | distractor dev | 7,405 |
| TriviaQA (unfiltered.nocontext) | validation | 11,313 |
| 2WikiMultiHopQA | dev | 12,576 |

## License

All artifacts: **CC BY-SA 4.0**. The license is inherited from Wikipedia (CC BY-SA 3.0→4.0), NQ (CC BY-SA 3.0), and HotpotQA (CC BY-SA 4.0). Anyone redistributing the artifacts must comply with the upstream licenses.

## Citation

```bibtex
@inproceedings{anon2026polymorphic,
  title     = {Polymorphic Sybil Retrieval Poisoning Benchmark: A Failure-Mode-Aware Evaluation Framework for Grounded QA},
  author    = {Anonymous},
  booktitle = {Advances in Neural Information Processing Systems (Datasets and Benchmarks Track)},
  year      = {2026},
  note      = {Under review}
}
```

Provider:

anon-neurips-ed-2026
