
anon-neurips-ed-2026/polymorphic-sybil-benchmark-data

Hugging Face, updated 2026-04-28, indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/anon-neurips-ed-2026/polymorphic-sybil-benchmark-data
Resource description:
---
license: cc-by-sa-4.0
language:
- en
task_categories:
- question-answering
tags:
- retrieval-augmented-generation
- rag-benchmark
- adversarial-robustness
- retrieval-poisoning
size_categories:
- 10M<n<100M
source_datasets:
- natural_questions
- hotpot_qa
- trivia_qa
pretty_name: "Polymorphic Sybil Benchmark — Retrieval Indexes"
viewer: false
---

# Polymorphic Sybil Benchmark: Retrieval Indexes

Companion retrieval artifacts for the **Polymorphic Sybil Retrieval Poisoning Benchmark** (NeurIPS 2026 D&B submission, under review). This repository hosts the large pre-built indexes that are impractical to distribute via the code repository.

**Code, manifests, and evaluator** live in a separate repository: [anonymous.4open.science/r/polymorphic-sybil-benchmark-code-8ED0](https://anonymous.4open.science/r/polymorphic-sybil-benchmark-code-8ED0/)

## Contents

| Artifact | Size | Format | Use |
|---|---:|---|---|
| `bm25/` | ~11 GB | Lucene index | BM25 top-1000 → ColBERTv2 rerank → top-10 |
| `e5_faiss/` | ~81 GB | FAISS IndexFlatIP (dim 1024) | E5-large-v2 top-200 → cross-encoder rerank → top-10 |

ColBERTv2 is downloaded at runtime from the official HuggingFace checkpoint (`colbert-ir/colbertv2.0`) and does not require a pre-built index in this repository.

The Wikipedia DPR 100-word corpus (21,015,324 passages) is embedded within the BM25 Lucene index; passage text is retrievable via Pyserini's `LuceneSearcher.doc()` interface.

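The ~81 GB size listed for `e5_faiss/` can be sanity-checked with a little arithmetic before committing to the download. The sketch below assumes the index holds one 1024-dimensional vector per corpus passage and that vectors are stored as uncompressed float32, which is how FAISS flat indexes keep them; the function name `flat_index_bytes` is illustrative and not part of the benchmark code.

```python
# Back-of-the-envelope size estimate for a flat (uncompressed) vector index.
# Assumption: one vector per passage, stored as float32 (4 bytes per value).
NUM_PASSAGES = 21_015_324   # corpus size stated in this README
DIM = 1024                  # embedding dimension stated for e5_faiss/
BYTES_PER_FLOAT32 = 4


def flat_index_bytes(num_vectors: int, dim: int,
                     bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Return the raw vector storage in bytes, ignoring file headers."""
    return num_vectors * dim * bytes_per_value


raw_bytes = flat_index_bytes(NUM_PASSAGES, DIM)
print(f"{raw_bytes / 1e9:.1f} GB (decimal), {raw_bytes / 2**30:.1f} GiB (binary)")
# prints: 86.1 GB (decimal), 80.2 GiB (binary)
```

Under these assumptions the raw vectors alone come to about 80 GiB, which is roughly in line with the listed size. Loading the whole index into memory would need a comparable amount of RAM.
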
## Quick start

```python
from huggingface_hub import snapshot_download

# Download only the BM25 index (~11 GB)
local = snapshot_download(
    repo_id="anon-neurips-ed-2026/polymorphic-sybil-benchmark-data",
    repo_type="dataset",
    allow_patterns=["bm25/*"],
)
print(f"BM25 index at: {local}/bm25/")
```

To download everything (~92 GB):

```python
from huggingface_hub import snapshot_download

# Download both indexes: bm25/ (~11 GB) and e5_faiss/ (~81 GB)
snapshot_download(
    repo_id="anon-neurips-ed-2026/polymorphic-sybil-benchmark-data",
    repo_type="dataset",
)
```

## Source data

The corpus is the **Wikipedia DPR 100-word split** (Karpukhin et al., 2020), which contains 21,015,324 passages.

The benchmark builds on four open-domain QA datasets:

| Dataset | Split | Pool size |
|---|---|---:|
| Natural Questions (NQ-open) | validation | 3,610 |
| HotpotQA | distractor dev | 7,405 |
| TriviaQA (unfiltered.nocontext) | validation | 11,313 |
| 2WikiMultiHopQA | dev | 12,576 |

## License

All artifacts are released under **CC BY-SA 4.0**, inherited from Wikipedia (CC BY-SA 3.0→4.0), NQ (CC BY-SA 3.0), and HotpotQA (CC BY-SA 4.0). Users who redistribute the artifacts must comply with the upstream licenses.

## Citation

```bibtex
@inproceedings{anon2026polymorphic,
  title     = {Polymorphic Sybil Retrieval Poisoning Benchmark: A Failure-Mode-Aware Evaluation Framework for Grounded QA},
  author    = {Anonymous},
  booktitle = {Advances in Neural Information Processing Systems (Datasets and Benchmarks Track)},
  year      = {2026},
  note      = {Under review}
}
```

Provider:
anon-neurips-ed-2026