lightonai/nq-decontaminated
Source: Hugging Face · Updated 2026-03-25 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/lightonai/nq-decontaminated
Description:
---
language:
- eng
license: mit
task_categories:
- text-retrieval
task_ids:
- document-retrieval
tags:
- decontaminated
- beir
- information-retrieval
configs:
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.parquet
- config_name: queries
  data_files:
  - split: queries
    path: queries.parquet
- config_name: qrels-test
  data_files:
  - split: test
    path: qrels_test.parquet
---
# nq (Decontaminated)
A decontaminated version of the [nq](https://huggingface.co/datasets/BeIR/nq) dataset from the BEIR benchmark, with samples found in the [mgte-en](https://huggingface.co/datasets/Alibaba-NLP/mgte-en) pre-training dataset removed.
## Decontamination methodology
Contamination was detected using a two-pass approach against the full mgte-en dataset (484 GB, 1,235 parquet files):
### Pass 1: Exact hash matching
All texts (queries and corpus documents) were normalized (lowercased, Unicode NFKD applied, whitespace collapsed) and hashed with xxHash-64. The same normalization and hashing were applied to every `query` and `document` field in mgte-en. Any sample whose hash appeared in mgte-en was flagged as contaminated.
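The pass above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: the order of the normalization steps, the helper names (`normalize`, `hash64`, `flag_exact_matches`), and the in-memory set of reference hashes are all assumptions. If the third-party `xxhash` package is not installed, the sketch falls back to a 64-bit BLAKE2b digest purely so that it still runs; that fallback is a stand-in and does not produce xxHash values.

```python
import hashlib
import re
import unicodedata

try:
    import xxhash  # third-party package: pip install xxhash

    def hash64(text):
        """64-bit xxHash of the UTF-8 encoded text."""
        return xxhash.xxh64(text.encode("utf-8")).intdigest()

except ImportError:

    def hash64(text):
        """Stand-in (NOT xxHash): 64-bit BLAKE2b, used only if xxhash is missing."""
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")


_WHITESPACE = re.compile(r"\s+")


def normalize(text):
    """Lowercase, apply Unicode NFKD, collapse whitespace (step order is assumed)."""
    text = unicodedata.normalize("NFKD", text.lower())
    return _WHITESPACE.sub(" ", text).strip()


def flag_exact_matches(samples, reference_hashes):
    """Return the ids of samples whose normalized hash is in reference_hashes.

    samples: dict mapping sample id -> text (hypothetical layout).
    reference_hashes: set of hashes computed the same way over the reference data.
    """
    return {
        sample_id
        for sample_id, text in samples.items()
        if hash64(normalize(text)) in reference_hashes
    }
```

Because both sides go through the same `normalize` function, texts that differ only in casing, spacing, or Unicode compatibility forms hash to the same value.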
### Pass 2: 13-gram containment (GPT-3 style)
Following the methodology introduced in the GPT-3 paper (Brown et al., 2020), word-level 13-grams were extracted from all remaining samples. For each sample, containment was computed as:
```
containment = |ngrams_in_sample ∩ ngrams_in_mgte| / |ngrams_in_sample|
```
Samples with containment >= 0.5 were flagged as near-duplicates.
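The containment check can be illustrated with a short sketch. It assumes whitespace tokenization, counts distinct n-grams (sets rather than multisets), and treats texts shorter than 13 words as having zero containment; none of these details are stated above, and the function names are hypothetical.

```python
def word_ngrams(text, n=13):
    """Set of distinct word-level n-grams, using whitespace tokenization (assumed)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(sample, reference_ngrams, n=13):
    """Fraction of the sample's n-grams that also occur in reference_ngrams."""
    grams = word_ngrams(sample, n)
    if not grams:
        # Assumption: a text with fewer than n words has no n-grams,
        # so it cannot be flagged by this pass.
        return 0.0
    return len(grams & reference_ngrams) / len(grams)


def is_near_duplicate(sample, reference_ngrams, n=13, threshold=0.5):
    """Apply the >= 0.5 containment threshold described above."""
    return containment(sample, reference_ngrams, n) >= threshold
```

In practice the reference n-gram set for a 484 GB corpus would not fit in a plain Python set; a real pipeline would likely hash the n-grams or shard the lookup, which this sketch does not attempt.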
### Qrels filtering
Relevance judgments (qrels) referencing any removed query or corpus document were also removed.
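A minimal sketch of this filtering step is shown below. The column names `query-id` and `corpus-id` follow the usual BEIR qrels layout and are an assumption here, as is the representation of qrels as a list of dicts; `filter_qrels` is a hypothetical helper name.

```python
def filter_qrels(qrels, removed_query_ids, removed_doc_ids):
    """Keep only judgments whose query and document both survived decontamination.

    qrels: iterable of dicts with "query-id" and "corpus-id" keys (assumed names).
    removed_query_ids / removed_doc_ids: sets of ids flagged in pass 1 or pass 2.
    """
    return [
        row
        for row in qrels
        if row["query-id"] not in removed_query_ids
        and row["corpus-id"] not in removed_doc_ids
    ]
```

A judgment is dropped as soon as either side is missing, so removing a large share of the corpus can remove a much larger share of the qrels than of the queries.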
## Decontamination results
| Component | Original | Clean | Removed |
|---|---|---|---|
| Corpus | 2,681,468 | 305,674 | 2,375,794 |
| Queries | 3,452 | 3,127 | 325 |
### Qrels per split
| Split | Original | Clean | Removed |
|---|---|---|---|
| test | 4,201 | 26 | 4,175 |
## Usage
```python
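from datasets import load_dataset

# Each config exposes a single split, as declared in the card metadata.
corpus = load_dataset("lightonai/nq-decontaminated", "corpus", split="corpus")
queries = load_dataset("lightonai/nq-decontaminated", "queries", split="queries")
qrels = load_dataset("lightonai/nq-decontaminated", "qrels-test", split="test")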
```
## Citation
Please cite the original BEIR benchmark:
```bibtex
@inproceedings{thakur2021beir,
title={BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
author={Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna},
booktitle={NeurIPS Datasets and Benchmarks},
year={2021}
}
```
## License
MIT (same as original BEIR)
Provider:
lightonai



