lightonai/nq-decontaminated

Source: Hugging Face. Updated: 2026-03-25. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/lightonai/nq-decontaminated
Resource description:
---
language:
- eng
license: mit
task_categories:
- text-retrieval
task_ids:
- document-retrieval
tags:
- decontaminated
- beir
- information-retrieval
configs:
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.parquet
- config_name: queries
  data_files:
  - split: queries
    path: queries.parquet
- config_name: qrels-test
  data_files:
  - split: test
    path: qrels_test.parquet
---

# nq (Decontaminated)

A decontaminated version of the [nq](https://huggingface.co/datasets/BeIR/nq) dataset from the BEIR benchmark, with samples found in the [mgte-en](https://huggingface.co/datasets/Alibaba-NLP/mgte-en) pre-training dataset removed.

## Decontamination methodology

Contamination was detected using a two-pass approach against the full mgte-en dataset (484 GB, 1,235 parquet files):

### Pass 1: Exact hash matching

All texts (queries and corpus documents) were normalized (lowercased, Unicode NFKD, whitespace collapsed) and hashed with xxHash-64. The same normalization + hashing was applied to every `query` and `document` field in mgte-en. Any sample whose hash appeared in mgte-en was flagged as contaminated.

### Pass 2: 13-gram containment (GPT-3 style)

Following the methodology introduced in the GPT-3 paper (Brown et al., 2020), word-level 13-grams were extracted from all remaining samples. For each sample, containment was computed as:

```
containment = |ngrams_in_sample ∩ ngrams_in_mgte| / |ngrams_in_sample|
```

Samples with containment >= 0.5 were flagged as near-duplicates.

### Qrels filtering

Relevance judgments (qrels) referencing any removed query or corpus document were also removed.
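
The two passes described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, the standard-library BLAKE2b digest (truncated to 64 bits) stands in for xxHash-64 so the example has no third-party dependency, and it assumes that 13-grams are taken from the normalized text and that samples shorter than 13 words get a containment of 0.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, apply Unicode NFKD, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    return " ".join(text.split())


def text_hash(text: str) -> str:
    # Stand-in for xxHash-64: a 64-bit BLAKE2b digest of the normalized text.
    data = normalize(text).encode("utf-8")
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def word_ngrams(text: str, n: int = 13) -> set:
    """Set of word-level n-grams (assumption: built from normalized text)."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(sample: str, reference_ngrams: set, n: int = 13) -> float:
    """|ngrams_in_sample ∩ reference_ngrams| / |ngrams_in_sample|."""
    grams = word_ngrams(sample, n)
    if not grams:
        return 0.0  # assumption: samples with fewer than n words are not flagged
    return len(grams & reference_ngrams) / len(grams)


def is_contaminated(sample, reference_hashes, reference_ngrams, threshold=0.5):
    # Pass 1: exact match on the hash of the normalized text.
    if text_hash(sample) in reference_hashes:
        return True
    # Pass 2: near-duplicate if at least `threshold` of the 13-grams overlap.
    return containment(sample, reference_ngrams) >= threshold
```

In this sketch, `reference_hashes` and `reference_ngrams` would be built once from the pre-training texts and then reused for every query and corpus document. At the scale quoted above (484 GB), holding all 13-grams in an in-memory Python set is unlikely to be practical, so a real run would need a more compact structure.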
## Decontamination results

| Component | Original | Clean | Removed |
|---|---|---|---|
| Corpus | 2,681,468 | 305,674 | 2,375,794 |
| Queries | 3,452 | 3,127 | 325 |

### Qrels per split

| Split | Original | Clean | Removed |
|---|---|---|---|
| test | 4,201 | 26 | 4,175 |

## Usage

```python
from datasets import load_dataset

# Each config exposes a single split, named in the card metadata.
corpus = load_dataset("lightonai/nq-decontaminated", "corpus", split="corpus")
queries = load_dataset("lightonai/nq-decontaminated", "queries", split="queries")
```

## Citation

Please cite the original BEIR benchmark:

```bibtex
@inproceedings{thakur2021beir,
  title={BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
  author={Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna},
  booktitle={NeurIPS Datasets and Benchmarks},
  year={2021}
}
```

## License

MIT (same as original BEIR)
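
To complement the Usage snippet, here is a hedged sketch of turning relevance judgments into the nested `{query_id: {doc_id: score}}` mapping commonly used for retrieval evaluation, keeping only rows whose query and document are still present. The helper `build_qrels` is hypothetical, and the column names `query-id`, `corpus-id`, and `score` follow the usual BEIR layout; they are an assumption and should be checked against the actual parquet schema.

```python
from collections import defaultdict


def build_qrels(rows, query_ids=None, doc_ids=None):
    """Group qrels rows as {query_id: {doc_id: score}}.

    Assumes BEIR-style columns "query-id", "corpus-id", "score".
    If id sets are given, rows referencing ids outside them are dropped.
    """
    qrels = defaultdict(dict)
    for row in rows:
        qid, did = str(row["query-id"]), str(row["corpus-id"])
        if query_ids is not None and qid not in query_ids:
            continue
        if doc_ids is not None and did not in doc_ids:
            continue
        qrels[qid][did] = int(row["score"])
    return dict(qrels)


# Toy rows standing in for the "qrels-test" config; the ids are made up.
rows = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1},
    {"query-id": "q1", "corpus-id": "d2", "score": 1},
    {"query-id": "q2", "corpus-id": "d3", "score": 1},
]
qrels = build_qrels(rows, query_ids={"q1", "q2"}, doc_ids={"d1", "d3"})
print(qrels)  # {'q1': {'d1': 1}, 'q2': {'d3': 1}}
```

With the real data, `rows` could be the dataset returned by `load_dataset("lightonai/nq-decontaminated", "qrels-test", split="test")`, where the config and split names come from the card metadata.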
Provider:
lightonai