lightonai/climate-fever-decontaminated

Name: lightonai/climate-fever-decontaminated
Creator: lightonai
Published: 2026-03-25 02:34:42
License: no description provided

Source: Hugging Face. Updated 2026-03-25; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/lightonai/climate-fever-decontaminated

Description:

---
language:
- eng
license: mit
task_categories:
- text-retrieval
task_ids:
- document-retrieval
tags:
- decontaminated
- beir
- information-retrieval
configs:
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.parquet
- config_name: queries
  data_files:
  - split: queries
    path: queries.parquet
- config_name: qrels-test
  data_files:
  - split: test
    path: qrels_test.parquet
---

# climate-fever (Decontaminated)

A decontaminated version of the [climate-fever](https://huggingface.co/datasets/BeIR/climate-fever) dataset from the BEIR benchmark, with samples found in the [mgte-en](https://huggingface.co/datasets/Alibaba-NLP/mgte-en) pre-training dataset removed.

## Decontamination methodology

Contamination was detected using a two-pass approach against the full mgte-en dataset (484 GB, 1,235 parquet files):

### Pass 1: Exact hash matching

All texts (queries and corpus documents) were normalized (lowercased, Unicode NFKD, whitespace collapsed) and hashed with xxHash-64. The same normalization and hashing was applied to every `query` and `document` field in mgte-en. Any sample whose hash appeared in mgte-en was flagged as contaminated.

### Pass 2: 13-gram containment (GPT-3 style)

Following the methodology introduced in the GPT-3 paper (Brown et al., 2020), word-level 13-grams were extracted from all remaining samples. For each sample, containment was computed as:

```
containment = |ngrams_in_sample ∩ ngrams_in_mgte| / |ngrams_in_sample|
```

Samples with containment >= 0.5 were flagged as near-duplicates.

### Qrels filtering

Relevance judgments (qrels) referencing any removed query or corpus document were also removed.

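The two passes and the qrels filter described above can be sketched roughly as follows. This is an illustrative sketch, not the pipeline actually used to build the dataset: all function names are hypothetical, the order of the normalization steps is an assumption, and a 64-bit BLAKE2b digest from the standard library stands in for xxHash-64 so the example runs without extra dependencies. It also assumes that texts shorter than 13 words get a containment of 0, which the card does not specify.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, apply Unicode NFKD, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKD", text.lower())
    return " ".join(text.split())


def text_hash(text: str) -> str:
    """Hash the normalized text (stand-in for xxHash-64: 64-bit BLAKE2b)."""
    return hashlib.blake2b(normalize(text).encode("utf-8"), digest_size=8).hexdigest()


def word_ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(sample: str, reference_ngrams: set, n: int = 13) -> float:
    """Fraction of the sample's n-grams that also occur in the reference set."""
    grams = word_ngrams(sample, n)
    if not grams:
        return 0.0  # assumption: texts with fewer than n words are not flagged
    return len(grams & reference_ngrams) / len(grams)


def is_contaminated(sample, reference_hashes, reference_ngrams, threshold=0.5):
    """Pass 1 (exact hash match), then pass 2 (n-gram containment >= threshold)."""
    if text_hash(sample) in reference_hashes:
        return True
    return containment(sample, reference_ngrams) >= threshold


def filter_qrels(qrels, removed_query_ids, removed_doc_ids):
    """Drop (query_id, doc_id, score) rows that reference a removed item."""
    return [
        (q, d, s)
        for q, d, s in qrels
        if q not in removed_query_ids and d not in removed_doc_ids
    ]
```

At the scale reported in the card (484 GB of reference data), holding every 13-gram in an in-memory Python set would likely be impractical; a real pipeline would presumably hash the n-grams or stream the reference data in shards.
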
## Decontamination results

| Component | Original | Clean | Removed |
|---|---|---|---|
| Corpus | 5,416,593 | 5,117,453 | 299,140 |
| Queries | 1,535 | 974 | 561 |

### Qrels per split

| Split | Original | Clean | Removed |
|---|---|---|---|
| test | 4,681 | 2,700 | 1,981 |

## Usage

```python
from datasets import load_dataset

corpus = load_dataset("lightonai/climate-fever-decontaminated", "corpus", split="corpus")
queries = load_dataset("lightonai/climate-fever-decontaminated", "queries", split="queries")
```

## Citation

Please cite the original BEIR benchmark:

```bibtex
@inproceedings{thakur2021beir,
  title={BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
  author={Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna},
  booktitle={NeurIPS Datasets and Benchmarks},
  year={2021}
}
```

## License

MIT (same as original BEIR)
