
lightonai/climate-fever-decontaminated

Source: Hugging Face. Updated 2026-03-25; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/lightonai/climate-fever-decontaminated
---
language:
- eng
license: mit
task_categories:
- text-retrieval
task_ids:
- document-retrieval
tags:
- decontaminated
- beir
- information-retrieval
configs:
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.parquet
- config_name: queries
  data_files:
  - split: queries
    path: queries.parquet
- config_name: qrels-test
  data_files:
  - split: test
    path: qrels_test.parquet
---

# climate-fever (Decontaminated)

A decontaminated version of the [climate-fever](https://huggingface.co/datasets/BeIR/climate-fever) dataset from the BEIR benchmark, with samples found in the [mgte-en](https://huggingface.co/datasets/Alibaba-NLP/mgte-en) pre-training dataset removed.

## Decontamination methodology

Contamination was detected using a two-pass approach against the full mgte-en dataset (484 GB, 1,235 parquet files):

### Pass 1: Exact hash matching

All texts (queries and corpus documents) were normalized (lowercased, unicode NFKD, whitespace collapsed) and hashed with xxHash-64. The same normalization + hashing was applied to every `query` and `document` field in mgte-en. Any sample whose hash appeared in mgte-en was flagged as contaminated.

### Pass 2: 13-gram containment (GPT-3 style)

Following the methodology introduced in the GPT-3 paper (Brown et al., 2020), word-level 13-grams were extracted from all remaining samples. For each sample, containment was computed as:

```
containment = |ngrams_in_sample ∩ ngrams_in_mgte| / |ngrams_in_sample|
```

Samples with containment >= 0.5 were flagged as near-duplicates.

### Qrels filtering

Relevance judgments (qrels) referencing any removed query or corpus document were also removed.

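
The two detection passes and the qrels filter described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline that produced this dataset: all function names are hypothetical, the standard-library BLAKE2b digest (truncated to 64 bits) stands in for xxHash-64 so the block runs without extra dependencies, qrels are modelled as `(query_id, doc_id, score)` tuples, and applying the same normalization before n-gram extraction is an assumption the card does not confirm.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, apply Unicode NFKD, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKD", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def text_hash(text: str) -> str:
    """64-bit hash of the normalized text.

    Stand-in: BLAKE2b with an 8-byte digest; the card uses xxHash-64.
    """
    data = normalize(text).encode("utf-8")
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def word_ngrams(text: str, n: int = 13) -> set:
    """Set of word-level n-grams (assumption: taken from normalized text)."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(sample: str, reference_ngrams: set, n: int = 13) -> float:
    """|ngrams_in_sample ∩ ngrams_in_reference| / |ngrams_in_sample|."""
    grams = word_ngrams(sample, n)
    if not grams:
        # Assumption: texts shorter than n words have no n-grams and are
        # therefore never flagged by the second pass.
        return 0.0
    return len(grams & reference_ngrams) / len(grams)


def is_contaminated(sample, reference_hashes, reference_ngrams, threshold=0.5):
    """Pass 1 (exact hash match), then pass 2 (containment >= threshold)."""
    if text_hash(sample) in reference_hashes:
        return True
    return containment(sample, reference_ngrams) >= threshold


def filter_qrels(qrels, removed_query_ids, removed_doc_ids):
    """Drop judgments that reference a removed query or corpus document."""
    return [
        (qid, did, score)
        for qid, did, score in qrels
        if qid not in removed_query_ids and did not in removed_doc_ids
    ]
```

In this sketch, `reference_hashes` and `reference_ngrams` would be built once by running `text_hash` and `word_ngrams` over every `query` and `document` field of the reference dataset. At the scale quoted above (484 GB), holding every 13-gram in an in-memory set is unlikely to be practical, so a real run would presumably need a more compact or sharded structure.
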
## Decontamination results

| Component | Original | Clean | Removed |
|---|---|---|---|
| Corpus | 5,416,593 | 5,117,453 | 299,140 |
| Queries | 1,535 | 974 | 561 |

### Qrels per split

| Split | Original | Clean | Removed |
|---|---|---|---|
| test | 4,681 | 2,700 | 1,981 |

## Usage

```python
from datasets import load_dataset

corpus = load_dataset("lightonai/climate-fever-decontaminated", "corpus", split="corpus")
queries = load_dataset("lightonai/climate-fever-decontaminated", "queries", split="queries")
```

## Citation

Please cite the original BEIR benchmark:

```bibtex
@inproceedings{thakur2021beir,
  title={BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
  author={Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Irena},
  booktitle={NeurIPS Datasets and Benchmarks},
  year={2021}
}
```

## License

MIT (same as original BEIR)

Provider:
lightonai