lightonai/climate-fever-decontaminated
Source: Hugging Face (updated 2026-03-25, indexed 2026-04-12)
Download link: https://hf-mirror.com/datasets/lightonai/climate-fever-decontaminated
---
language:
- eng
license: mit
task_categories:
- text-retrieval
task_ids:
- document-retrieval
tags:
- decontaminated
- beir
- information-retrieval
configs:
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.parquet
- config_name: queries
  data_files:
  - split: queries
    path: queries.parquet
- config_name: qrels-test
  data_files:
  - split: test
    path: qrels_test.parquet
---
# climate-fever (Decontaminated)
A decontaminated version of the [climate-fever](https://huggingface.co/datasets/BeIR/climate-fever) dataset from the BEIR benchmark, with samples found in the [mgte-en](https://huggingface.co/datasets/Alibaba-NLP/mgte-en) pre-training dataset removed.
## Decontamination methodology
Contamination was detected using a two-pass approach against the full mgte-en dataset (484 GB, 1,235 parquet files):
### Pass 1: Exact hash matching
All texts (queries and corpus documents) were normalized (lowercased, Unicode NFKD normalization applied, whitespace collapsed) and hashed with xxHash-64. The same normalization and hashing were applied to every `query` and `document` field in mgte-en. Any sample whose hash appeared in mgte-en was flagged as contaminated.
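The exact-match pass can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names (`normalize`, `text_hash`, `exact_matches`) and the order of the normalization steps are assumptions. If the third-party `xxhash` package is not installed, the sketch falls back to an 8-byte BLAKE2b digest from the standard library, which is a stand-in and not xxHash-64.

```python
import hashlib
import unicodedata

try:
    import xxhash  # third-party package; optional for this sketch
except ImportError:
    xxhash = None


def normalize(text: str) -> str:
    """Apply Unicode NFKD, lowercase, and collapse whitespace runs (order assumed)."""
    text = unicodedata.normalize("NFKD", text).lower()
    return " ".join(text.split())


def text_hash(text: str) -> int:
    """Return a 64-bit hash of the normalized text."""
    data = normalize(text).encode("utf-8")
    if xxhash is not None:
        return xxhash.xxh64(data).intdigest()
    # Stand-in (not xxHash): 8-byte BLAKE2b digest from the standard library.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def exact_matches(samples: dict[str, str], reference_hashes: set[int]) -> set[str]:
    """Return ids of samples whose hash appears in the reference hash set."""
    return {sid for sid, text in samples.items() if text_hash(text) in reference_hashes}


# Hypothetical data: one reference text and two candidate samples.
reference = {text_hash("Global  warming is REAL.")}
samples = {"q1": "global warming is real.", "q2": "Sea levels are falling."}
flagged = exact_matches(samples, reference)
```

In practice the reference hash set would be built once by streaming every `query` and `document` field of mgte-en, so that each benchmark sample needs only a set lookup.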
### Pass 2: 13-gram containment (GPT-3 style)
Following the methodology introduced in the GPT-3 paper (Brown et al., 2020), word-level 13-grams were extracted from all remaining samples. For each sample, containment was computed as:
```
containment = |ngrams_in_sample ∩ ngrams_in_mgte| / |ngrams_in_sample|
```
Samples with containment >= 0.5 were flagged as near-duplicates.
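The containment check above can be sketched as follows. This is an illustrative sketch under assumptions: whitespace tokenization, the function names, and the treatment of samples shorter than 13 words (containment 0, never flagged) are not specified in the card.

```python
def word_ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams (whitespace tokenization assumed)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(sample_ngrams: set, reference_ngrams: set) -> float:
    """Fraction of the sample's n-grams that also occur in the reference set."""
    if not sample_ngrams:
        # Assumption: samples with fewer than n words have no n-grams and score 0.
        return 0.0
    return len(sample_ngrams & reference_ngrams) / len(sample_ngrams)


def is_near_duplicate(text: str, reference_ngrams: set,
                      n: int = 13, threshold: float = 0.5) -> bool:
    """Flag a sample whose containment reaches the threshold."""
    return containment(word_ngrams(text, n), reference_ngrams) >= threshold
```

A 14-word sample has two 13-grams, so sharing one of them with the reference set gives a containment of exactly 0.5, which meets the threshold.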
### Qrels filtering
Relevance judgments (qrels) referencing any removed query or corpus document were also removed.
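A minimal sketch of this filtering step is shown below. The column names `query-id` and `corpus-id` follow the usual BEIR qrels layout and are an assumption here, as is the function name `filter_qrels`; check the actual schema of `qrels_test.parquet` before relying on them.

```python
def filter_qrels(qrels: list[dict], removed_query_ids: set[str],
                 removed_doc_ids: set[str]) -> list[dict]:
    """Keep only judgments whose query and document both survived decontamination."""
    return [
        row for row in qrels
        if row["query-id"] not in removed_query_ids
        and row["corpus-id"] not in removed_doc_ids
    ]


# Hypothetical rows for illustration only.
qrels = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1},
    {"query-id": "q2", "corpus-id": "d1", "score": 1},
    {"query-id": "q1", "corpus-id": "d2", "score": 1},
]
kept = filter_qrels(qrels, removed_query_ids={"q2"}, removed_doc_ids={"d2"})
```

Only the first row is kept, because the second references a removed query and the third a removed document.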
## Decontamination results
| Component | Original | Clean | Removed |
|---|---|---|---|
| Corpus | 5,416,593 | 5,117,453 | 299,140 |
| Queries | 1,535 | 974 | 561 |
### Qrels per split
| Split | Original | Clean | Removed |
|---|---|---|---|
| test | 4,681 | 2,700 | 1,981 |
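As a quick sanity check on the tables above, the removed counts and removal rates can be recomputed from the original and clean counts. The helper name `removal_stats` is hypothetical; the numbers are copied from the tables.

```python
# (original, clean) counts copied from the tables above.
counts = {
    "corpus": (5_416_593, 5_117_453),
    "queries": (1_535, 974),
    "qrels (test)": (4_681, 2_700),
}


def removal_stats(original: int, clean: int) -> tuple[int, float]:
    """Return the number of removed items and the fraction removed."""
    removed = original - clean
    return removed, removed / original


for name, (original, clean) in counts.items():
    removed, rate = removal_stats(original, clean)
    print(f"{name}: removed {removed:,} ({rate:.1%})")
```

The removed counts match the tables, and the rates come out to roughly 5.5% of the corpus, 36.5% of the queries, and 42.3% of the test qrels.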
## Usage
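```python
from datasets import load_dataset

corpus = load_dataset("lightonai/climate-fever-decontaminated", "corpus", split="corpus")
queries = load_dataset("lightonai/climate-fever-decontaminated", "queries", split="queries")
# Relevance judgments: config "qrels-test", split "test" (see the card metadata)
qrels = load_dataset("lightonai/climate-fever-decontaminated", "qrels-test", split="test")
```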
## Citation
Please cite the original BEIR benchmark:
```bibtex
@inproceedings{thakur2021beir,
title={BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
author={Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Irena},
booktitle={NeurIPS Datasets and Benchmarks},
year={2021}
}
```
## License
MIT (same as original BEIR)
Provider: lightonai



