Contamination_Detector
Source: ModelScope community. Updated 2025-11-25; indexed 2025-06-14.
Download link: https://modelscope.cn/datasets/opencompass/Contamination_Detector
<p align="center">
<img src="https://github.com/liyucheng09/Contamination_Detector/blob/master/pics/logo.png" alt="Logo of Contamination Detector" width="auto" height="160" />
</p>
# Contamination Detector for LLM Evaluation
Data contamination is a pervasive and critical issue in the evaluation of Large Language Models (LLMs). Our **Contamination Detector** identifies and analyzes potential contamination without needing access to an LLM's training data, so the community can audit LLM evaluation results and run more robust evaluations.
**News!!**
- Our new preprint: [An open source data contamination report for large language models](https://arxiv.org/abs/2310.17589)!
# Our method: checking potential contamination via search engines
Contamination Detector checks whether test examples appear on the internet via **Bing search** and **Common Crawl index**. We categorize test samples into three subsets:
1. **Clean** set: the question and reference answer do not appear online.
2. **Input-only contaminated** set: the question appears online, but not its answer.
3. **Input-and-label contaminated** set: both question and answer appear online.
If either the "question" or "answer" of a test example is found online, this sample may have been included in the LLM's training data. As a result, LLMs might gain an **unfair advantage by 'remembering' these samples**, rather than genuinely **understanding or solving them**.
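The three-way split above can be sketched as a small function. This is an illustrative sketch only, not code from the repository: the names `Contamination` and `categorize` are hypothetical, and treating an answer-only match as clean is an assumption made here because the three subsets do not name that case.

```python
from enum import Enum


class Contamination(Enum):
    CLEAN = "clean"
    INPUT_ONLY = "input-only contaminated"
    INPUT_AND_LABEL = "input-and-label contaminated"


def categorize(question_found: bool, answer_found: bool) -> Contamination:
    """Map two search outcomes onto the three subsets described above."""
    if question_found and answer_found:
        return Contamination.INPUT_AND_LABEL
    if question_found:
        return Contamination.INPUT_ONLY
    # Assumption: an answer matched without its question is not counted,
    # since short answers such as "C" can match unrelated pages.
    return Contamination.CLEAN


print(categorize(question_found=True, answer_found=False).value)
```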
We currently support the following popular LLM benchmarks:
- MMLU
- CEval
- Winogrande
- ARC
- Hellaswag
- CommonsenseQA
# Getting started: test LLMs' degree of contamination
1. Clone the repository and install the required packages:
```
git clone https://github.com/liyucheng09/Contamination_Detector.git
cd Contamination_Detector/
pip install -r requirements.txt
```
2. The analysis needs model predictions. We have prepared predictions for the following LLMs:
- LLaMA 7,13,30,65B
- Llama-2 7,13,70B
- Qwen-7b
- Baichuan2-7B
- Mistral-7B
- Mistral Instruct 7B
- Yi 6B
You can download these predictions directly, without running inference yourself:
```
wget https://github.com/liyucheng09/Contamination_Detector/releases/download/v0.1.1rc2/model_predictions.zip
unzip model_predictions.zip
```
To run the analysis on your own predictions, format them as follows and place the file under `model_predictions/`:
```
{
"mmlu": {
"business_ethics 0": {
"gold": "C",
"pred": "A"
},
"business_ethics 1": {
"gold": "B",
"pred": "A"
},
"business_ethics 2": {
"gold": "D",
"pred": "A"
},
"business_ethics 3": {
"gold": "D",
"pred": "D"
},
"business_ethics 4": {
"gold": "B",
"pred": "B"
},
.....
```
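As a quick sanity check on a prediction file in this format, accuracy can be computed directly from the `gold` and `pred` fields. The sketch below is not part of the repository; `accuracy_by_subject` is a hypothetical helper, and it assumes every key has the form `<subject> <index>` as in the example above.

```python
from collections import defaultdict


def accuracy_by_subject(predictions: dict) -> dict:
    """Return accuracy per subject for one benchmark's predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for key, record in predictions.items():
        # "business_ethics 0" -> "business_ethics"
        subject = key.rsplit(" ", 1)[0]
        total[subject] += 1
        correct[subject] += int(record["gold"] == record["pred"])
    return {subject: correct[subject] / total[subject] for subject in total}


# The five records shown in the example above.
sample = {
    "mmlu": {
        "business_ethics 0": {"gold": "C", "pred": "A"},
        "business_ethics 1": {"gold": "B", "pred": "A"},
        "business_ethics 2": {"gold": "D", "pred": "A"},
        "business_ethics 3": {"gold": "D", "pred": "D"},
        "business_ethics 4": {"gold": "B", "pred": "B"},
    }
}

print(accuracy_by_subject(sample["mmlu"]))  # {'business_ethics': 0.4}
```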
3. Generate contamination analysis table:
```
python clean_dirty_comparison.py
```
This uses the contamination annotations under `reports/` to report each model's performance on the clean, input-only contaminated, and input-and-label contaminated subsets.
See how the performance of Llama-2 70B differs on the three subsets.
| Dataset | Condition | Llama-2 70B |
|-----------|--------------------|--------------|
| MMLU | Clean | .6763 |
| MMLU | All Dirty | .6667 ↓ |
| MMLU | Input-label Dirty | .7093 ↑ |
| Hellaswag | Clean | .7726 |
| Hellaswag | All Dirty | .8348 ↑ |
| Hellaswag | Input-label Dirty | .8455 ↑ |
| ARC | Clean | .4555 |
| ARC | All Dirty | .5632 ↑ |
| ARC | Input-label Dirty | .5667 ↑ |
| Average | Clean | .6348 |
| Average | All Dirty | .6882 ↑ |
| Average | Input-label Dirty | .7072 ↑ |
In addition to this table, `clean_dirty_comparison.py` produces a figure showing how performance changes with the recall score (the extent of contamination for a sample).
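The comparison in the table can be pictured as grouping predictions by contamination label and computing accuracy per group. The following is a minimal sketch under stated assumptions, not the logic of `clean_dirty_comparison.py`: the function name, the `labels` mapping, and its label strings are hypothetical, and "All Dirty" is assumed to pool the input-only and input-and-label subsets.

```python
def subset_accuracy(predictions: dict, labels: dict) -> dict:
    """Accuracy on the clean, all-dirty, and input-and-label-dirty subsets.

    `labels` maps each sample key to "clean", "input_only", or
    "input_and_label" (hypothetical label names).
    """
    hits = {"clean": [], "all_dirty": [], "input_label_dirty": []}
    for key, record in predictions.items():
        correct = record["gold"] == record["pred"]
        label = labels[key]
        if label == "clean":
            hits["clean"].append(correct)
        else:
            # Assumption: "All Dirty" pools both contaminated subsets.
            hits["all_dirty"].append(correct)
            if label == "input_and_label":
                hits["input_label_dirty"].append(correct)
    # An empty subset yields None rather than a division by zero.
    return {name: (sum(v) / len(v) if v else None) for name, v in hits.items()}


# Toy data for illustration only; these are not real results.
toy_predictions = {
    "q 0": {"gold": "C", "pred": "C"},
    "q 1": {"gold": "B", "pred": "A"},
    "q 2": {"gold": "D", "pred": "D"},
    "q 3": {"gold": "A", "pred": "A"},
}
toy_labels = {
    "q 0": "clean",
    "q 1": "clean",
    "q 2": "input_only",
    "q 3": "input_and_label",
}
print(subset_accuracy(toy_predictions, toy_labels))
```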
# Audit your own evaluation data
To check your own benchmark for potential contamination, use the script that identifies potentially contaminated test samples in your data.
First, set up your benchmark in `utils.py`: specify how to load the benchmark, its verbalization method, and so on.
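To give a rough idea of what a verbalization method might do, the sketch below turns one multiple-choice example into two search strings, one for the question alone and one for the question plus its reference answer. This is an assumption for illustration: the function `verbalize` and its return keys are hypothetical and do not describe the actual interface in `utils.py`.

```python
def verbalize(question: str, choices: list[str], answer_index: int) -> dict:
    """Build search strings for one multiple-choice example (hypothetical)."""
    question = question.strip()
    answer = choices[answer_index].strip()
    return {
        # Query used to check whether the question appears online.
        "input_query": question,
        # Query used to check whether question and answer appear together.
        "input_and_label_query": f"{question} {answer}",
    }


queries = verbalize(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Mars", "Earth"],
    answer_index=1,
)
print(queries["input_and_label_query"])
```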
Then run the following to produce contamination reports for your benchmark:
```
python search.py
```
To run this script, you need an access key for the Bing Search API, which you can obtain [here](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api). A free key allows 1,000 calls per month, and students receive $100 in credit when creating a new account.
Set the key in your terminal via `export Bing_Key="YOUR_API_KEY"` (the shell does not allow spaces around `=`).
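A script can fail early with a clear message when the key is missing. The snippet below is a small sketch of that check, assuming the key is read from the `Bing_Key` environment variable; the helper `get_bing_key` is hypothetical and may not match how `search.py` reads the key.

```python
import os
from typing import Mapping


def get_bing_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Bing API key, or raise if it has not been exported."""
    key = env.get("Bing_Key", "").strip()
    if not key:
        raise RuntimeError("Bing_Key is not set; export it before running search.py")
    return key


# Example with an explicit mapping so the snippet runs without a real key.
print(get_bing_key({"Bing_Key": "dummy-key-for-illustration"}))
```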
`search.py` generates a report under `reports/`, such as `reports/mmlu_report.json`, that highlights all online matches. For example:
```
[
{
"input": "The economy is in a deep recession. Given this economic situation which of the following statements about monetary policy is accurate?",
"match_string": "The economy is in a deep recession. Given this economic situation, which of the following statements about monetary policy is accurate policy recession policy",
"score": 0.900540825748582,
"name": "<b>AP Macroeconomics Question 445: Answer and Explanation</b> - CrackAP.com",
"contaminated_url": "https://www.crackap.com/ap/macroeconomics/question-445-answer-and-explanation.html"
},
...
```
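Once a report exists, its entries can be filtered by `score` to surface the strongest matches first. The sketch below relies only on the fields shown in the example above; the helper `top_matches`, the 0.9 threshold, and the second sample entry are assumptions for illustration.

```python
def top_matches(report: list, threshold: float = 0.9) -> list:
    """Return entries scoring at or above `threshold`, best match first."""
    hits = [entry for entry in report if entry["score"] >= threshold]
    return sorted(hits, key=lambda entry: entry["score"], reverse=True)


report = [
    {
        "input": "The economy is in a deep recession. Given this economic "
                 "situation which of the following statements about monetary "
                 "policy is accurate?",
        "score": 0.900540825748582,
        "contaminated_url": "https://www.crackap.com/ap/macroeconomics/"
                            "question-445-answer-and-explanation.html",
    },
    # Hypothetical low-scoring entry, included only to show the filter.
    {"input": "placeholder question", "score": 0.42, "contaminated_url": ""},
]

for entry in top_matches(report):
    print(f'{entry["score"]:.3f}  {entry["contaminated_url"]}')
```

To use it on a real report, load the file with `json.load` and pass the resulting list to `top_matches`.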
Reports for six popular multiple-choice QA benchmarks are available under `reports/`.
To visualize the results, see [visualize](https://github.com/liyucheng09/Contamination_Detector/tree/master/visualize).
**Check contamination examples: MMLU [here](https://htmlpreview.github.io/?https://github.com/liyucheng09/Contamination_Detector/blob/master/reports/mmlu.html), and C-Eval [here](https://htmlpreview.github.io/?https://github.com/liyucheng09/Contamination_Detector/blob/master/reports/ceval.html)**
If you cannot access the Hugging Face Hub for the benchmark datasets, download them as JSON files [here](https://github.com/liyucheng09/Contamination_Detector/releases/tag/v0.1.1).
## Citation:
Please consider citing our project if you find it helpful:
```
@article{Li2023AnOS,
title={An Open Source Data Contamination Report for Large Language Models},
author={Yucheng Li},
journal={ArXiv},
year={2023},
volume={abs/2310.17589},
}
```
## Issues
Open an issue or contact me via email if you encounter any problems.
Provider: maas
Created: 2024-07-04



