Contamination_Detector

Name: Contamination_Detector
Creator: maas
Published: 2025-11-25 16:15:53
License: No description provided

ModelScope community. Updated: 2025-11-25. Indexed: 2025-06-14.

Download link:

https://modelscope.cn/datasets/opencompass/Contamination_Detector


Description:

<img src="https://github.com/liyucheng09/Contamination_Detector/blob/master/pics/logo.png" alt="Logo of Contamination Detector" width="auto" height="160" />

# Contamination Detector for LLMs Evaluation

Data contamination is a pervasive and critical issue in the evaluation of Large Language Models (LLMs). Our **Contamination Detector** is designed to identify and analyze potential contamination without needing access to the LLMs' training data, enabling the community to audit LLM evaluation results and conduct robust evaluation.

**News!!**

- Our new preprint: [An open source data contamination report for large language models](https://arxiv.org/abs/2310.17589)!

# Our Methods: check potential contamination via search engine

Contamination Detector checks whether test examples appear on the internet via **Bing search** and the **Common Crawl index**. We categorize test samples into three subsets:

1. **Clean** set: the question and reference answer do not appear online.
2. **Input-only contaminated** set: the question appears online, but not its answer.
3. **Input-and-label contaminated** set: both question and answer appear online.

If either the "question" or "answer" of a test example is found online, this sample may have been included in the LLM's training data. As a result, LLMs might gain an **unfair advantage by 'remembering' these samples**, rather than genuinely **understanding or solving them**.

We now support the following popular LLM benchmarks:

- MMLU
- CEval
- Winogrande
- ARC
- Hellaswag
- CommonsenseQA

# Get started: Test LLMs' degree of contamination

1. Clone the repository and install the required packages:

```
git clone https://github.com/liyucheng09/Contamination_Detector.git
cd Contamination_Detector/
pip install -r requirements.txt
```

2. We need model predictions to further analyze their data contamination issue.
We have prepared model predictions for the following LLMs:

- LLaMA 7,13,30,65B
- Llama-2 7,13,70B
- Qwen-7b
- Baichuan2-7B
- Mistral-7B
- Mistral Instruct 7B
- Yi 6B

You can download them directly, without running inference yourself:

```
wget https://github.com/liyucheng09/Contamination_Detector/releases/download/v0.1.1rc2/model_predictions.zip
unzip model_predictions.zip
```

If you want to run the analysis on your own prediction data, format your model predictions as follows and put them under `model_predictions/`:

```
{
    "mmlu": {
        "business_ethics 0": {
            "gold": "C",
            "pred": "A"
        },
        "business_ethics 1": {
            "gold": "B",
            "pred": "A"
        },
        "business_ethics 2": {
            "gold": "D",
            "pred": "A"
        },
        "business_ethics 3": {
            "gold": "D",
            "pred": "D"
        },
        "business_ethics 4": {
            "gold": "B",
            "pred": "B"
        },
        .....
```

3. Generate the contamination analysis table:

```
python clean_dirty_comparison.py
```

This uses the contamination annotation under `reports/` to compute each model's performance on the clean, input-only contaminated, and input-and-label contaminated subsets. See how the performance of Llama-2 70B differs across the subsets:

| Dataset   | Condition          | Llama-2 70B  |
|-----------|--------------------|--------------|
| MMLU      | Clean              | .6763        |
| MMLU      | All Dirty          | .6667 ↓      |
| MMLU      | Input-label Dirty  | .7093 ↑      |
| Hellaswag | Clean              | .7726        |
| Hellaswag | All Dirty          | .8348 ↑      |
| Hellaswag | Input-label Dirty  | .8455 ↑      |
| ARC       | Clean              | .4555        |
| ARC       | All Dirty          | .5632 ↑      |
| ARC       | Input-label Dirty  | .5667 ↑      |
| Average   | Clean              | .6348        |
| Average   | All Dirty          | .6882 ↑      |
| Average   | Input-label Dirty  | .7072 ↑      |

Besides this table, `clean_dirty_comparison.py` also produces a figure illustrating how performance changes with the recall score (the extent of contamination for a sample).
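The per-subset comparison can be sketched in a few lines of Python. This is a minimal illustration, not the code in `clean_dirty_comparison.py`: the `labels` mapping from sample id to subset name is a hypothetical stand-in for the contamination annotation under `reports/`, and the function name `accuracy_by_subset` and the subset names are invented for this example.

```python
from collections import defaultdict


def accuracy_by_subset(predictions, labels):
    """Return accuracy for each contamination subset.

    predictions: {sample_id: {"gold": str, "pred": str}}, one benchmark's
                 entries in the prediction format described above.
    labels:      {sample_id: subset_name}, a hypothetical annotation mapping.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample_id, record in predictions.items():
        subset = labels.get(sample_id)
        if subset is None:
            continue  # skip samples that have no contamination annotation
        total[subset] += 1
        correct[subset] += int(record["gold"] == record["pred"])
    return {subset: correct[subset] / total[subset] for subset in total}


# Predictions copied from the format example above.
predictions = {
    "business_ethics 0": {"gold": "C", "pred": "A"},
    "business_ethics 1": {"gold": "B", "pred": "A"},
    "business_ethics 2": {"gold": "D", "pred": "A"},
    "business_ethics 3": {"gold": "D", "pred": "D"},
    "business_ethics 4": {"gold": "B", "pred": "B"},
}

# Hypothetical labels, invented purely for illustration.
labels = {
    "business_ethics 0": "clean",
    "business_ethics 1": "clean",
    "business_ethics 2": "input_only",
    "business_ethics 3": "input_and_label",
    "business_ethics 4": "input_and_label",
}

scores = accuracy_by_subset(predictions, labels)
for subset, accuracy in scores.items():
    print(f"{subset}: {accuracy:.4f}")
```

Keeping the counts per subset separate makes it straightforward to add an "all dirty" row by merging the two contaminated subsets before dividing.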
# Audit your own evaluation data

To check for potential contamination in your own benchmark, we provide a script that identifies potentially contaminated test samples in your data.

Set up your benchmark in `utils.py`; this requires you to specify how to load your benchmark, its verbalization method, etc. Then run the following to produce contamination reports for your benchmark:

```
python search.py
```

To run this script, you will need a free access token for the Bing search API. You can obtain one [here](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api). A free access key allows 1000 calls per month. Students receive $100 in funding when creating a new account. Set the key via `export Bing_Key=[YOUR API KEY]` in the terminal.

`search.py` will generate a report under `reports/`, such as `reports/mmlu_report.json`, that highlights all matches found online, for example:

```
[
    {
        "input": "The economy is in a deep recession. Given this economic situation which of the following statements about monetary policy is accurate?",
        "match_string": "The economy is in a deep recession. Given this economic situation, which of the following statements about monetary policy is accurate policy recession policy",
        "score": 0.900540825748582,
        "name": "AP Macroeconomics Question 445: Answer and Explanation - CrackAP.com",
        "contaminated_url": "https://www.crackap.com/ap/macroeconomics/question-445-answer-and-explanation.html"
    },
    ...
```

Reports for six popular multiple-choice QA benchmarks are available under `/reports`. To visualize the results, see [visualize](https://github.com/liyucheng09/Contamination_Detector/tree/master/visualize).
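A report with the fields listed above (`score`, `contaminated_url`, and so on) can be post-processed with standard-library Python. The sketch below keeps matches above a score threshold and counts them by host; the 0.8 threshold, the function names, and the second sample entry are assumptions made for illustration and are not part of the repository.

```python
import json
from collections import Counter
from urllib.parse import urlparse


def load_matches(report_text, min_score=0.8):
    """Parse a report (a JSON list of match entries) and keep strong matches.

    min_score is an arbitrary illustrative threshold, not a project default.
    """
    entries = json.loads(report_text)
    return [entry for entry in entries if entry["score"] >= min_score]


def count_domains(matches):
    """Count how many retained matches come from each host."""
    return Counter(urlparse(m["contaminated_url"]).netloc for m in matches)


sample_report = json.dumps([
    {
        # Entry taken from the example report in this README.
        "input": "The economy is in a deep recession. Given this economic "
                 "situation which of the following statements about "
                 "monetary policy is accurate?",
        "score": 0.900540825748582,
        "name": "AP Macroeconomics Question 445: Answer and Explanation - CrackAP.com",
        "contaminated_url": "https://www.crackap.com/ap/macroeconomics/question-445-answer-and-explanation.html",
    },
    {
        # Hypothetical weak match, invented to show the threshold filter.
        "input": "A made-up question used only for this sketch.",
        "score": 0.35,
        "name": "Hypothetical page",
        "contaminated_url": "https://example.com/page",
    },
])

matches = load_matches(sample_report)
domains = count_domains(matches)
print(len(matches), dict(domains))
```

To run it against a real report, read the file contents (for example `reports/mmlu_report.json`) and pass the text to `load_matches`.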
**Check contamination examples: MMLU [here](https://htmlpreview.github.io/?https://github.com/liyucheng09/Contamination_Detector/blob/master/reports/mmlu.html), and C-Eval [here](https://htmlpreview.github.io/?https://github.com/liyucheng09/Contamination_Detector/blob/master/reports/ceval.html)**

If you cannot access the Hugging Face Hub for the benchmark datasets, download them as JSON files [here](https://github.com/liyucheng09/Contamination_Detector/releases/tag/v0.1.1).

## Citation:

Consider citing our project if you find it helpful:

```
@article{Li2023AnOS,
  title={An Open Source Data Contamination Report for Large Language Models},
  author={Yucheng Li},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.17589},
}
```

## Issues

Open an issue or contact me via email if you encounter any problems.


Provider:

maas

Created:

2024-07-04
