
Contamination_Detector

ModelScope community. Updated: 2025-11-25. Indexed: 2025-06-14.
Download link:
https://modelscope.cn/datasets/opencompass/Contamination_Detector
Resource description:
<p align="center"> <img src="https://github.com/liyucheng09/Contamination_Detector/blob/master/pics/logo.png" alt="Logo of Contamination Detector" width="auto" height="160" /> </p>

# Contamination Detector for LLMs Evaluation

Data contamination is a pervasive and critical issue in the evaluation of Large Language Models (LLMs). Our **Contamination Detector** is designed to identify and analyze potential contamination without needing access to the LLMs' training data, enabling the community to audit LLM evaluation results and conduct robust evaluation.

**News!!**

- Our new preprint: [An open source data contamination report for large language models](https://arxiv.org/abs/2310.17589)!

# Our Methods: check potential contamination via search engine

Contamination Detector checks whether test examples appear on the internet via **Bing search** and the **Common Crawl index**. We categorize test samples into three subsets:

1. **Clean** set: the question and reference answer do not appear online.
2. **Input-only contaminated** set: the question appears online, but not its answer.
3. **Input-and-label contaminated** set: both question and answer appear online.

If either the "question" or the "answer" of a test example is found online, the sample may have been included in the LLM's training data. As a result, LLMs might gain an **unfair advantage by 'remembering' these samples**, rather than genuinely **understanding or solving them**.

We currently support the following popular LLM benchmarks:

- MMLU
- CEval
- Winogrande
- ARC
- Hellaswag
- CommonsenseQA

# Getting started: Test LLMs' degree of contamination

1. Clone the repository and install the required packages:

```
git clone https://github.com/liyucheng09/Contamination_Detector.git
cd Contamination_Detector/
pip install -r requirements.txt
```

2. We need model predictions to analyze a model's data contamination issue.
We have prepared model predictions for the following LLMs:

- LLaMA 7,13,30,65B
- Llama-2 7,13,70B
- Qwen-7b
- Baichuan2-7B
- Mistral-7B
- Mistral Instruct 7B
- Yi 6B

You can download them directly, without running inference yourself:

```
wget https://github.com/liyucheng09/Contamination_Detector/releases/download/v0.1.1rc2/model_predictions.zip
unzip model_predictions.zip
```

If you want to run the analysis on your own predictions, format them as follows and put the file under `model_predictions/`:

```
{
    "mmlu": {
        "business_ethics 0": {
            "gold": "C",
            "pred": "A"
        },
        "business_ethics 1": {
            "gold": "B",
            "pred": "A"
        },
        "business_ethics 2": {
            "gold": "D",
            "pred": "A"
        },
        "business_ethics 3": {
            "gold": "D",
            "pred": "D"
        },
        "business_ethics 4": {
            "gold": "B",
            "pred": "B"
        },
        .....
```

3. Generate the contamination analysis table:

```
python clean_dirty_comparison.py
```

This uses the contamination annotations under `reports/` to compute each model's performance on the clean, input-only contaminated, and input-and-label contaminated subsets. See how the performance of Llama-2 70B differs across the subsets:

| Dataset   | Condition          | Llama-2 70B  |
|-----------|--------------------|--------------|
| MMLU      | Clean              | .6763        |
| MMLU      | All Dirty          | .6667 ↓      |
| MMLU      | Input-label Dirty  | .7093 ↑      |
| Hellaswag | Clean              | .7726        |
| Hellaswag | All Dirty          | .8348 ↑      |
| Hellaswag | Input-label Dirty  | .8455 ↑      |
| ARC       | Clean              | .4555        |
| ARC       | All Dirty          | .5632 ↑      |
| ARC       | Input-label Dirty  | .5667 ↑      |
| Average   | Clean              | .6348        |
| Average   | All Dirty          | .6882 ↑      |
| Average   | Input-label Dirty  | .7072 ↑      |

Besides this table, `clean_dirty_comparison.py` also produces a figure illustrating how performance changes with the recall score (the extent of contamination for a sample).
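The subset comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of `clean_dirty_comparison.py`: the `flags` mapping (whether the question and the answer of each sample were found online), the helper names, and the reading of "All Dirty" as the union of the two contaminated subsets are all hypothetical. Only the `gold`/`pred` prediction format is taken from the README.

```python
from collections import defaultdict


def categorize(input_found: bool, label_found: bool) -> str:
    """Map per-sample search outcomes to one of the three subsets.

    Assumption: a sample whose question was not found online is treated
    as clean, even if its answer string alone was found.
    """
    if not input_found:
        return "clean"
    return "input_and_label" if label_found else "input_only"


def subset_accuracy(predictions: dict, flags: dict) -> dict:
    """Compute accuracy per subset, plus a combined 'all_dirty' bucket."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample_id, record in predictions.items():
        subset = categorize(*flags[sample_id])
        hit = int(record["gold"] == record["pred"])
        # Contaminated samples also count towards the hypothetical union bucket.
        buckets = [subset] if subset == "clean" else [subset, "all_dirty"]
        for bucket in buckets:
            total[bucket] += 1
            correct[bucket] += hit
    return {name: correct[name] / total[name] for name in total}


# Predictions in the format documented above (one benchmark's entries).
predictions = {
    "business_ethics 0": {"gold": "C", "pred": "A"},
    "business_ethics 1": {"gold": "B", "pred": "A"},
    "business_ethics 2": {"gold": "D", "pred": "A"},
    "business_ethics 3": {"gold": "D", "pred": "D"},
    "business_ethics 4": {"gold": "B", "pred": "B"},
}

# Hypothetical annotations: (question found online, answer found online).
flags = {
    "business_ethics 0": (False, False),
    "business_ethics 1": (False, False),
    "business_ethics 2": (True, False),
    "business_ethics 3": (True, True),
    "business_ethics 4": (True, True),
}

scores = subset_accuracy(predictions, flags)
for name, value in sorted(scores.items()):
    print(f"{name}: {value:.4f}")
```

The flags in this toy example are made up, so the resulting numbers say nothing about any real model; the point is only how one prediction file can be sliced by contamination category.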
# Audit your own evaluation data

To check for potential contamination in your own benchmark, we provide a script that identifies potentially contaminated test samples in your data.

Set up your benchmark in `utils.py`; this requires you to specify how to load your benchmark, its verbalization methods, etc. Then run the following to produce contamination reports for your benchmark:

```
python search.py
```

To run this script, you will need a free access token for the Bing search API. You can obtain one via [this](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api). A free access key allows 1000 calls monthly. Students receive $100 in funding when creating a new account. Set the key via `export Bing_Key=[YOUR API KEY]` in the terminal (no spaces around `=`).

`search.py` will generate a report under `reports/`, such as `reports/mmlu_report.json`, that highlights all matches found online, for example:

```
[
    {
        "input": "The economy is in a deep recession. Given this economic situation which of the following statements about monetary policy is accurate?",
        "match_string": "The economy is in a deep recession. Given this economic situation, which of the following statements about monetary policy is accurate policy recession policy",
        "score": 0.900540825748582,
        "name": "<b>AP Macroeconomics Question 445: Answer and Explanation</b> - CrackAP.com",
        "contaminated_url": "https://www.crackap.com/ap/macroeconomics/question-445-answer-and-explanation.html"
    },
    ...
```

Reports for six popular multi-choice QA benchmarks are available under `/reports`. To visualize the results, see [visualize](https://github.com/liyucheng09/Contamination_Detector/tree/master/visualize).
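A report of this shape can be post-processed with the standard library. The sketch below assumes the report is a JSON list of objects with the `score` and `contaminated_url` fields shown in the example; the `0.9` threshold, the helper names, and the second report entry are hypothetical and only serve the illustration.

```python
import json
from collections import Counter
from urllib.parse import urlparse


def high_confidence_matches(report: list, threshold: float = 0.9) -> list:
    """Keep entries whose match score reaches the (hypothetical) threshold."""
    return [entry for entry in report if entry["score"] >= threshold]


def count_by_domain(entries: list) -> Counter:
    """Count matched pages per host name."""
    return Counter(urlparse(entry["contaminated_url"]).netloc for entry in entries)


# Inline stand-in for a file such as reports/mmlu_report.json.
# The first entry mirrors the README example; the second one is made up.
report = json.loads("""
[
    {
        "input": "The economy is in a deep recession. Given this economic situation which of the following statements about monetary policy is accurate?",
        "score": 0.900540825748582,
        "contaminated_url": "https://www.crackap.com/ap/macroeconomics/question-445-answer-and-explanation.html"
    },
    {
        "input": "A made-up question used only for this sketch.",
        "score": 0.42,
        "contaminated_url": "https://example.com/quiz"
    }
]
""")

matches = high_confidence_matches(report)
domains = count_by_domain(matches)
for domain, count in domains.most_common():
    print(domain, count)
```

To run this against a real report, replace the inline string with `json.load(open(path))`, assuming the generated file is valid JSON.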
**Check contamination examples: MMLU [here](https://htmlpreview.github.io/?https://github.com/liyucheng09/Contamination_Detector/blob/master/reports/mmlu.html), and C-Eval [here](https://htmlpreview.github.io/?https://github.com/liyucheng09/Contamination_Detector/blob/master/reports/ceval.html)**

If you cannot access the Hugging Face Hub to get the benchmark datasets, download them as JSON files [here](https://github.com/liyucheng09/Contamination_Detector/releases/tag/v0.1.1).

## Citation

Please consider citing our project if you find it helpful:

```
@article{Li2023AnOS,
  title={An Open Source Data Contamination Report for Large Language Models},
  author={Yucheng Li},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.17589},
}
```

## Issues

Open an issue or contact me via email if you encounter any problems.

Provider:
maas
Created:
2024-07-04