Download link (ModelScope community; updated 2025-12-05, indexed 2025-07-26): https://modelscope.cn/datasets/microsoft/MMLU-CF
# MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark
<p align="left">
<a href="https://arxiv.org/pdf/2412.15194"><b>[📜 Paper]</b></a> •
<a href="https://huggingface.co/datasets/microsoft/MMLU-CF"><b>[🤗 HF Dataset]</b></a> •
<a href="https://github.com/microsoft/MMLU-CF"><b>[🐱 GitHub]</b></a>
</p>
MMLU-CF is a contamination-free and more challenging multiple-choice question benchmark. The validation set and the test set each contain 10,000 questions, covering various disciplines.
## 1. The Motivation of MMLU-CF
- Existing benchmarks such as MMLU are open source, and LLMs draw their training data from broad sources. Together, these factors have inevitably led to benchmark contamination and unreliable evaluation results. To alleviate this issue, we propose MMLU-CF.
- (a) An instance of leakage in MMLU: when MMLU questions are used as prompts, certain LLMs reproduce from memory **choices identical to the original ones**. (b) When MMLU-CF questions are used as prompts, LLMs can only guess the choices.
This indicates that the MMLU test set suffers from data contamination and has been memorized by some LLMs, while the proposed MMLU-CF avoids such leakage.
<img src="./Figures/Fig_1_a.png" alt="Fig1_a" width="60%" />
<img src="./Figures/Fig_1_b.png" alt="Fig1_b" width="60%" />
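The leakage check illustrated in the figures can be approximated in code. The following is a minimal sketch, not the authors' procedure: it assumes the model's generated choices have already been collected as a list of strings, and the function names, the normalization, and the threshold are all hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim edge punctuation (assumed normalization)."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.strip(" .,:;")


def choice_overlap(original: List[str], generated: List[str]) -> float:
    """Fraction of the original choices that reappear among the generated ones."""
    if not original:
        return 0.0
    generated_set = {normalize(c) for c in generated}
    hits = sum(1 for c in original if normalize(c) in generated_set)
    return hits / len(original)


def looks_memorized(original: List[str], generated: List[str], threshold: float = 1.0) -> bool:
    """Flag a question when the overlap reaches the (hypothetical) threshold."""
    return choice_overlap(original, generated) >= threshold
```

With a threshold of 1.0, a question is flagged only when every original choice is reproduced, which mirrors case (a); a model that merely guesses plausible choices, as in case (b), would typically score well below that.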
## 2. How to Evaluate Your Models
Please refer to the [MMLU-CF GitHub Page](https://github.com/microsoft/MMLU-CF) for detailed guidance.
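For orientation only, multiple-choice accuracy is usually computed by extracting an option letter from each model output and comparing it with the gold label. The sketch below is not the official evaluation script from the GitHub repository; the answer-extraction patterns, the four-option `A`-`D` format, and the function names are assumptions.

```python
import re
from typing import List, Optional


def extract_choice(output: str) -> Optional[str]:
    """Return the option letter (A-D) stated in a model output, or None."""
    text = output.strip()
    # Prefer an explicit statement such as "The answer is B" or "Answer: (B)".
    m = re.search(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\)?(?![A-Za-z])", text)
    if m:
        return m.group(1)
    # Otherwise accept an output that starts with a bare letter, e.g. "C" or "(D) ...".
    m = re.match(r"\(?([A-D])\)?(?![A-Za-z])", text)
    return m.group(1) if m else None


def accuracy(outputs: List[str], gold: List[str]) -> float:
    """Share of outputs whose extracted letter equals the gold letter."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)
```

Outputs from which no letter can be extracted count as wrong here. That is a design choice of this sketch, and the official scripts may handle such cases differently.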
## 3. Data Construction Pipeline

<img src="./Figures/Fig_3.png" alt="Fig3" width="60%" />
The pipeline consists of five steps:
1. MCQ Collection: gather a diverse set of questions.
2. MCQ Cleaning: ensure question quality.
3. Difficulty Sampling: ensure an appropriate difficulty distribution across questions.
4. LLM Checking: LLMs, including GPT-4o, Gemini, and Claude, review the accuracy and safety of the data.
5. Contamination-Free Processing: prevent data leakage and maintain dataset purity.

The result is MMLU-CF, which consists of 10,000 questions in the closed-source test set and 10,000 in the open-source validation set.
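To make the Difficulty Sampling step (3) more concrete, one common way to enforce a target difficulty distribution is stratified sampling. The sketch below is an illustration under stated assumptions, not the authors' implementation: the `difficulty` field, the bucket labels, and the target shares are hypothetical, and this page does not say how difficulty is actually measured.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_by_difficulty(
    questions: List[dict],
    target_share: Dict[str, float],
    total: int,
    seed: int = 0,
) -> List[dict]:
    """Draw about `total` questions so each difficulty bucket gets its target share."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible

    # Group questions by their (hypothetical) difficulty label.
    buckets = defaultdict(list)
    for q in questions:
        buckets[q["difficulty"]].append(q)

    sampled = []
    for label, share in target_share.items():
        quota = round(total * share)
        pool = buckets.get(label, [])
        if quota > len(pool):
            raise ValueError(f"not enough '{label}' questions: need {quota}, have {len(pool)}")
        # Sample without replacement inside the bucket.
        sampled.extend(rng.sample(pool, quota))
    return sampled
```

For example, calling `sample_by_difficulty(pool, {"easy": 0.2, "medium": 0.5, "hard": 0.3}, total=10)` would return two easy, five medium, and three hard questions, provided the pool holds enough of each.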
## 4. What is the Difference between MMLU-CF and MMLU
MMLU focuses on breadth and reasoning without considering contamination prevention. In MMLU-CF, we collect data from a broader domain and apply three decontamination rules to mitigate unintentional data leakage. In addition, MMLU-CF keeps its test set closed-source to prevent malicious data leakage.
<img src="./Figures/Fig_4.png" alt="Fig4" width="60%" />
## 5. Contact
For any inquiries or concerns, feel free to reach out to us by email: [Qihao Zhao](mailto:qhzhaoo@gmail.com) and [Yangyu Huang](mailto:yanghuan@microsoft.com).
## 6. Citation
```
@misc{zhao2024mmlucfcontaminationfreemultitasklanguage,
title={MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark},
author={Qihao Zhao and Yangyu Huang and Tengchao Lv and Lei Cui and Qinzheng Sun and Shaoguang Mao and Xin Zhang and Ying Xin and Qiufeng Yin and Scarlett Li and Furu Wei},
year={2024},
eprint={2412.15194},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.15194},
}
```
## 7. License
This dataset is licensed under the [CDLA-Permissive-2.0](https://cdla.dev/permissive-2-0/) license.
Provider: maas
Created: 2025-07-22



