MMLU-Pro
Source: ModelScope community (updated 2026-05-17, indexed 2024-06-01)

Download link: https://modelscope.cn/datasets/TIGER-Lab/MMLU-Pro

# MMLU-Pro Dataset
The MMLU-Pro dataset is a more **robust** and **challenging** massive multi-task understanding dataset designed to benchmark the capabilities of large language models more rigorously. It contains about 12K complex questions across a range of disciplines.
|[**Github**](https://github.com/TIGER-AI-Lab/MMLU-Pro) | [**🏆Leaderboard**](https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro) | [**📖Paper**](https://arxiv.org/abs/2406.01574) |
## 🚀 What's New
- **\[2026.03.11\]** Added more cutting-edge frontier models to the leaderboard, including the Claude-4.6 series, Seed2.0 series, Qwen3.5 series, and Gemini-3.1-Pro, among others. Stay tuned for updated rankings and analysis.
- **\[2026.01.18\]** Fixed a leading-space issue in answer options (affecting chemistry, physics, and other STEM subsets). This formatting inconsistency could have been exploited as a shortcut. Thanks to @giffmana and @fujikanaeda for identifying it.
- **\[2025.10.25\]** Posted a consolidated note on Health-category issues and minor category updates (does not change overall micro-averaged scores; may slightly affect per-category metrics, mainly Health/Psychology). See details: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/discussions/36. Special thanks to @mkieffer for the professional and meticulous review.
- **\[2025.04.06\]** We corrected 15 answers in the medical domain based on the recommendations of medical professionals. Thanks to Dr. Robert (Bob) Hoyt and the subspecialists from the Mayo Clinic and INOVA.
- **\[2024.10.16\]** We have added Gemini-1.5-Flash-002, Gemini-1.5-Pro-002, Jamba-1.5-Large, Llama-3.1-Nemotron-70B-Instruct-HF and Ministral-8B-Instruct-2410 to our leaderboard.
- **\[2024.09.07\]** We have added Reflection-Llama-3.1-70B, Phi-3.5-mini-instruct and Grok-2 to our leaderboard.
- **\[2024.09.06\]** We corrected some errors with IDs 5457, 2634, 2817, 1289, 2394, and 7063.
- **\[2024.08.07\]** We corrected some errors in the math and engineering disciplines with IDs 7780, 8015, 8410, 8618, etc.
- **\[2024.07.20\]** We have added GPT-4o-mini and Mathstral-7B-v0.1 to our leaderboard.
- **\[2024.07.18\]** We have corrected some typos caused by dropped LaTeX escapes, such as `\nrac` -> `\n\\frac` and `\nactorial` -> `\n\\factorial`.
- **\[2024.07.11\]** MMLU-Pro was ingested into Airtrain; check out this [**dataset explorer**](https://app.airtrain.ai/dataset/290ba84d-da8b-4358-9cf4-9e51506faa80/null/1/0). Thanks to Emmanuel for sharing!
- **\[2024.07.08\]** We have corrected the answer for the question with ID 6392 from D to B.
- **\[2024.07.06\]** We have added Gemma-2-9B, Gemma-2-9B-it, DeepSeek-Coder-V2-Lite-Base, and DeepSeek-Coder-V2-Lite-Instruct to our leaderboard.
- **\[2024.07.05\]** We have corrected the answer for the question with ID 143 from A to I.
## 1. What's the difference between MMLU-Pro and MMLU?
Compared to the original MMLU, there are three major differences:
- The original MMLU dataset offers only 4 options per question; MMLU-Pro increases this to 10. The additional options make the evaluation more realistic and challenging, and random guessing yields a much lower score.
- The original MMLU dataset contains mostly knowledge-driven questions that require little reasoning, so perplexity-based (PPL) results are normally better than chain-of-thought (CoT) results. In our dataset, we increase the problem difficulty and integrate more reasoning-focused problems. In MMLU-Pro, CoT can be 20% higher than PPL.
- By increasing the number of distractors, we significantly reduce the probability of a correct guess by chance, which boosts the benchmark's robustness. Specifically, with 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro.
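
To make the first point concrete, the expected accuracy of uniform random guessing over k options is 1/k. The sketch below assumes every question has exactly k equally likely options, which is a simplification because some MMLU-Pro questions have fewer than ten; the function name is illustrative and not part of any MMLU-Pro tooling.

```python
from fractions import Fraction


def random_guess_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return Fraction(1, num_options)


# Original MMLU uses 4 options; MMLU-Pro typically uses 10.
mmlu_baseline = random_guess_accuracy(4)
mmlu_pro_baseline = random_guess_accuracy(10)

print(f"MMLU random baseline:     {float(mmlu_baseline):.2%}")
print(f"MMLU-Pro random baseline: {float(mmlu_pro_baseline):.2%}")
```

Under this assumption the chance-level floor falls from 25% to 10%, so a score just above 25% no longer sits at chance level.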

## 2. Dataset Summary
- **Questions and Options:** Each question within the dataset typically has **ten** multiple-choice options, except for some that were reduced during the manual review process to remove unreasonable choices. This increase from the original **four** options per question is designed to enhance complexity and robustness, necessitating deeper reasoning to discern the correct answer among a larger pool of potential distractors.
- **Sources:** The dataset consolidates questions from several sources:
  - **Original MMLU Questions:** Part of the dataset comes from the original MMLU dataset. We remove the trivial and ambiguous questions.
  - **STEM Website:** Hand-picked high-quality STEM problems from the Internet.
  - **TheoremQA:** High-quality human-annotated questions requiring theorems to solve.
  - **SciBench:** Science questions from college exams.
- **Disciplines Covered by the Newly Added Data:** The subjects that have been enhanced with questions from the STEM Website, TheoremQA, and SciBench are biology, business, chemistry, computer science, economics, engineering, math, physics, and psychology.

| Discipline | Number of Questions | From Original MMLU | Newly Added |
|:------------------|:--------------------|:-------------------|:------------|
| Math | 1351 | 846 | 505 |
| Physics | 1299 | 411 | 888 |
| Chemistry | 1132 | 178 | 954 |
| Law | 1101 | 1101 | 0 |
| Engineering | 969 | 67 | 902 |
| Other | 924 | 924 | 0 |
| Economics | 844 | 444 | 400 |
| Health | 818 | 818 | 0 |
| Psychology | 798 | 493 | 305 |
| Business | 789 | 155 | 634 |
| Biology | 717 | 219 | 498 |
| Philosophy | 499 | 499 | 0 |
| Computer Science | 410 | 274 | 136 |
| History | 381 | 381 | 0 |
| **Total** | **12032** | 6810 | 5222 |
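
As a sanity check on the table above, the per-discipline counts can be re-added to confirm the totals. The snippet is a minimal sketch that hard-codes the numbers from the table rather than reading the dataset; the variable and function names are illustrative.

```python
# (discipline, total, from_original_mmlu, newly_added), copied from the table above.
ROWS = [
    ("Math", 1351, 846, 505),
    ("Physics", 1299, 411, 888),
    ("Chemistry", 1132, 178, 954),
    ("Law", 1101, 1101, 0),
    ("Engineering", 969, 67, 902),
    ("Other", 924, 924, 0),
    ("Economics", 844, 444, 400),
    ("Health", 818, 818, 0),
    ("Psychology", 798, 493, 305),
    ("Business", 789, 155, 634),
    ("Biology", 717, 219, 498),
    ("Philosophy", 499, 499, 0),
    ("Computer Science", 410, 274, 136),
    ("History", 381, 381, 0),
]


def column_totals(rows):
    """Check each row is internally consistent, then sum the three count columns."""
    for name, total, original, added in rows:
        if original + added != total:
            raise ValueError(f"inconsistent row: {name}")
    return tuple(sum(row[i] for row in rows) for i in (1, 2, 3))


total, from_mmlu, newly_added = column_totals(ROWS)
print(total, from_mmlu, newly_added)  # 12032 6810 5222
print(f"Share of newly added questions: {newly_added / total:.1%}")
```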

## 3. Dataset Construction

- **Initial Filtering:** The construction process began with a comprehensive review of the original MMLU dataset to identify and retain only those questions that meet a higher threshold of difficulty and relevance.
- **Question Collection and Integration:** Additional questions were carefully selected from STEM websites, TheoremQA, and SciBench based on their ability to challenge the analytical capabilities of advanced models. The selection criteria focused on the complexity of the problems and the quality of the questions.
- **Option Augmentation:** To further enhance the dataset, we employed GPT-4 to augment the number of choices per question from **four** to **ten**. This process was not merely about adding more options but involved generating plausible distractors that require discriminative reasoning to navigate.
- **Expert Review:** Each question and its associated options underwent rigorous scrutiny by a panel of over ten experts. These experts ensured that the questions were not only challenging and comprehensive but also accurate and fair. This step was crucial to maintain the integrity and utility of the dataset as a benchmarking tool.
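
Because most questions end up with ten options but some keep fewer after review, code that renders a question should not hard-code ten letters. The following is a minimal sketch of letter-labelled rendering; the function name, the prompt layout, and the sample question are all made up for illustration and are not taken from the dataset or its evaluation scripts.

```python
import string
from typing import List


def format_question(question: str, options: List[str]) -> str:
    """Render a question with options labelled A, B, C, ... (at most ten)."""
    if not 1 <= len(options) <= 10:
        raise ValueError("expected between 1 and 10 options")
    lines = [f"Question: {question.strip()}", "Options:"]
    for letter, option in zip(string.ascii_uppercase, options):
        # Strip stray whitespace so formatting cannot leak hints about the answer.
        lines.append(f"{letter}. {option.strip()}")
    return "\n".join(lines)


# Hypothetical example with only three options.
sample = format_question(
    "Which quantity is conserved in an elastic collision?",
    [" Kinetic energy", "Temperature", "Volume"],
)
print(sample)
```

Stripping whitespace from each option mirrors the kind of formatting inconsistency that a model could otherwise exploit as a shortcut.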
## 4. Leaderboard
For the updated leaderboard, please refer to https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro. You can submit your evaluation there. Some of the results were run by us, while others were obtained by third parties. We normally use 5-shot prompting; some models, such as Gemini, are evaluated 0-shot.
If you want to reproduce our results, please check out https://github.com/TIGER-AI-Lab/MMLU-Pro for the evaluation scripts. We also cache our model predictions in https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main/eval_results.
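
The evaluation scripts in the repository are the authoritative reference for scoring. Purely as an illustration of how a chain-of-thought response might be turned into a score, the sketch below assumes the model ends its reasoning with a phrase like "the answer is (X)"; the regular expression, function names, and sample responses are assumptions for this example, not the official implementation.

```python
import re
from typing import List, Optional

# Assumed answer format: "... the answer is (X)" with X in A-J.
# The lookahead avoids matching the first letter of an ordinary word.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\)?(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Made-up responses for illustration only.
responses = [
    "Momentum is conserved, so the answer is (C).",
    "I am not sure which option is right.",
]
print(accuracy(responses, ["C", "A"]))
```

In this sketch a response with no extractable letter counts as wrong; other policies, such as falling back to a random option, are possible and would change the reported number.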
## 5. CoT vs Direct Evaluation
Unlike the original MMLU, which favors PPL evaluation, MMLU-Pro requires CoT reasoning to achieve better results.

| Models | Prompting | Overall | Biology | Business | Chemistry | Computer Science | Economics | Engineering | Health | History | Law | Math | Philosophy | Physics | Psychology | Other |
|:-------|:----------|:--------|:--------|:---------|:----------|:-----------------|:----------|:------------|:-------|:--------|:-------|:-------|:-----------|:--------|:-----------|:-------|
| GPT-4o | CoT | 0.7255 | 0.8675 | 0.7858 | 0.7393 | 0.7829 | 0.808 | 0.55 | 0.7212 | 0.7007 | 0.5104 | 0.7609 | 0.7014 | 0.7467 | 0.7919 | 0.7748 |

The non-CoT results are reported in the following table. As you can see, overall accuracy drops by about 19 percentage points (from 0.7255 to 0.5346) without chain-of-thought reasoning. This reflects the challenging nature of our dataset.

| Models | Prompting | Overall | Biology | Business | Chemistry | Computer Science | Economics | Engineering | Health | History | Law | Math | Philosophy | Physics | Psychology | Other |
|:-------|:----------|:--------|:--------|:---------|:----------|:-----------------|:----------|:------------|:-------|:--------|:-------|:-------|:-----------|:--------|:-----------|:-------|
| GPT-4o | Direct | 0.5346 | 0.8102 | 0.392 | 0.3447 | 0.5813 | 0.6899 | 0.3981 | 0.6933 | 0.6949 | 0.542 | 0.3427 | 0.6614 | 0.3971 | 0.7628 | 0.6391 |
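
The gap between the two prompting modes can be broken down per category from the GPT-4o rows above. This is a small sketch that hard-codes the reported scores; the variable names are illustrative.

```python
CATEGORIES = [
    "Overall", "Biology", "Business", "Chemistry", "Computer Science",
    "Economics", "Engineering", "Health", "History", "Law", "Math",
    "Philosophy", "Physics", "Psychology", "Other",
]
# GPT-4o scores copied from the CoT and Direct tables above.
COT = [0.7255, 0.8675, 0.7858, 0.7393, 0.7829, 0.808, 0.55, 0.7212,
       0.7007, 0.5104, 0.7609, 0.7014, 0.7467, 0.7919, 0.7748]
DIRECT = [0.5346, 0.8102, 0.392, 0.3447, 0.5813, 0.6899, 0.3981, 0.6933,
          0.6949, 0.542, 0.3427, 0.6614, 0.3971, 0.7628, 0.6391]

# Positive gap means CoT scored higher than direct answering.
gaps = {
    name: round(cot - direct, 4)
    for name, cot, direct in zip(CATEGORIES, COT, DIRECT)
}

per_category = {k: v for k, v in gaps.items() if k != "Overall"}
largest = max(per_category, key=per_category.get)
smallest = min(per_category, key=per_category.get)

print(f"Overall gap: {gaps['Overall']:.4f}")
print(f"Largest gap: {largest} ({per_category[largest]:.4f})")
print(f"Smallest gap: {smallest} ({per_category[smallest]:.4f})")
```

For these numbers the gain from CoT is concentrated in calculation-heavy subjects such as math, chemistry, business, and physics, while law is the one category where direct answering scores slightly higher.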
## 6. MMLU vs. MMLU-Pro Results
| Models | Original MMLU Score | MMLU-Pro Score | Drop |
|:------------------------------|:--------------------|:---------------|:-----------|
| GPT-4o | 0.887 | 0.7255 | 0.1615 |
| Claude-3-Opus | 0.868 | 0.6845 | 0.1835 |
| Claude-3-Sonnet | 0.815 | 0.5511 | 0.2639 |
| Gemini 1.5 Flash | 0.789 | 0.5912 | 0.1978 |
| Llama-3-70B-Instruct          | 0.820               | 0.5620         | 0.258      |

We can observe that some models, such as GPT-4o, drop by only about 16 percentage points, while others, such as Mixtral-8x7B (not listed in the table above), drop by more than 30.
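
The Drop column is the absolute difference between the two scores. The sketch below recomputes it from the table above and also derives the relative drop, which is not reported in the original table and is added here only for illustration.

```python
# (model, original MMLU score, MMLU-Pro score, reported drop), copied from the table above.
RESULTS = [
    ("GPT-4o", 0.887, 0.7255, 0.1615),
    ("Claude-3-Opus", 0.868, 0.6845, 0.1835),
    ("Claude-3-Sonnet", 0.815, 0.5511, 0.2639),
    ("Gemini 1.5 Flash", 0.789, 0.5912, 0.1978),
    ("Llama-3-70B-Instruct", 0.820, 0.5620, 0.258),
]


def absolute_drop(mmlu: float, mmlu_pro: float) -> float:
    """Score difference between original MMLU and MMLU-Pro."""
    return round(mmlu - mmlu_pro, 4)


for model, mmlu, mmlu_pro, reported in RESULTS:
    drop = absolute_drop(mmlu, mmlu_pro)
    relative = drop / mmlu  # fraction of the original MMLU score that is lost
    status = "ok" if abs(drop - reported) < 1e-9 else "MISMATCH"
    print(f"{model:<22} drop={drop:.4f} relative={relative:.1%} [{status}]")
```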
## 7. Dataset Maintenance
There are mistakes in the dataset. If you find any, please post the question_id on the issue page and we will correct it accordingly. Our team is committed to maintaining this dataset in the long run to ensure its quality!

Provider: maas

Created: 2024-05-29



