mmlu-redux-2.0

Name: mmlu-redux-2.0
Creator: maas
Published: 2026-05-22 22:55:53
License: no description provided

ModelScope community · updated 2026-05-22 · indexed 2025-03-29

Download link:

https://modelscope.cn/datasets/AI-ModelScope/mmlu-redux-2.0

Resource description:

# Dataset Card for MMLU-Redux-2.0

MMLU-Redux is a subset of 5,700 manually re-annotated questions across 57 MMLU subjects.

## News

- [2025.02.25] We corrected one annotation in the Abstract Algebra subset, as noted in Issue [#2](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0/discussions/2).
- [2025.02.08] We corrected one annotation in the High School Mathematics subset, as noted in the [PlatinumBench paper](https://arxiv.org/abs/2502.03461).
- [2025.01.23] MMLU-Redux is accepted to NAACL 2025!

## Dataset Details

### Dataset Description

Each data point in MMLU-Redux contains seven columns:

- **question** (`str`): The original MMLU question.
- **choices** (`List[str]`): The original list of four choices associated with the question from the MMLU dataset.
- **answer** (`int`): The MMLU ground truth label in the form of an array index between 0 and 3.
- **error_type** (`str`): The annotated error type. The value is either one of the six error types proposed in the taxonomy ("ok", "bad_question_clarity", "bad_options_clarity", "no_correct_answer", "multiple_correct_answers", "wrong_groundtruth") or "expert".
- **source** (`str`): The potential source of the question.
- **correct_answer** (`str`): For "no_correct_answer" and "wrong_groundtruth", the annotators can suggest an alternative correct answer.
- **potential_reason** (`str`): A free-text column for the annotators to note what they believe to have caused the error.

The question, choices, and answer columns are taken from [cais/mmlu](https://huggingface.co/datasets/cais/mmlu).
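The column description above can be turned into a small sanity check. The following is a minimal sketch, not part of the official tooling: the helper name `schema_problems` and the example row are hypothetical, and it assumes the three free-text annotation columns may be `None` when an annotator left them empty, which the card does not state.

```python
from typing import Any, Dict, List

# error_type values as listed in the dataset card.
ERROR_TYPES = {
    "ok", "bad_question_clarity", "bad_options_clarity",
    "no_correct_answer", "multiple_correct_answers", "wrong_groundtruth",
    "expert",
}
REQUIRED_TEXT = ("question", "error_type")
# Assumption: these annotation columns may be None when left empty.
OPTIONAL_TEXT = ("source", "correct_answer", "potential_reason")


def schema_problems(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    problems: List[str] = []
    for name in REQUIRED_TEXT:
        if not isinstance(row.get(name), str):
            problems.append(f"{name} should be a str")
    for name in OPTIONAL_TEXT:
        if name not in row:
            problems.append(f"{name} column is missing")
        elif row[name] is not None and not isinstance(row[name], str):
            problems.append(f"{name} should be a str or None")
    choices = row.get("choices")
    if not (isinstance(choices, list) and len(choices) == 4
            and all(isinstance(c, str) for c in choices)):
        problems.append("choices should be a list of four str")
    answer = row.get("answer")
    if not (isinstance(answer, int) and 0 <= answer <= 3):
        problems.append("answer should be an int between 0 and 3")
    if row.get("error_type") not in ERROR_TYPES:
        problems.append("error_type is not one of the documented values")
    return problems


# Hypothetical row, invented for illustration (not taken from the dataset).
example = {
    "question": "What is 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,
    "error_type": "ok",
    "source": "",
    "correct_answer": "",
    "potential_reason": "",
}
print(schema_problems(example))  # → []
```

Running the same check over every row after download is a cheap way to notice if a mirror of the dataset has altered column names or types.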

- **Dataset Repository:** https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0
- **Code Repository:** https://github.com/aryopg/mmlu-redux
- **Alternative Dataset Repository:** https://zenodo.org/records/11624987
- **Paper:** https://arxiv.org/abs/2406.04127
- **Curated by:** Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Rohit Saxena, Alessio Devoto, Alberto Carlo Maria Mancino, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
- **Language(s) (NLP):** English
- **License:** CC-BY-4.0

### Taxonomy

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644f895e23d7eb05ca695054/ChI5KZPPnkRQv1olPifef.png)

We develop a hierarchical taxonomy to classify the various errors identified in MMLU into specific error types. This figure illustrates our taxonomy for categorising MMLU errors. We categorise errors into two primary groups: samples with errors in the clarity of the questions (Type 1, Question Assessment) and samples with errors in the ground truth answer (Type 2, Ground Truth Verification). Type 1 covers Bad Question Clarity and Bad Options Clarity, while Type 2 is further divided into more fine-grained error types.

Question Assessment (Type 1):

- **(1a) Bad Question Clarity:** The question is poorly presented in terms of aspects such as clarity, grammar, and sufficiency of information; for instance, a question that refers to a previous question.
- **(1b) Bad Options Clarity:** The options are unclear, similar, or irrelevant to the question. Most errors in this category stem from incorrect parsing of the options from the original source. For example, a single option might be incorrectly split into two separate options.

Ground Truth Verification (Type 2):

- **(2a) No Correct Answer:** None of the options correctly answer the question.
  This error can arise, for example, when the ground-truth option is omitted while reducing the number of options from five to four.
- **(2b) Multiple Correct Answers:** More than one option can be selected as the answer to the question. For example, the options contain a synonym of the ground truth label.
- **(2c) Wrong Ground Truth:** The correct answer differs from the ground truth provided in MMLU. This type of error occurs when the annotated label differs from the correct label, which may be caused by a mistake during manual annotation.

### Dataset Sources

The data used to create MMLU-Redux was obtained from [cais/mmlu](https://huggingface.co/datasets/cais/mmlu), which is also utilised in the [lm-eval-harness framework](https://github.com/EleutherAI/lm-evaluation-harness). To ensure uniformity of our results, the language model (LM) predictions used in our performance analyses were obtained from the [Holistic Evaluation of Language Models (HELM) leaderboard v1.3.0, released on May 15th, 2024](https://crfm.stanford.edu/helm/mmlu/v1.3.0/).

We randomly subsampled 100 questions per MMLU subject to be presented to the annotators. The annotators were instructed to follow the introduced taxonomy by first assessing the question presentation and then verifying the ground truth MMLU label. The annotators were encouraged to perform an exact match search using a search engine to find occurrences of the question and multiple-choice options from credible sources. If the annotators found an exact match of the question-options pair, they were asked to evaluate the answer provided by the source. Regardless of whether a label was found in the source, and whether or not it matched the MMLU label, the annotators were asked to decide whether they would follow the label using their expertise.
In cases where an exact match was not found, the annotators were asked to search for supporting evidence from trusted sources, such as government websites, textbooks, and/or other reputable organisations (*e.g., World Health Organisation (WHO)*). In cases where the annotators were still unsure, they were asked to annotate the question with "Expert", denoting that the question requires more expertise.

MMLU-Redux comprises subsampled test splits of the 57 MMLU subjects.

## Uses

To reproduce our results or perform analyses similar to those presented in this study, the user may download the data and utilise all the columns. MMLU-Redux contains both correct and erroneous instances, so the user should look at the value in the "error_type" column to filter samples based on the specific error type. In cases where the error is "no_correct_answer", "multiple_correct_answers" or "wrong_groundtruth", the user may examine the suggested answer reported in the "correct_answer" column. The user should bear in mind that the questions and options reported are the same as those in the MMLU dataset; they have not been modified even when affected by bad clarity.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644f895e23d7eb05ca695054/CXuAtMrd1odrSFhHGuIxO.png)

## Citation

**BibTeX:**

```
@misc{gema2024mmlu,
  title={Are We Done with MMLU?},
  author={Aryo Pradipta Gema and Joshua Ong Jun Leang and Giwon Hong and Alessio Devoto and Alberto Carlo Maria Mancino and Rohit Saxena and Xuanli He and Yu Zhao and Xiaotang Du and Mohammad Reza Ghasemi Madani and Claire Barale and Robert McHardy and Joshua Harris and Jean Kaddour and Emile van Krieken and Pasquale Minervini},
  year={2024},
  eprint={2406.04127},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Dataset Card Contact

- aryo.gema@ed.ac.uk
- p.minervini@ed.ac.uk
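The filtering workflow described under "Uses" might look roughly like the sketch below. It works on plain dictionaries so it runs without downloading anything; the helper names and the three rows are hypothetical, invented for illustration. Because the card does not specify the format of "correct_answer" (an index or free text), the sketch returns it as an unparsed string.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]

# Error types for which the "Uses" section says a suggested answer may exist.
HAS_SUGGESTION = {"no_correct_answer", "multiple_correct_answers", "wrong_groundtruth"}


def count_error_types(rows: Iterable[Row]) -> Counter:
    """Tally how many rows carry each error_type value."""
    return Counter(row["error_type"] for row in rows)


def keep_only(rows: Iterable[Row], error_type: str) -> List[Row]:
    """Keep the rows annotated with one specific error_type."""
    return [row for row in rows if row["error_type"] == error_type]


def suggested_answer(row: Row) -> Optional[str]:
    """Return the annotators' suggestion as raw text, or None if absent."""
    if row["error_type"] not in HAS_SUGGESTION:
        return None
    text = (row.get("correct_answer") or "").strip()
    return text or None


# Hypothetical rows, invented for illustration (not taken from the dataset).
rows = [
    {"question": "Q1", "choices": ["a", "b", "c", "d"], "answer": 1,
     "error_type": "ok", "correct_answer": ""},
    {"question": "Q2", "choices": ["a", "b", "c", "d"], "answer": 0,
     "error_type": "wrong_groundtruth", "correct_answer": "2"},
    {"question": "Q3", "choices": ["a", "b", "c", "d"], "answer": 3,
     "error_type": "no_correct_answer", "correct_answer": ""},
]

clean = keep_only(rows, "ok")  # rows the annotators found no problem with
print(len(clean), suggested_answer(rows[1]))  # → 1 2
```

Restricting an evaluation to the rows returned by `keep_only(rows, "ok")` is one way to score a model only on questions the annotators judged error-free.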


Provider:

maas

Created:

2025-03-25
