mmlu-redux

ModelScope community (魔搭社区), updated 2026-05-21, indexed 2025-03-29
Download link:
https://modelscope.cn/datasets/AI-ModelScope/mmlu-redux
Resource description:
# Dataset Card for MMLU-Redux

> [!TIP]
> Please consider using [MMLU-Redux-2.0](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0), which contains all 57 MMLU subjects.

MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.

## News

- [2025.02.08] We corrected one annotation in the High School Mathematics subset, as noted in the [PlatinumBench paper](https://arxiv.org/abs/2502.03461).
- [2025.01.23] MMLU-Redux was accepted to NAACL 2025!

## Dataset Details

### Dataset Description

Each data point in MMLU-Redux contains seven columns:

- **question** (`str`): The original MMLU question.
- **choices** (`List[str]`): The original list of four choices associated with the question from the MMLU dataset.
- **answer** (`int`): The MMLU ground truth label in the form of an array index between 0 and 3.
- **error_type** (`str`): The annotated error type. The value is one of the six error types proposed in the taxonomy ("ok", "bad_question_clarity", "bad_options_clarity", "no_correct_answer", "multiple_correct_answers", "wrong_groundtruth") or "expert".
- **source** (`str`): The potential source of the question.
- **correct_answer** (`str`): In the case of "no_correct_answer" and "wrong_groundtruth", the annotators can suggest the alternative correct answer.
- **potential_reason** (`str`): A free-text column for the annotators to note what they believe caused the error.

The question, choices, and answer columns are taken from [cais/mmlu](https://huggingface.co/datasets/cais/mmlu).
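The seven columns described above can be sketched as a small typed record. This is only an illustration: the `ReduxRecord` class and the `ERROR_TYPES` set are hypothetical names, not part of the dataset's tooling, and the sketch assumes unused text columns are empty strings (the published data may represent them differently).

```python
from dataclasses import dataclass
from typing import List

# Values the card lists for the error_type column.
ERROR_TYPES = {
    "ok",
    "bad_question_clarity",
    "bad_options_clarity",
    "no_correct_answer",
    "multiple_correct_answers",
    "wrong_groundtruth",
    "expert",
}


@dataclass
class ReduxRecord:
    """One MMLU-Redux data point (hypothetical container for illustration)."""

    question: str
    choices: List[str]
    answer: int
    error_type: str
    source: str
    correct_answer: str
    potential_reason: str

    def __post_init__(self) -> None:
        # The card states four choices and a label index between 0 and 3.
        if len(self.choices) != 4:
            raise ValueError("expected exactly four choices")
        if not 0 <= self.answer <= 3:
            raise ValueError("answer must be an index between 0 and 3")
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error_type: {self.error_type!r}")


# Made-up example row; it does not come from the dataset.
example = ReduxRecord(
    question="What is 2 + 2?",
    choices=["3", "4", "5", "6"],
    answer=1,
    error_type="ok",
    source="",
    correct_answer="",
    potential_reason="",
)
print(example.choices[example.answer])
```

Validating in `__post_init__` keeps malformed rows from silently entering an analysis, which may be useful when rows are assembled from several sources.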
- **Dataset Repository:** https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux
- **Code Repository:** https://github.com/aryopg/mmlu-redux
- **Alternative Dataset Repository:** https://zenodo.org/records/11624987
- **Paper:** https://arxiv.org/abs/2406.04127
- **Curated by:** Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
- **Language(s) (NLP):** English
- **License:** CC-BY-4.0

### Taxonomy

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644f895e23d7eb05ca695054/ChI5KZPPnkRQv1olPifef.png)

We develop a hierarchical taxonomy to classify the various errors identified in MMLU into specific error types. This figure illustrates our taxonomy for categorising MMLU errors. We categorise errors into two primary groups: samples with errors in the clarity of the question or its options (Type 1, Question Assessment) and samples with errors in the ground truth answer (Type 2, Ground Truth Verification). Type 1 covers Bad Question Clarity and Bad Options Clarity, while Type 2 is divided into three more fine-grained error types.

Question Assessment (Type 1):

- **(1a) Bad Question Clarity:** The question is poorly presented in terms of clarity, grammar, or sufficiency of information. For instance, it refers to a previous question.
- **(1b) Bad Options Clarity:** The options are unclear, similar, or irrelevant to the question. Most errors in this category stem from incorrect parsing of the options from the original source. For example, a single option might be incorrectly split into two separate options.

Ground Truth Verification (Type 2):

- **(2a) No Correct Answer:** None of the options correctly answers the question. This error can arise, for example, when the ground-truth option is omitted to reduce the number of options from five to four.
- **(2b) Multiple Correct Answers:** More than one option can be selected as the answer to the question. For example, the options contain a synonym of the ground truth label.
- **(2c) Wrong Ground Truth:** The correct answer differs from the ground truth provided in MMLU. This type of error occurs when the annotated label differs from the correct label, which may be caused by a mistake during manual annotation.

### Dataset Sources

The data used to create MMLU-Redux was obtained from [cais/mmlu](https://huggingface.co/datasets/cais/mmlu), which is also utilised in the [lm-eval-harness framework](https://github.com/EleutherAI/lm-evaluation-harness). To ensure uniformity of our results, the language model (LM) predictions used in our performance analyses were obtained from the [Holistic Evaluation of Language Models (HELM) leaderboard v1.3.0, released on May 15th, 2024](https://crfm.stanford.edu/helm/mmlu/v1.3.0/).

We selected 30 MMLU subjects. We first chose the 20 subjects with the lowest state-of-the-art accuracy scores on the HELM leaderboard. These subjects are College Mathematics, Virology, College Chemistry, High School Mathematics, Abstract Algebra, Global Facts, Formal Logic, High School Physics, Professional Law, Machine Learning, High School Chemistry, Econometrics, Professional Accounting, College Physics, Anatomy, College Computer Science, High School Statistics, Electrical Engineering, Public Relations, and College Medicine. Since there were multiple subjects related to mathematics, we randomly omitted one (Abstract Algebra) and replaced it with the next worst-performing non-mathematical subject (Business Ethics). The remaining 10 subjects were selected randomly without considering performance: Human Aging, High School Macroeconomics, Clinical Knowledge, Logical Fallacies, Philosophy, Conceptual Physics, High School US History, Miscellaneous, High School Geography, and Astronomy.
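The subject selection described above can be reproduced as a short sketch that checks the arithmetic: 20 low-accuracy subjects with one swap, plus 10 random ones, gives 30. The subject names come from the text; the `select_subjects` helper is a hypothetical name, and placing the replacement at the dropped subject's position is an arbitrary choice since the order carries no meaning here.

```python
# The 20 subjects with the lowest HELM accuracy, as listed in the card.
LOWEST_20 = [
    "College Mathematics", "Virology", "College Chemistry", "High School Mathematics",
    "Abstract Algebra", "Global Facts", "Formal Logic", "High School Physics",
    "Professional Law", "Machine Learning", "High School Chemistry", "Econometrics",
    "Professional Accounting", "College Physics", "Anatomy", "College Computer Science",
    "High School Statistics", "Electrical Engineering", "Public Relations", "College Medicine",
]

# The 10 subjects the card says were chosen at random.
RANDOM_10 = [
    "Human Aging", "High School Macroeconomics", "Clinical Knowledge", "Logical Fallacies",
    "Philosophy", "Conceptual Physics", "High School US History", "Miscellaneous",
    "High School Geography", "Astronomy",
]


def select_subjects(lowest, dropped, replacement, extra):
    """Swap one subject out of the low-accuracy list, then append the extra subjects."""
    swapped = [replacement if subject == dropped else subject for subject in lowest]
    return swapped + list(extra)


SUBJECTS = select_subjects(LOWEST_20, "Abstract Algebra", "Business Ethics", RANDOM_10)
print(len(SUBJECTS))  # 30
```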
We randomly subsampled 100 questions per MMLU subject to be presented to the annotators. The annotators were instructed to follow the introduced taxonomy by first assessing the question presentation, and then verifying the ground truth MMLU label. The annotators were encouraged to perform an exact match search using a search engine to find occurrences of the question and multiple-choice options from credible sources. If the annotators found an exact match of the question-options pair, they were asked to evaluate the answer provided by the source. Regardless of whether a label was found in the source, and whether or not it matched the MMLU label, the annotators were asked to decide whether they would follow the label using their expertise. In cases where an exact match was not found, the annotators were asked to search for supporting evidence from trusted sources, such as government websites, textbooks, and/or other reputable organisations (*e.g., World Health Organisation (WHO)*). In cases where the annotators were still unsure, they were asked to annotate the question with "Expert", denoting that the question requires more expertise.

MMLU-Redux comprises subsampled test splits of the aforementioned thirty MMLU subsets.

## Uses

To reproduce our results or perform analyses similar to those presented in this study, the user may download the data and utilise all the columns. MMLU-Redux contains both correct and erroneous instances, so the user should look at the value in the "error_type" column to filter samples based on the specific error type. In cases where the error is "no_correct_answer", "multiple_correct_answers" or "wrong_groundtruth", the user may examine the suggested answer reported in the "correct_answer" column.
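The filtering described above might look like the following minimal sketch over rows held as plain dictionaries. Only the column names and error-type values come from the card; the helper names `filter_by_error_type` and `suggested_answer` are hypothetical, the sample rows are made up, and the format of the `correct_answer` values shown here is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

Row = Dict[str, object]

# Error types for which the card says a suggested answer may be recorded.
ANSWER_ERRORS = {"no_correct_answer", "multiple_correct_answers", "wrong_groundtruth"}


def filter_by_error_type(rows: Iterable[Row], wanted: Iterable[str]) -> List[Row]:
    """Keep only the rows whose error_type is one of the wanted values."""
    wanted_set = set(wanted)
    return [row for row in rows if row["error_type"] in wanted_set]


def suggested_answer(row: Row) -> Optional[str]:
    """Return the annotators' suggested answer, or None if there is none to use."""
    if row["error_type"] in ANSWER_ERRORS and row.get("correct_answer"):
        return str(row["correct_answer"])
    return None


# Made-up rows for illustration; they do not come from the dataset.
rows = [
    {"question": "Q1", "answer": 0, "error_type": "ok", "correct_answer": ""},
    {"question": "Q2", "answer": 2, "error_type": "wrong_groundtruth", "correct_answer": "1"},
    {"question": "Q3", "answer": 1, "error_type": "bad_question_clarity", "correct_answer": ""},
    {"question": "Q4", "answer": 3, "error_type": "no_correct_answer", "correct_answer": ""},
]

clean = filter_by_error_type(rows, ["ok"])
counts = Counter(row["error_type"] for row in rows)
print(len(clean), counts["wrong_groundtruth"], suggested_answer(rows[1]))
```

Returning `None` when no suggestion is recorded keeps the caller from mistaking an empty field for an answer, since the card says annotators *can* suggest an alternative rather than that they always do.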
The user should consider that the questions and the options reported are the same as those in the MMLU dataset, and they have not been modified even when affected by bad clarity.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644f895e23d7eb05ca695054/CXuAtMrd1odrSFhHGuIxO.png)

## Citation

**BibTeX:**

```
@misc{gema2024mmlu,
  title={Are We Done with MMLU?},
  author={Aryo Pradipta Gema and Joshua Ong Jun Leang and Giwon Hong and Alessio Devoto and Alberto Carlo Maria Mancino and Rohit Saxena and Xuanli He and Yu Zhao and Xiaotang Du and Mohammad Reza Ghasemi Madani and Claire Barale and Robert McHardy and Joshua Harris and Jean Kaddour and Emile van Krieken and Pasquale Minervini},
  year={2024},
  eprint={2406.04127},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Dataset Card Contact

- aryo.gema@ed.ac.uk
- p.minervini@ed.ac.uk

Provider:
maas
Created:
2025-03-25