mmlu-redux-2.0 | Natural Language Processing Dataset | Multi-Discipline Language Understanding Dataset

ModelScope community. Updated 2025-05-30. Indexed 2025-03-29.

Natural language processing

Multi-discipline language understanding

Download link:

https://modelscope.cn/datasets/AI-ModelScope/mmlu-redux-2.0

Resource description:

# Dataset Card for MMLU-Redux-2.0

MMLU-Redux is a subset of 5,700 manually re-annotated questions across 57 MMLU subjects.

## News

- [2025.02.25] We corrected one annotation in the Abstract Algebra subset, as noted in Issue [#2](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0/discussions/2).
- [2025.02.08] We corrected one annotation in the High School Mathematics subset, as noted in the [PlatinumBench paper](https://arxiv.org/abs/2502.03461).
- [2025.01.23] MMLU-Redux is accepted to NAACL 2025!

## Dataset Details

### Dataset Description

Each data point in MMLU-Redux contains seven columns:

- **question** (`str`): The original MMLU question.
- **choices** (`List[str]`): The original list of four choices associated with the question from the MMLU dataset.
- **answer** (`int`): The MMLU ground truth label in the form of an array index between 0 and 3.
- **error_type** (`str`): The annotated error type. The value is either one of the six labels proposed in the taxonomy ("ok", "bad_question_clarity", "bad_options_clarity", "no_correct_answer", "multiple_correct_answers", "wrong_groundtruth") or "expert".
- **source** (`str`): The potential source of the question.
- **correct_answer** (`str`): In the case of "no_correct_answer" and "wrong_groundtruth", the annotators can suggest the alternative correct answer.
- **potential_reason** (`str`): A free text column for the annotators to note what they believe to have caused the error.

The question, choices, and answer columns are taken from [cais/mmlu](https://huggingface.co/datasets/cais/mmlu).

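The seven-column layout described above can be sketched as a typed record. This is a minimal illustration, not code from the dataset authors: the names `ReduxRecord`, `is_valid`, and `original_answer_text` are hypothetical, and the example row is invented rather than taken from the dataset.

```python
from typing import List, TypedDict


class ReduxRecord(TypedDict):
    """One row with the seven columns listed in the dataset card."""
    question: str
    choices: List[str]
    answer: int  # index into `choices`, between 0 and 3
    error_type: str
    source: str
    correct_answer: str
    potential_reason: str


# The seven error_type values named in the card.
VALID_ERROR_TYPES = {
    "ok",
    "bad_question_clarity",
    "bad_options_clarity",
    "no_correct_answer",
    "multiple_correct_answers",
    "wrong_groundtruth",
    "expert",
}


def is_valid(record: ReduxRecord) -> bool:
    """Check the structural constraints stated in the card."""
    return (
        len(record["choices"]) == 4
        and 0 <= record["answer"] <= 3
        and record["error_type"] in VALID_ERROR_TYPES
    )


def original_answer_text(record: ReduxRecord) -> str:
    """Resolve the original MMLU label (an index) to its choice text."""
    return record["choices"][record["answer"]]


# Invented row, for illustration only (not a real dataset entry).
example: ReduxRecord = {
    "question": "What is 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,
    "error_type": "ok",
    "source": "",
    "correct_answer": "",
    "potential_reason": "",
}

print(is_valid(example), original_answer_text(example))
```

Keeping `answer` as an index and resolving it on demand mirrors how the card describes the column, so the original MMLU label stays untouched even when the annotators disagree with it.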
- **Dataset Repository:** https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0
- **Code Repository:** https://github.com/aryopg/mmlu-redux
- **Alternative Dataset Repository:** https://zenodo.org/records/11624987
- **Paper:** https://arxiv.org/abs/2406.04127
- **Curated by:** Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Rohit Saxena, Alessio Devoto, Alberto Carlo Maria Mancino, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
- **Language(s) (NLP):** English
- **License:** CC-BY-4.0

### Taxonomy

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644f895e23d7eb05ca695054/ChI5KZPPnkRQv1olPifef.png)

We develop a hierarchical taxonomy to classify the various errors identified in MMLU into specific error types. This figure illustrates our taxonomy for categorising MMLU errors. We categorise errors into two primary groups: samples with errors in the clarity of the questions (Type 1, Question Assessment) and samples with errors in the ground truth answer (Type 2, Ground Truth Verification). Type 1 covers Bad Question Clarity and Bad Options Clarity, while Type 2 is further divided into more fine-grained error types.

Question Assessment (Type 1):

- **(1a) Bad Question Clarity:** The question is poorly presented in terms of various aspects, such as clarity, grammar, and sufficiency of information. For instance, the question may refer to a previous question.
- **(1b) Bad Options Clarity:** The options are unclear, similar, or irrelevant to the question. Most errors in this category stem from incorrect parsing of the options from the original source. For example, a single option might be incorrectly split into two separate options.

Ground Truth Verification (Type 2):

- **(2a) No Correct Answer:** None of the options correctly answer the question.
This error can, for example, arise when the ground-truth options are omitted to reduce the number of options from five to four.
- **(2b) Multiple Correct Answers:** More than one option can be selected as the answer to the question. For example, the options contain a synonym of the ground truth label.
- **(2c) Wrong Ground Truth:** The correct answer differs from the ground truth provided in MMLU. This type of error occurs when the annotated label differs from the correct label, which may be caused by a mistake during manual annotation.

### Dataset Sources

The data used to create MMLU-Redux was obtained from [cais/mmlu](https://huggingface.co/datasets/cais/mmlu), which is also utilised in the [lm-eval-harness framework](https://github.com/EleutherAI/lm-evaluation-harness). To ensure uniformity of our results, the language model (LM) predictions used in our performance analyses were obtained from the [Holistic Evaluation of Language Models (HELM) leaderboard v1.3.0, released on May 15th, 2024](https://crfm.stanford.edu/helm/mmlu/v1.3.0/).

We randomly subsampled 100 questions per MMLU subject to be presented to the annotators. The annotators were instructed to follow the introduced taxonomy by first assessing the question presentation, and then by verifying the ground truth MMLU label. The annotators were encouraged to perform an exact match search using a search engine to find occurrences of the question and multiple-choice options from credible sources. If the annotators found an exact match of the question-options pair, they were asked to evaluate the answer provided by the source. Regardless of whether a label was found in the source, and whether the MMLU label is the same or not, the annotators were asked to decide whether they would follow the label using their expertise.
In the cases where an exact match was not found, the annotators were asked to search for supporting evidence from trusted sources, such as government websites, textbooks, and/or other reputable organisations (*e.g., World Health Organisation (WHO)*). In cases where the annotators were still unsure, they were asked to annotate the question with "Expert", denoting that the question requires more expertise.

MMLU-Redux comprises subsampled test splits of the aforementioned 57 MMLU subsets.

## Uses

To reproduce our results or perform analyses similar to those presented in this study, the user may download the data and utilise all the columns. MMLU-Redux contains both correct and erroneous instances, so the user should look at the value in column "error_type" to filter samples based on the specific error type. In those cases where the error is "no_correct_answer", "multiple_correct_answers" or "wrong_groundtruth", the user may examine the suggested answer reported in the "correct_answer" column. The user should consider that the questions and the options reported are the same as those in the MMLU dataset, and they have not been modified even when affected by bad clarity.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644f895e23d7eb05ca695054/CXuAtMrd1odrSFhHGuIxO.png)

## Citation

**BibTeX:**

```
@misc{gema2024mmlu,
  title={Are We Done with MMLU?},
  author={Aryo Pradipta Gema and Joshua Ong Jun Leang and Giwon Hong and Alessio Devoto and Alberto Carlo Maria Mancino and Rohit Saxena and Xuanli He and Yu Zhao and Xiaotang Du and Mohammad Reza Ghasemi Madani and Claire Barale and Robert McHardy and Joshua Harris and Jean Kaddour and Emile van Krieken and Pasquale Minervini},
  year={2024},
  eprint={2406.04127},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Dataset Card Contact

- aryo.gema@ed.ac.uk
- p.minervini@ed.ac.uk
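The filtering workflow described under Uses can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' tooling: the helper names (`count_error_types`, `select`, `suggested_answers`) are hypothetical, the rows are invented, and the format of the `correct_answer` string is not specified in the card, so it is passed through unparsed.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# error_type values for which the Uses section says a suggested answer
# may be reported in the "correct_answer" column.
HAS_SUGGESTED_ANSWER = {
    "no_correct_answer",
    "multiple_correct_answers",
    "wrong_groundtruth",
}


def count_error_types(records: Iterable[Dict]) -> Counter:
    """Tally how many rows carry each error_type label."""
    return Counter(r["error_type"] for r in records)


def select(records: Iterable[Dict], error_type: str) -> List[Dict]:
    """Keep only the rows annotated with the given error_type."""
    return [r for r in records if r["error_type"] == error_type]


def suggested_answers(records: Iterable[Dict]) -> List[Tuple[str, str]]:
    """Return (question, correct_answer) pairs where a suggestion exists."""
    return [
        (r["question"], r["correct_answer"])
        for r in records
        if r["error_type"] in HAS_SUGGESTED_ANSWER and r["correct_answer"]
    ]


# Invented rows, for illustration only (not real dataset entries).
rows = [
    {"question": "Q1", "error_type": "ok", "correct_answer": ""},
    {"question": "Q2", "error_type": "ok", "correct_answer": ""},
    {"question": "Q3", "error_type": "wrong_groundtruth", "correct_answer": "B"},
    {"question": "Q4", "error_type": "bad_question_clarity", "correct_answer": ""},
]

print(count_error_types(rows))
print(len(select(rows, "ok")))
print(suggested_answers(rows))
```

Selecting the "ok" rows gives the subset on which the original MMLU label was accepted by the annotators, while `suggested_answers` surfaces the rows where an alternative answer was proposed.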

Provider:

maas

Created:

2025-03-25
