MMLU

Name: MMLU
Creator: maas
Published: 2025-11-07 16:23:18
License: no description provided

ModelScope community. Updated 2025-11-07. Indexed 2025-02-15.

Download link:

https://modelscope.cn/datasets/sbintuitions/MMLU


Description:

A clone published to ensure reproducibility of evaluation scores and to release the SB Intuitions revised version.

Source: [cais/mmlu on Hugging Face](https://huggingface.co/datasets/cais/mmlu)

# Measuring Massive Multitask Language Understanding (MMLU)

> This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
> The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn.
> This covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
> To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability.

## Licensing Information

[MIT License](https://choosealicense.com/licenses/mit/)

## Citation Information

```
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
```

# Subsets

## default

- `qid` (`str`): ID that uniquely identifies a question within the dataset
- `subject` (`str`): The question's [subcategory](https://github.com/hendrycks/test/blob/master/categories.py#L1). 57 in total
- `tag` (`str`): The [category](https://github.com/hendrycks/test/blob/master/categories.py#L61C1-L61C11) that groups the 57 subcategories. 4 in total. Uses the [naming from lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/README.md)
- `description` (`str`): The system description of the input prompt, set per `subject`. Taken from [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/README.md)
- `question` (`str`): Question text
- `choices` (`list[str]`): Answer choices (4)
- `answer` (`int`): Index (0-3) of the correct choice in `choices`

## wo_label_bias

- A version in which the choices (`choices`) are reordered so that the correct labels are not skewed, even when viewed per subject
- split: dev only
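To make the schema concrete, here is a minimal Python sketch of how one record with these fields could be rendered into a prompt, and how choices could be reordered so that the correct index is spread evenly within each subject. The record values, the helper names (`format_prompt`, `move_answer`, `balance_labels`), and the round-robin scheme are all assumptions for illustration. The dataset card does not say how `wo_label_bias` was actually produced.

```python
from collections import Counter

LETTERS = "ABCD"

# Hypothetical record following the `default` subset schema; values are made up.
record = {
    "qid": "example-0",
    "subject": "elementary_mathematics",
    "tag": "stem",
    "description": "The following are multiple choice questions (with answers) about elementary mathematics.",
    "question": "What is 7 + 5?",
    "choices": ["10", "11", "12", "13"],
    "answer": 2,
}


def format_prompt(rec):
    """Render description, question, and lettered choices as one prompt string."""
    lines = [rec["description"], "", rec["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(rec["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)


def move_answer(rec, target):
    """Return a copy of rec whose correct choice sits at index `target`."""
    choices = list(rec["choices"])
    correct = choices.pop(rec["answer"])
    choices.insert(target, correct)
    return {**rec, "choices": choices, "answer": target}


def balance_labels(records):
    """Assign correct-answer positions round-robin within each subject.

    Illustrative only: one simple way to avoid label skew per subject,
    not necessarily the procedure used for `wo_label_bias`.
    """
    seen = Counter()
    balanced = []
    for rec in records:
        target = seen[rec["subject"]] % len(rec["choices"])
        seen[rec["subject"]] += 1
        balanced.append(move_answer(rec, target))
    return balanced
```

Calling `format_prompt(record)` yields the description, a blank line, the question, four lettered options, and a trailing `Answer:` line. `balance_labels` keeps the correct choice text intact and only changes its position, updating `answer` to match.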


Provider:

maas

Created:

2025-02-13
