MMLU
ModelScope community · Updated 2025-11-07 · Indexed 2025-02-15
Download link:
https://modelscope.cn/datasets/sbintuitions/MMLU
Overview:
A public clone of MMLU, published to ensure that evaluation scores are reproducible and to host the SB Intuitions revised version.
Source: [cais/mmlu on Hugging Face](https://huggingface.co/datasets/cais/mmlu)
# Measuring Massive Multitask Language Understanding (MMLU)
> This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
> The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn.
> This covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
> To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability.
## Licensing Information
[MIT License](https://choosealicense.com/licenses/mit/)
## Citation Information
```
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
```
# Subsets
## default
- `qid` (`str`): ID that uniquely identifies a question within the dataset
- `subject` (`str`): The question's [subcategory](https://github.com/hendrycks/test/blob/master/categories.py#L1). 57 in total
- `tag` (`str`): The [category](https://github.com/hendrycks/test/blob/master/categories.py#L61C1-L61C11) that groups the 57 subcategories. 4 in total. Uses the [naming from lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/README.md)
- `description` (`str`): The system description of the input prompt, set per `subject`. Taken from [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/README.md)
- `question` (`str`): Question text
- `choices` (`list[str]`): Answer choices (4)
- `answer` (`int`): Index (0-3) of the correct choice in `choices`
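
As a minimal sketch of how these fields fit together, the snippet below builds a four-option prompt from one record and checks a predicted letter against `answer`. The record values (including the `qid` format and the `description` text), the `Record` type, and the helper names `format_prompt` and `is_correct` are illustrative assumptions, not part of the dataset or of lm-evaluation-harness.

```python
from typing import TypedDict


class Record(TypedDict):
    """One row of the `default` subset, following the field list above."""
    qid: str
    subject: str
    tag: str
    description: str
    question: str
    choices: list[str]
    answer: int


LETTERS = "ABCD"  # index 0-3 in `answer` maps to A-D


def format_prompt(record: Record) -> str:
    """Render description, question, and lettered choices as one prompt."""
    lines = [record["description"].strip(), "", record["question"]]
    for letter, choice in zip(LETTERS, record["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def is_correct(record: Record, predicted_letter: str) -> bool:
    """Compare a predicted letter (case-insensitive) with the gold index."""
    letter = predicted_letter.strip().upper()
    if len(letter) != 1 or letter not in LETTERS:
        return False
    return LETTERS.index(letter) == record["answer"]


# Hypothetical record: every value here is made up for illustration.
example: Record = {
    "qid": "example-0",
    "subject": "elementary_mathematics",
    "tag": "stem",
    "description": "The following are multiple choice questions (with answers) about elementary mathematics.",
    "question": "What is 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,
}

print(format_prompt(example))
print(is_correct(example, "B"))
```

Keeping `answer` as an integer index and converting to a letter only at prompt time means the same record can be scored regardless of how the choices are labeled in the prompt.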
## wo_label_bias
- A version in which the choices (`choices`) are reordered so that the correct labels are not skewed, even when counted per subject
- split: dev only
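
The reordering procedure used for this subset is not described here. The sketch below shows one possible way to measure per-subject label skew and to remove it, by cycling the correct choice through positions 0-3 within each subject. The function names `label_counts` and `rebalance` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def label_counts(records):
    """Count how often each answer index is the correct one, per subject."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["subject"]][r["answer"]] += 1
    return counts


def rebalance(records):
    """Return copies whose correct choice cycles through the positions.

    Assumption: a simple round-robin per subject, not the dataset's
    actual method. The input records are left unmodified.
    """
    seen = Counter()
    out = []
    for r in records:
        target = seen[r["subject"]] % len(r["choices"])
        seen[r["subject"]] += 1
        choices = list(r["choices"])
        correct = choices.pop(r["answer"])  # take the gold choice out
        choices.insert(target, correct)     # and place it at the target slot
        out.append({**r, "choices": choices, "answer": target})
    return out


# Hypothetical, maximally skewed input: the answer is always index 0.
skewed = [
    {"subject": "demo", "choices": ["right", "w1", "w2", "w3"], "answer": 0}
    for _ in range(8)
]
balanced = rebalance(skewed)
print(dict(label_counts(skewed)["demo"]))
print(dict(label_counts(balanced)["demo"]))
```

Questions whose choices refer to other positions (for example "All of the above") would need separate handling under any reordering scheme.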
Provider: maas
Created: 2025-02-13



