
mmlu

ModelScope community · Updated 2026-05-24 · Indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/opencompass/mmlu
Resource description:
# Dataset Card for MMLU

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository**: https://github.com/hendrycks/test
- **Paper**: https://arxiv.org/abs/2009.03300

### Dataset Summary

[Measuring Massive Multitask Language Understanding](https://arxiv.org/pdf/2009.03300) by [Dan Hendrycks](https://people.eecs.berkeley.edu/~hendrycks/), [Collin Burns](http://collinpburns.com), [Steven Basart](https://stevenbas.art), Andy Zou, Mantas Mazeika, [Dawn Song](https://people.eecs.berkeley.edu/~dawnsong/), and [Jacob Steinhardt](https://www.stat.berkeley.edu/~jsteinhardt/) (ICLR 2021).

This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. It covers 57 tasks, including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability.

A complete list of tasks:

['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']

### Supported Tasks and Leaderboards

| Model | Authors | Humanities | Social Science | STEM | Other | Average |
|------------------------------------|----------|:-------:|:-------:|:-------:|:-------:|:-------:|
| [UnifiedQA](https://arxiv.org/abs/2005.00700) | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
| [GPT-3](https://arxiv.org/abs/2005.14165) (few-shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
| [GPT-2](https://arxiv.org/abs/2005.14165) | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
| Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |

### Languages

English

## Dataset Structure

### Data Instances

An example from the anatomy subtask looks as follows:

```
{
  "question": "What is the embryological origin of the hyoid bone?",
  "choices": ["The first pharyngeal arch", "The first and second pharyngeal arches", "The second pharyngeal arch", "The second and third pharyngeal arches"],
  "answer": "D"
}
```

### Data Fields

- `question`: a string feature
- `choices`: a list of 4 string features
- `answer`: a ClassLabel feature

### Data Splits

- `auxiliary_train`: auxiliary multiple-choice training questions from ARC, MC_TEST, OBQA, RACE, etc.
- `dev`: 5 examples per subtask, meant for the few-shot setting
- `test`: at least 100 examples per subtask

|       | auxiliary_train | dev | val | test |
| ----- | :------: | :-----: | :-----: | :-----: |
| TOTAL | 99842 | 285 | 1531 | 14042 |

## Dataset Creation

### Curation Rationale

Transformer models have driven recent progress in NLP by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and numerous websites. These models consequently see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. To bridge the gap between the wide-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[MIT License](https://github.com/hendrycks/test/blob/master/LICENSE)

### Citation Information

If you find this useful in your research, please consider citing the test and also the [ETHICS](https://arxiv.org/abs/2008.02275) dataset it draws from:

```
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
```

### Contributions

Thanks to [@andyzoujm](https://github.com/andyzoujm) for adding this dataset.
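
### Usage sketch (illustrative)

The card describes each record as a `question`, four `choices`, and an `answer`, and says the `dev` split holds 5 examples per subtask for the few-shot setting. The sketch below shows one way those pieces might fit together: it renders records as lettered blocks, prepends dev examples as shots, and scores predicted letters. The prompt layout, the helper names (`format_example`, `build_prompt`, `accuracy`), and the handling of an integer-encoded `answer` are assumptions made for illustration; they are not taken from the dataset card or from the authors' evaluation code.

```python
from typing import Dict, List, Sequence

LETTERS = "ABCD"  # assumption: the four choices are labelled A-D in order

# The anatomy record quoted in the card's "Data Instances" section.
EXAMPLE = {
    "question": "What is the embryological origin of the hyoid bone?",
    "choices": [
        "The first pharyngeal arch",
        "The first and second pharyngeal arches",
        "The second pharyngeal arch",
        "The second and third pharyngeal arches",
    ],
    "answer": "D",
}


def answer_letter(record: Dict) -> str:
    """Return the gold answer as a letter.

    Assumption: `answer` may arrive either as a letter ("D") or as a
    class index (3), since the card calls it a ClassLabel feature.
    """
    answer = record["answer"]
    if isinstance(answer, int):
        return LETTERS[answer]
    return str(answer).strip().upper()


def format_example(record: Dict, include_answer: bool = True) -> str:
    """Render one record as a question, lettered choices, and an answer line."""
    lines: List[str] = [record["question"]]
    for letter, choice in zip(LETTERS, record["choices"]):
        lines.append(f"{letter}. {choice}")
    suffix = f" {answer_letter(record)}" if include_answer else ""
    lines.append(f"Answer:{suffix}")
    return "\n".join(lines)


def build_prompt(dev_examples: Sequence[Dict], test_record: Dict) -> str:
    """Join solved dev examples (the shots) with one unsolved test record."""
    blocks = [format_example(r) for r in dev_examples]
    blocks.append(format_example(test_record, include_answer=False))
    return "\n\n".join(blocks)


def accuracy(predictions: Sequence[str], records: Sequence[Dict]) -> float:
    """Fraction of predicted letters that match the gold answers."""
    if not records:
        return 0.0
    correct = sum(
        pred.strip().upper() == answer_letter(rec)
        for pred, rec in zip(predictions, records)
    )
    return correct / len(records)


if __name__ == "__main__":
    # One-shot prompt that reuses the same record as both shot and query.
    print(build_prompt([EXAMPLE], EXAMPLE))
```

In practice the shots would come from the `dev` split of the same subtask as the test question, and `predictions` would hold whatever letters the model under evaluation returns.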

Provider:
maas
Created:
2024-05-13
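Background overview
MMLU is a large-scale multitask test for evaluating language understanding. It consists of multiple-choice questions drawn from 57 knowledge areas, spanning the humanities, social sciences, hard sciences, and other subjects. By testing knowledge across many domains, it measures a model's world knowledge and problem-solving ability. The data is divided into auxiliary training, dev, validation, and test splits, with more than 110,000 examples in total.
The summary above was collected and generated by 遇见数据集.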