cais/mmlu
Hugging Face | Updated 2024-03-08 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/cais/mmlu
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- expert-generated
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- multiple-choice-qa
paperswithcode_id: mmlu
pretty_name: Measuring Massive Multitask Language Understanding
language_bcp47:
- en-US
dataset_info:
- config_name: abstract_algebra
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 17143
dataset_size: 57303.3562203159
- config_name: all
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 6967453
num_examples: 14042
- name: validation
num_bytes: 763484
num_examples: 1531
- name: dev
num_bytes: 125353
num_examples: 285
- name: auxiliary_train
num_bytes: 161000625
num_examples: 99842
download_size: 51503402
dataset_size: 168856915
- config_name: anatomy
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 66985.19833357072
num_examples: 135
- name: validation
num_bytes: 6981.5649902024825
num_examples: 14
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 28864
dataset_size: 76165.9387623697
- config_name: astronomy
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 75420.3714570574
num_examples: 152
- name: validation
num_bytes: 7978.931417374265
num_examples: 16
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 39316
dataset_size: 85598.47831302814
- config_name: auxiliary_train
features:
- name: train
struct:
- name: answer
dtype: int64
- name: choices
sequence: string
- name: question
dtype: string
- name: subject
dtype: string
splits:
- name: train
num_bytes: 161000625
num_examples: 99842
download_size: 47518592
dataset_size: 161000625
- config_name: business_ethics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 31619
dataset_size: 57303.3562203159
- config_name: clinical_knowledge
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 131489.4633955277
num_examples: 265
- name: validation
num_bytes: 14461.813193990856
num_examples: 29
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 51655
dataset_size: 148150.45202811505
- config_name: college_biology
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 71450.87822247542
num_examples: 144
- name: validation
num_bytes: 7978.931417374265
num_examples: 16
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 43017
dataset_size: 81628.98507844617
- config_name: college_chemistry
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 3989.4657086871325
num_examples: 8
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 26781
dataset_size: 55807.30657955822
- config_name: college_computer_science
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 41132
dataset_size: 57303.3562203159
- config_name: college_mathematics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 26779
dataset_size: 57303.3562203159
- config_name: college_medicine
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 85840.29119783506
num_examples: 173
- name: validation
num_bytes: 10971.030698889615
num_examples: 22
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 56303
dataset_size: 99010.49733532117
- config_name: college_physics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 50611.0387409201
num_examples: 102
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 29539
dataset_size: 58295.7295289614
- config_name: computer_security
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 30150
dataset_size: 57303.3562203159
- config_name: conceptual_physics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 116603.86376584532
num_examples: 235
- name: validation
num_bytes: 12965.76355323318
num_examples: 26
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 34968
dataset_size: 131768.802757675
- config_name: econometrics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 56565.27859279305
num_examples: 114
- name: validation
num_bytes: 5984.198563030699
num_examples: 12
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 36040
dataset_size: 64748.652594420244
- config_name: electrical_engineering
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 71947.06487679818
num_examples: 145
- name: validation
num_bytes: 7978.931417374265
num_examples: 16
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 26746
dataset_size: 82125.17173276893
- config_name: elementary_mathematics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 187558.555333998
num_examples: 378
- name: validation
num_bytes: 20446.011757021555
num_examples: 41
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 54987
dataset_size: 210203.74252961605
- config_name: formal_logic
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 62519.518444666
num_examples: 126
- name: validation
num_bytes: 6981.5649902024825
num_examples: 14
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 32884
dataset_size: 71700.25887346498
- config_name: global_facts
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 4986.8321358589155
num_examples: 10
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 19258
dataset_size: 56804.67300673001
- config_name: high_school_biology
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 153817.86284005127
num_examples: 310
- name: validation
num_bytes: 15957.86283474853
num_examples: 32
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 78216
dataset_size: 171974.90111339628
- config_name: high_school_chemistry
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 100725.89082751745
num_examples: 203
- name: validation
num_bytes: 10971.030698889615
num_examples: 22
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 45799
dataset_size: 113896.09696500355
- config_name: high_school_computer_science
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 4488.148922273024
num_examples: 9
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 39072
dataset_size: 56305.989793144116
- config_name: high_school_european_history
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 81870.79796325309
num_examples: 165
- name: validation
num_bytes: 8976.297844546049
num_examples: 18
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 196270
dataset_size: 93046.27124639563
- config_name: high_school_geography
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 98244.95755590372
num_examples: 198
- name: validation
num_bytes: 10971.030698889615
num_examples: 22
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 38255
dataset_size: 111415.16369338983
- config_name: high_school_government_and_politics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 95764.02428428999
num_examples: 193
- name: validation
num_bytes: 10472.347485303722
num_examples: 21
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 52963
dataset_size: 108435.5472081902
- config_name: high_school_macroeconomics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 193512.79518587096
num_examples: 390
- name: validation
num_bytes: 21443.378184193338
num_examples: 43
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 68758
dataset_size: 217155.34880866078
- config_name: high_school_mathematics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 133970.39666714144
num_examples: 270
- name: validation
num_bytes: 14461.813193990856
num_examples: 29
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 45210
dataset_size: 150631.38529972878
- config_name: high_school_microeconomics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 118092.42372881356
num_examples: 238
- name: validation
num_bytes: 12965.76355323318
num_examples: 26
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 49885
dataset_size: 133257.36272064323
- config_name: high_school_physics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 74924.18480273466
num_examples: 151
- name: validation
num_bytes: 8477.614630960157
num_examples: 17
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 45483
dataset_size: 85600.9748722913
- config_name: high_school_psychology
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 270421.7266058966
num_examples: 545
- name: validation
num_bytes: 29920.992815153495
num_examples: 60
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 113158
dataset_size: 302541.8948596466
- config_name: high_school_statistics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 107176.31733371314
num_examples: 216
- name: validation
num_bytes: 11469.713912475507
num_examples: 23
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 74924
dataset_size: 120845.20668478514
- config_name: high_school_us_history
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 101222.0774818402
num_examples: 204
- name: validation
num_bytes: 10971.030698889615
num_examples: 22
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 200043
dataset_size: 114392.2836193263
- config_name: high_school_world_history
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 117596.23707449081
num_examples: 237
- name: validation
num_bytes: 12965.76355323318
num_examples: 26
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 250302
dataset_size: 132761.17606632048
- config_name: human_aging
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 110649.62391397236
num_examples: 223
- name: validation
num_bytes: 11469.713912475507
num_examples: 23
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 41196
dataset_size: 124318.51326504436
- config_name: human_sexuality
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 65000.451716279735
num_examples: 131
- name: validation
num_bytes: 5984.198563030699
num_examples: 12
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 32533
dataset_size: 73183.82571790692
- config_name: international_law
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 60038.58517305227
num_examples: 121
- name: validation
num_bytes: 6482.88177661659
num_examples: 13
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 41592
dataset_size: 68720.64238826535
- config_name: jurisprudence
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 53588.15866685657
num_examples: 108
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 33578
dataset_size: 61272.84945489787
- config_name: logical_fallacies
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 80878.4246546076
num_examples: 163
- name: validation
num_bytes: 8976.297844546049
num_examples: 18
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 33669
dataset_size: 92053.89793775014
- config_name: machine_learning
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 55572.90528414756
num_examples: 112
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 31121
dataset_size: 63257.596072188855
- config_name: management
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 51107.225395242844
num_examples: 103
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 22828
dataset_size: 58791.91618328414
- config_name: marketing
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 116107.67711152257
num_examples: 234
- name: validation
num_bytes: 12467.08033964729
num_examples: 25
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 49747
dataset_size: 130773.93288976635
- config_name: medical_genetics
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 25775
dataset_size: 57303.3562203159
- config_name: miscellaneous
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 388514.15033471014
num_examples: 783
- name: validation
num_bytes: 42886.756368386676
num_examples: 86
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 115097
dataset_size: 433600.08214169333
- config_name: moral_disputes
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 171680.58239567012
num_examples: 346
- name: validation
num_bytes: 18949.96211626388
num_examples: 38
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 76043
dataset_size: 192829.71995053047
- config_name: moral_scenarios
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 444087.05561885773
num_examples: 895
- name: validation
num_bytes: 49868.32135858916
num_examples: 100
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 109869
dataset_size: 496154.5524160434
- config_name: nutrition
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 151833.1162227603
num_examples: 306
- name: validation
num_bytes: 16456.54604833442
num_examples: 33
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 69050
dataset_size: 170488.8377096912
- config_name: philosophy
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 154314.04949437402
num_examples: 311
- name: validation
num_bytes: 16955.229261920314
num_examples: 34
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 61912
dataset_size: 173468.45419489083
- config_name: prehistory
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 160764.47600056973
num_examples: 324
- name: validation
num_bytes: 17453.912475506204
num_examples: 35
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 68826
dataset_size: 180417.5639146724
- config_name: professional_accounting
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 139924.6365190144
num_examples: 282
- name: validation
num_bytes: 15459.179621162639
num_examples: 31
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 87297
dataset_size: 157582.99157877354
- config_name: professional_law
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 761150.3277310925
num_examples: 1534
- name: validation
num_bytes: 84776.14630960157
num_examples: 170
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 1167828
dataset_size: 848125.6494792906
- config_name: professional_medicine
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 134962.7699757869
num_examples: 272
- name: validation
num_bytes: 15459.179621162639
num_examples: 31
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 153242
dataset_size: 152621.12503554605
- config_name: professional_psychology
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 303666.2324455206
num_examples: 612
- name: validation
num_bytes: 34409.14173742652
num_examples: 69
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 159357
dataset_size: 340274.5496215436
- config_name: public_relations
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 54580.53197550207
num_examples: 110
- name: validation
num_bytes: 5984.198563030699
num_examples: 12
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 31500
dataset_size: 62763.90597712925
- config_name: security_studies
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 121565.73030907278
num_examples: 245
- name: validation
num_bytes: 13464.446766819072
num_examples: 27
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 140258
dataset_size: 137229.35251448833
- config_name: sociology
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 99733.51751887196
num_examples: 201
- name: validation
num_bytes: 10971.030698889615
num_examples: 22
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 56480
dataset_size: 112903.72365635807
- config_name: us_foreign_policy
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 49618.6654322746
num_examples: 100
- name: validation
num_bytes: 5485.515349444808
num_examples: 11
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 29027
dataset_size: 57303.3562203159
- config_name: virology
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 82366.98461757584
num_examples: 166
- name: validation
num_bytes: 8976.297844546049
num_examples: 18
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 38229
dataset_size: 93542.45790071838
- config_name: world_religions
features:
- name: question
dtype: string
- name: subject
dtype: string
- name: choices
sequence: string
- name: answer
dtype:
class_label:
names:
'0': A
'1': B
'2': C
'3': D
splits:
- name: test
num_bytes: 84847.91788918957
num_examples: 171
- name: validation
num_bytes: 9474.98105813194
num_examples: 19
- name: dev
num_bytes: 2199.1754385964914
num_examples: 5
download_size: 27165
dataset_size: 96522.07438591801
configs:
- config_name: abstract_algebra
data_files:
- split: test
path: abstract_algebra/test-*
- split: validation
path: abstract_algebra/validation-*
- split: dev
path: abstract_algebra/dev-*
- config_name: all
data_files:
- split: test
path: all/test-*
- split: validation
path: all/validation-*
- split: dev
path: all/dev-*
- split: auxiliary_train
path: all/auxiliary_train-*
- config_name: anatomy
data_files:
- split: test
path: anatomy/test-*
- split: validation
path: anatomy/validation-*
- split: dev
path: anatomy/dev-*
- config_name: astronomy
data_files:
- split: test
path: astronomy/test-*
- split: validation
path: astronomy/validation-*
- split: dev
path: astronomy/dev-*
- config_name: auxiliary_train
data_files:
- split: train
path: auxiliary_train/train-*
- config_name: business_ethics
data_files:
- split: test
path: business_ethics/test-*
- split: validation
path: business_ethics/validation-*
- split: dev
path: business_ethics/dev-*
- config_name: clinical_knowledge
data_files:
- split: test
path: clinical_knowledge/test-*
- split: validation
path: clinical_knowledge/validation-*
- split: dev
path: clinical_knowledge/dev-*
- config_name: college_biology
data_files:
- split: test
path: college_biology/test-*
- split: validation
path: college_biology/validation-*
- split: dev
path: college_biology/dev-*
- config_name: college_chemistry
data_files:
- split: test
path: college_chemistry/test-*
- split: validation
path: college_chemistry/validation-*
- split: dev
path: college_chemistry/dev-*
- config_name: college_computer_science
data_files:
- split: test
path: college_computer_science/test-*
- split: validation
path: college_computer_science/validation-*
- split: dev
path: college_computer_science/dev-*
- config_name: college_mathematics
data_files:
- split: test
path: college_mathematics/test-*
- split: validation
path: college_mathematics/validation-*
- split: dev
path: college_mathematics/dev-*
- config_name: college_medicine
data_files:
- split: test
path: college_medicine/test-*
- split: validation
path: college_medicine/validation-*
- split: dev
path: college_medicine/dev-*
- config_name: college_physics
data_files:
- split: test
path: college_physics/test-*
- split: validation
path: college_physics/validation-*
- split: dev
path: college_physics/dev-*
- config_name: computer_security
data_files:
- split: test
path: computer_security/test-*
- split: validation
path: computer_security/validation-*
- split: dev
path: computer_security/dev-*
- config_name: conceptual_physics
data_files:
- split: test
path: conceptual_physics/test-*
- split: validation
path: conceptual_physics/validation-*
- split: dev
path: conceptual_physics/dev-*
- config_name: econometrics
data_files:
- split: test
path: econometrics/test-*
- split: validation
path: econometrics/validation-*
- split: dev
path: econometrics/dev-*
- config_name: electrical_engineering
data_files:
- split: test
path: electrical_engineering/test-*
- split: validation
path: electrical_engineering/validation-*
- split: dev
path: electrical_engineering/dev-*
- config_name: elementary_mathematics
data_files:
- split: test
path: elementary_mathematics/test-*
- split: validation
path: elementary_mathematics/validation-*
- split: dev
path: elementary_mathematics/dev-*
- config_name: formal_logic
data_files:
- split: test
path: formal_logic/test-*
- split: validation
path: formal_logic/validation-*
- split: dev
path: formal_logic/dev-*
- config_name: global_facts
data_files:
- split: test
path: global_facts/test-*
- split: validation
path: global_facts/validation-*
- split: dev
path: global_facts/dev-*
- config_name: high_school_biology
data_files:
- split: test
path: high_school_biology/test-*
- split: validation
path: high_school_biology/validation-*
- split: dev
path: high_school_biology/dev-*
- config_name: high_school_chemistry
data_files:
- split: test
path: high_school_chemistry/test-*
- split: validation
path: high_school_chemistry/validation-*
- split: dev
path: high_school_chemistry/dev-*
- config_name: high_school_computer_science
data_files:
- split: test
path: high_school_computer_science/test-*
- split: validation
path: high_school_computer_science/validation-*
- split: dev
path: high_school_computer_science/dev-*
- config_name: high_school_european_history
data_files:
- split: test
path: high_school_european_history/test-*
- split: validation
path: high_school_european_history/validation-*
- split: dev
path: high_school_european_history/dev-*
- config_name: high_school_geography
data_files:
- split: test
path: high_school_geography/test-*
- split: validation
path: high_school_geography/validation-*
- split: dev
path: high_school_geography/dev-*
- config_name: high_school_government_and_politics
data_files:
- split: test
path: high_school_government_and_politics/test-*
- split: validation
path: high_school_government_and_politics/validation-*
- split: dev
path: high_school_government_and_politics/dev-*
- config_name: high_school_macroeconomics
data_files:
- split: test
path: high_school_macroeconomics/test-*
- split: validation
path: high_school_macroeconomics/validation-*
- split: dev
path: high_school_macroeconomics/dev-*
- config_name: high_school_mathematics
data_files:
- split: test
path: high_school_mathematics/test-*
- split: validation
path: high_school_mathematics/validation-*
- split: dev
path: high_school_mathematics/dev-*
- config_name: high_school_microeconomics
data_files:
- split: test
path: high_school_microeconomics/test-*
- split: validation
path: high_school_microeconomics/validation-*
- split: dev
path: high_school_microeconomics/dev-*
- config_name: high_school_physics
data_files:
- split: test
path: high_school_physics/test-*
- split: validation
path: high_school_physics/validation-*
- split: dev
path: high_school_physics/dev-*
- config_name: high_school_psychology
data_files:
- split: test
path: high_school_psychology/test-*
- split: validation
path: high_school_psychology/validation-*
- split: dev
path: high_school_psychology/dev-*
- config_name: high_school_statistics
data_files:
- split: test
path: high_school_statistics/test-*
- split: validation
path: high_school_statistics/validation-*
- split: dev
path: high_school_statistics/dev-*
- config_name: high_school_us_history
data_files:
- split: test
path: high_school_us_history/test-*
- split: validation
path: high_school_us_history/validation-*
- split: dev
path: high_school_us_history/dev-*
- config_name: high_school_world_history
data_files:
- split: test
path: high_school_world_history/test-*
- split: validation
path: high_school_world_history/validation-*
- split: dev
path: high_school_world_history/dev-*
- config_name: human_aging
data_files:
- split: test
path: human_aging/test-*
- split: validation
path: human_aging/validation-*
- split: dev
path: human_aging/dev-*
- config_name: human_sexuality
data_files:
- split: test
path: human_sexuality/test-*
- split: validation
path: human_sexuality/validation-*
- split: dev
path: human_sexuality/dev-*
- config_name: international_law
data_files:
- split: test
path: international_law/test-*
- split: validation
path: international_law/validation-*
- split: dev
path: international_law/dev-*
- config_name: jurisprudence
data_files:
- split: test
path: jurisprudence/test-*
- split: validation
path: jurisprudence/validation-*
- split: dev
path: jurisprudence/dev-*
- config_name: logical_fallacies
data_files:
- split: test
path: logical_fallacies/test-*
- split: validation
path: logical_fallacies/validation-*
- split: dev
path: logical_fallacies/dev-*
- config_name: machine_learning
data_files:
- split: test
path: machine_learning/test-*
- split: validation
path: machine_learning/validation-*
- split: dev
path: machine_learning/dev-*
- config_name: management
data_files:
- split: test
path: management/test-*
- split: validation
path: management/validation-*
- split: dev
path: management/dev-*
- config_name: marketing
data_files:
- split: test
path: marketing/test-*
- split: validation
path: marketing/validation-*
- split: dev
path: marketing/dev-*
- config_name: medical_genetics
data_files:
- split: test
path: medical_genetics/test-*
- split: validation
path: medical_genetics/validation-*
- split: dev
path: medical_genetics/dev-*
- config_name: miscellaneous
data_files:
- split: test
path: miscellaneous/test-*
- split: validation
path: miscellaneous/validation-*
- split: dev
path: miscellaneous/dev-*
- config_name: moral_disputes
data_files:
- split: test
path: moral_disputes/test-*
- split: validation
path: moral_disputes/validation-*
- split: dev
path: moral_disputes/dev-*
- config_name: moral_scenarios
data_files:
- split: test
path: moral_scenarios/test-*
- split: validation
path: moral_scenarios/validation-*
- split: dev
path: moral_scenarios/dev-*
- config_name: nutrition
data_files:
- split: test
path: nutrition/test-*
- split: validation
path: nutrition/validation-*
- split: dev
path: nutrition/dev-*
- config_name: philosophy
data_files:
- split: test
path: philosophy/test-*
- split: validation
path: philosophy/validation-*
- split: dev
path: philosophy/dev-*
- config_name: prehistory
data_files:
- split: test
path: prehistory/test-*
- split: validation
path: prehistory/validation-*
- split: dev
path: prehistory/dev-*
- config_name: professional_accounting
data_files:
- split: test
path: professional_accounting/test-*
- split: validation
path: professional_accounting/validation-*
- split: dev
path: professional_accounting/dev-*
- config_name: professional_law
data_files:
- split: test
path: professional_law/test-*
- split: validation
path: professional_law/validation-*
- split: dev
path: professional_law/dev-*
- config_name: professional_medicine
data_files:
- split: test
path: professional_medicine/test-*
- split: validation
path: professional_medicine/validation-*
- split: dev
path: professional_medicine/dev-*
- config_name: professional_psychology
data_files:
- split: test
path: professional_psychology/test-*
- split: validation
path: professional_psychology/validation-*
- split: dev
path: professional_psychology/dev-*
- config_name: public_relations
data_files:
- split: test
path: public_relations/test-*
- split: validation
path: public_relations/validation-*
- split: dev
path: public_relations/dev-*
- config_name: security_studies
data_files:
- split: test
path: security_studies/test-*
- split: validation
path: security_studies/validation-*
- split: dev
path: security_studies/dev-*
- config_name: sociology
data_files:
- split: test
path: sociology/test-*
- split: validation
path: sociology/validation-*
- split: dev
path: sociology/dev-*
- config_name: us_foreign_policy
data_files:
- split: test
path: us_foreign_policy/test-*
- split: validation
path: us_foreign_policy/validation-*
- split: dev
path: us_foreign_policy/dev-*
- config_name: virology
data_files:
- split: test
path: virology/test-*
- split: validation
path: virology/validation-*
- split: dev
path: virology/dev-*
- config_name: world_religions
data_files:
- split: test
path: world_religions/test-*
- split: validation
path: world_religions/validation-*
- split: dev
path: world_religions/dev-*
---
# Dataset Card for MMLU
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Repository**: https://github.com/hendrycks/test
- **Paper**: https://arxiv.org/abs/2009.03300
### Dataset Summary
[Measuring Massive Multitask Language Understanding](https://arxiv.org/pdf/2009.03300) by [Dan Hendrycks](https://people.eecs.berkeley.edu/~hendrycks/), [Collin Burns](http://collinpburns.com), [Steven Basart](https://stevenbas.art), Andy Zou, Mantas Mazeika, [Dawn Song](https://people.eecs.berkeley.edu/~dawnsong/), and [Jacob Steinhardt](https://www.stat.berkeley.edu/~jsteinhardt/) (ICLR 2021).
This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability.
A complete list of tasks: ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']
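Each task name above doubles as a configuration name, and the metadata header maps every per-subject configuration to three split patterns of the form `<config>/<split>-*`. A minimal Python sketch of that mapping follows. The helper name `data_files_for` is hypothetical and not part of any library, and it does not cover the `all` and `auxiliary_train` configurations, whose split sets differ.

```python
# Hypothetical helper: rebuild the split -> glob mapping that the metadata
# header declares for a per-subject config such as "formal_logic".
SPLITS = ("test", "validation", "dev")


def data_files_for(config_name: str) -> dict:
    """Return {split: glob} following the `<config>/<split>-*` layout."""
    return {split: f"{config_name}/{split}-*" for split in SPLITS}


if __name__ == "__main__":
    # Print the three patterns for one subject configuration.
    for split, pattern in data_files_for("formal_logic").items():
        print(f"{split:<10} {pattern}")
```

The patterns are relative to the dataset repository root, so the same function works for any of the 57 subject names.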
### Supported Tasks and Leaderboards
| Model | Authors | Humanities | Social Science | STEM | Other | Average |
|------------------------------------|----------|:-------:|:-------:|:-------:|:-------:|:-------:|
| [UnifiedQA](https://arxiv.org/abs/2005.00700) | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
| [GPT-3](https://arxiv.org/abs/2005.14165) (few-shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
| [GPT-2](https://arxiv.org/abs/2005.14165) | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
| Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
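Scores like those in the table are accuracies in percent, and with four choices per question a random guesser lands at 25.0. The sketch below shows one way to compute per-category and overall accuracy from predictions. The card does not state how the Average column is aggregated, so the question-weighted average used here is an assumption, and the function name `accuracy_by_group` and the toy records are illustrative only, not real results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy (in percent) per group and over all questions.

    `records` is an iterable of (group, predicted_index, gold_index) tuples.
    Returns ({group: accuracy_percent}, overall_percent), where the overall
    figure weights every question equally (an assumed aggregation).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, gold in records:
        total[group] += 1
        correct[group] += int(predicted == gold)
    if not total:
        raise ValueError("no records to score")
    per_group = {g: 100.0 * correct[g] / total[g] for g in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_group, overall


if __name__ == "__main__":
    # Toy predictions, made up for illustration.
    toy = [("STEM", 0, 0), ("STEM", 1, 2), ("Other", 3, 3), ("Other", 3, 3)]
    print(accuracy_by_group(toy))
```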
### Languages
English
## Dataset Structure
### Data Instances
An example from the anatomy subtask looks as follows:
```
{
"question": "What is the embryological origin of the hyoid bone?",
"choices": ["The first pharyngeal arch", "The first and second pharyngeal arches", "The second pharyngeal arch", "The second and third pharyngeal arches"],
"answer": "D"
}
```
### Data Fields
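- `question`: a string feature
- `subject`: a string feature naming the subtask the question belongs to
- `choices`: a list of 4 string features
- `answer`: a ClassLabel feature whose classes `0` to `3` map to the letters `A` to `D`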
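Because `answer` is a ClassLabel, a loaded record may carry an integer class index rather than the letter shown in the example above. The following sketch normalises either form and looks up the text of the correct choice. The function names are hypothetical, and the label order `A` to `D` is taken from the class names declared in the metadata header.

```python
LETTERS = ("A", "B", "C", "D")  # class names from the metadata header


def answer_letter(record: dict) -> str:
    """Normalise `answer` to a letter; accepts an int index or a letter."""
    answer = record["answer"]
    if isinstance(answer, int):
        return LETTERS[answer]
    if answer in LETTERS:
        return answer
    raise ValueError(f"unexpected answer value: {answer!r}")


def answer_text(record: dict) -> str:
    """Return the text of the correct choice for a record."""
    return record["choices"][LETTERS.index(answer_letter(record))]


if __name__ == "__main__":
    example = {
        "question": "What is the embryological origin of the hyoid bone?",
        "choices": [
            "The first pharyngeal arch",
            "The first and second pharyngeal arches",
            "The second pharyngeal arch",
            "The second and third pharyngeal arches",
        ],
        "answer": 3,  # integer form of "D"
    }
    print(answer_letter(example), "->", answer_text(example))
```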
### Data Splits
- `auxiliary_train`: auxiliary multiple-choice training questions from ARC, MC_TEST, OBQA, RACE, etc.
- `dev`: 5 examples per subtask, meant for the few-shot setting
- `validation`: a held-out set per subtask that can be used for model selection, such as hyperparameter tuning
- `test`: at least 100 examples per subtask
| | auxiliary_train | dev | validation | test |
| ----- | :------: | :-----: | :-----: | :-----: |
| TOTAL | 99842 | 285 | 1531 | 14042 |
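As a rough illustration of how the `dev` split is meant to be used, the sketch below assembles a few-shot prompt from dev examples followed by one unanswered test question. The prompt layout, the function names `render` and `few_shot_prompt`, and the toy records are assumptions made for illustration, and `answer` is assumed to be an integer class index.

```python
LETTERS = "ABCD"


def render(record: dict, with_answer: bool) -> str:
    """Format one question with lettered choices and an answer line."""
    lines = [record["question"]]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(record["choices"])]
    if with_answer:
        lines.append("Answer: " + LETTERS[record["answer"]])
    else:
        lines.append("Answer:")
    return "\n".join(lines)


def few_shot_prompt(dev_records, test_record, k=5):
    """Join up to k solved dev examples with one unsolved test question."""
    shots = [render(r, with_answer=True) for r in dev_records[:k]]
    return "\n\n".join(shots + [render(test_record, with_answer=False)])


if __name__ == "__main__":
    # Toy records, made up for illustration.
    dev = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}]
    test = {"question": "3 + 3 = ?", "choices": ["5", "6", "7", "8"], "answer": 1}
    print(few_shot_prompt(dev, test))
```

With the 5 dev examples available per subtask, `k=5` uses every available shot.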
## Dataset Creation
### Curation Rationale
Transformer models have driven recent progress in NLP by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and numerous websites. These models consequently see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. To bridge the gap between the wide-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn.
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[MIT License](https://github.com/hendrycks/test/blob/master/LICENSE)
### Citation Information
If you find this useful in your research, please consider citing the test and also the [ETHICS](https://arxiv.org/abs/2008.02275) dataset it draws from:
```
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
```
### Contributions
Thanks to [@andyzoujm](https://github.com/andyzoujm) for adding this dataset.
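### Dataset Metadata
- Annotation creators: no-annotation
- Language creators: expert-generated
- Language: English (en)
- License: MIT (mit)
- Multilinguality: monolingual
- Size category: 10,000 < n < 100,000
- Source datasets: original
- Task category: question-answering
- Task ID: multiple-choice-qa
- PapersWithCode ID: mmlu
- Pretty name: Measuring Massive Multitask Language Understanding
- Language BCP 47 tag: en-US
## Dataset Configuration Details
The dataset contains one configuration per subject, all sharing the following structure:
- Config name: [subject name] ([config identifier])
Features:
- `question`: string, the question stem
- `subject`: string, the subject the question belongs to
- `choices`: sequence of strings holding the 4 candidate options
- `answer`: class label with the mapping '0': A, '1': B, '2': C, '3': D
Splits:
- test: evaluation examples for the subject
- validation: examples for model validation
- dev: exactly 5 examples per subtask, intended for few-shot settings
Download size: the download size of the configuration
Dataset size: the total data size of the configuration
The full list of configurations corresponds to the 57 subject tasks listed above.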
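Provider:
cais
Summary of Original Information
Dataset Overview
Basic Information
- Language: English (en)
- License: MIT
- Multilinguality: monolingual
- Size category: 10K<n<100K
- Source datasets: original
- Task category: question answering
- Task ID: multiple-choice question answering (multiple-choice-qa)
- PapersWithCode ID: mmlu
- Pretty name: Measuring Massive Multitask Language Understanding
Dataset Structure
Features
- question: string
- subject: string
- choices: sequence of strings
- answer: class label with the classes A, B, C, D
Splits
- test: example and byte counts vary by configuration
- validation: example and byte counts vary by configuration
- dev: example and byte counts vary by configuration
- auxiliary_train: example and byte counts vary by configuration
Dataset Size
- Download size: varies by configuration
- Dataset size: varies by configuration
Configuration Details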
Config: abstract_algebra
- test: 100 examples, 49618.6654322746 bytes
- validation: 11 examples, 5485.515349444808 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 17143 bytes
- Dataset size: 57303.3562203159 bytes
Config: all
- test: 14042 examples, 6967453 bytes
- validation: 1531 examples, 763484 bytes
- dev: 285 examples, 125353 bytes
- auxiliary_train: 99842 examples, 161000625 bytes
- Download size: 51503402 bytes
- Dataset size: 168856915 bytes
Config: anatomy
- test: 135 examples, 66985.19833357072 bytes
- validation: 14 examples, 6981.5649902024825 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 28864 bytes
- Dataset size: 76165.9387623697 bytes
Config: astronomy
- test: 152 examples, 75420.3714570574 bytes
- validation: 16 examples, 7978.931417374265 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 39316 bytes
- Dataset size: 85598.47831302814 bytes
Config: auxiliary_train
- train: 99842 examples, 161000625 bytes
- Download size: 47518592 bytes
- Dataset size: 161000625 bytes
Config: business_ethics
- test: 100 examples, 49618.6654322746 bytes
- validation: 11 examples, 5485.515349444808 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 31619 bytes
- Dataset size: 57303.3562203159 bytes
Config: clinical_knowledge
- test: 265 examples, 131489.4633955277 bytes
- validation: 29 examples, 14461.813193990856 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 51655 bytes
- Dataset size: 148150.45202811505 bytes
Config: college_biology
- test: 144 examples, 71450.87822247542 bytes
- validation: 16 examples, 7978.931417374265 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 43017 bytes
- Dataset size: 81628.98507844617 bytes
Config: college_chemistry
- test: 100 examples, 49618.6654322746 bytes
- validation: 8 examples, 3989.4657086871325 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 26781 bytes
- Dataset size: 55807.30657955822 bytes
Config: college_computer_science
- test: 100 examples, 49618.6654322746 bytes
- validation: 11 examples, 5485.515349444808 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 41132 bytes
- Dataset size: 57303.3562203159 bytes
Config: college_mathematics
- test: 100 examples, 49618.6654322746 bytes
- validation: 11 examples, 5485.515349444808 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 26779 bytes
- Dataset size: 57303.3562203159 bytes
Config: college_medicine
- test: 173 examples, 85840.29119783506 bytes
- validation: 22 examples, 10971.030698889615 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 56303 bytes
- Dataset size: 99010.49733532117 bytes
Config: college_physics
- test: 102 examples, 50611.0387409201 bytes
- validation: 11 examples, 5485.515349444808 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 29539 bytes
- Dataset size: 58295.7295289614 bytes
Config: computer_security
- test: 100 examples, 49618.6654322746 bytes
- validation: 11 examples, 5485.515349444808 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 30150 bytes
- Dataset size: 57303.3562203159 bytes
Config: conceptual_physics
- test: 235 examples, 116603.86376584532 bytes
- validation: 26 examples, 12965.76355323318 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 34968 bytes
- Dataset size: 131768.802757675 bytes
Config: econometrics
- test: 114 examples, 56565.27859279305 bytes
- validation: 12 examples, 5984.198563030699 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 36040 bytes
- Dataset size: 64748.652594420244 bytes
Config: electrical_engineering
- test: 145 examples, 71947.06487679818 bytes
- validation: 16 examples, 7978.931417374265 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 26746 bytes
- Dataset size: 82125.17173276893 bytes
Config: elementary_mathematics
- test: 378 examples, 187558.555333998 bytes
- validation: 41 examples, 20446.011757021555 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 54987 bytes
- Dataset size: 210203.74252961605 bytes
Config: formal_logic
- test: 126 examples, 62519.518444666 bytes
- validation: 14 examples, 6981.5649902024825 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 32884 bytes
- Dataset size: 71700.25887346498 bytes
Config: global_facts
- test: 100 examples, 49618.6654322746 bytes
- validation: 10 examples, 4986.8321358589155 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 19258 bytes
- Dataset size: 56804.67300673001 bytes
Config: high_school_biology
- test: 310 examples, 153817.86284005127 bytes
- validation: 32 examples, 15957.86283474853 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 78216 bytes
- Dataset size: 171974.90111339628 bytes
Config: high_school_chemistry
- test: 203 examples, 100725.89082751745 bytes
- validation: 22 examples, 10971.030698889615 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 45799 bytes
- Dataset size: 113896.09696500355 bytes
Config: high_school_computer_science
- test: 100 examples, 49618.6654322746 bytes
- validation: 9 examples, 4488.148922273024 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 39072 bytes
- Dataset size: 56305.989793144116 bytes
Config: high_school_european_history
- test: 165 examples, 81870.79796325309 bytes
- validation: 18 examples, 8976.297844546049 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 196270 bytes
- Dataset size: 93046.27124639563 bytes
Config: high_school_geography
- test: 198 examples, 98244.95755590372 bytes
- validation: 22 examples, 10971.030698889615 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 38255 bytes
- Dataset size: 111415.16369338983 bytes
Config: high_school_government_and_politics
- test: 193 examples, 95764.02428428999 bytes
- validation: 21 examples, 10472.347485303722 bytes
- dev: 5 examples, 2199.1754385964914 bytes
- Download size: 52963 bytes
- Dataset size: 108435.5472081902 bytes
Config:
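Collected Summary
Dataset Introduction

Construction
MMLU was built to evaluate massive multitask language understanding, with expert-generated questions. It spans many fields, such as mathematics, biology, and physics; each field consists of multiple-choice questions, and each question comes with a subject, four choices, and one correct answer. The data are divided into an auxiliary training split, a validation split, a test split, and a small dev split: the auxiliary training split is the largest at 99,842 examples, while the validation, test, and dev splits contain 1,531, 14,042, and 285 examples respectively. The breadth of subjects is meant to give a comprehensive picture of a language model's multitask understanding.
Features
The defining feature of MMLU is its wide subject coverage, which gives multitask language understanding a rich set of test scenarios. Each question consists of a question text and four candidate answers, so a model has to understand both the question and the options. The dataset pairs a large auxiliary training split with a test split big enough to evaluate models across many tasks. It is released under the MIT license, which permits free use and modification.
Usage
Using MMLU is fairly straightforward. Users can download the dataset and process it from Python for model training and evaluation. It ships several splits, including auxiliary training, validation, dev, and test, which makes evaluation and debugging convenient. Because every question carries a subject, users can also partition tasks and train or evaluate models by subject. Note the storage cost: the `all` configuration has a download size of 51503402 bytes (about 51.5 MB) and a dataset size of 168856915 bytes, so make sure enough disk space and compute are available.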
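A minimal sketch of the usage described above, grouping records by their `subject` field. The helper names are hypothetical. `load_mmlu` assumes the Hugging Face `datasets` package is installed, that the repository id `cais/mmlu` from the page header is reachable over the network, and that `all` and `test` are valid configuration and split names as listed in the metadata.

```python
from collections import defaultdict


def group_by_subject(records):
    """Bucket records (dicts with a "subject" key) by subject name."""
    groups = defaultdict(list)
    for record in records:
        groups[record["subject"]].append(record)
    return dict(groups)


def load_mmlu(config="all", split="test"):
    """Load one split; assumes `datasets` is installed and network access."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset("cais/mmlu", config, split=split)


if __name__ == "__main__":
    # Toy records, made up for illustration; real ones come from load_mmlu().
    toy = [
        {"subject": "anatomy", "question": "q1"},
        {"subject": "virology", "question": "q2"},
        {"subject": "anatomy", "question": "q3"},
    ]
    for subject, items in group_by_subject(toy).items():
        print(subject, len(items))
```

Keeping the `datasets` import inside `load_mmlu` lets the grouping helper run without the package installed.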
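Background and Challenges
Background Overview
Language understanding has long been a focus of research in artificial intelligence and natural language processing. As machine learning has advanced, multitask language understanding has become a research direction of its own. MMLU (Measuring Massive Multitask Language Understanding) was created against this background to evaluate and improve how well machines perform across many language understanding tasks. The dataset is hosted under the cais organization (Center for AI Safety), its questions are expert-generated, and it covers a broad range of subjects such as mathematics, science, and history. It gives researchers a comprehensive platform for evaluating multitask language understanding and has become an influential benchmark in natural language processing.
Current Challenges
Building MMLU involved several challenges. First, ensuring diversity and coverage is a key problem: because the dataset spans many subjects, collecting and curating high-quality, representative questions and answers is difficult. Second, the dataset needs to be balanced so that models are trained and evaluated fairly on every task. Finally, there is the question of how to assess multitask language understanding effectively: conventional metrics may not fully capture performance on complex tasks, so new evaluation methods may need to be explored.
Common Scenarios
Classic Use Case
In natural language processing, MMLU has become an important benchmark for measuring how well models generalize. It covers a wide range of subjects, including mathematics, physics, chemistry, history, and economics, each with a large number of questions and answers in multiple-choice form. Its classic use is as a benchmark for the understanding and reasoning ability of language models within specific subjects, helping researchers gauge how well a model has mastered each area of knowledge.
Practical Applications
In practice, MMLU can help educational institutions assess the knowledge level of students or machine learning models across subjects. Comparing a model's MMLU results with those of human experts indicates its understanding and reasoning ability in a given subject, which can in turn give institutions useful teaching feedback and suggestions for improvement.
Derived and Related Work
Researchers have built a range of follow-up work on MMLU, such as developing language models for specific subjects and studying how well knowledge transfers across subjects. This work has advanced multitask language models and opened new applications in education, research, and other fields.
The content above was collected and summarized by 遇见数据集.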



