cais/mmlu

Name: cais/mmlu
Creator: cais
Published: 2024-03-08 20:36:26
License: mit (per the dataset card metadata; this listing page gives no description)

Source: Hugging Face. Updated 2024-03-08; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/cais/mmlu
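
A minimal sketch of fetching one subject through the mirror above, assuming the `datasets` package is installed and that the mirror is reachable through the `HF_ENDPOINT` environment variable read by the Hugging Face client libraries. The helper names `configure_mirror` and `load_mmlu_split` are hypothetical, introduced only for illustration; the config name `abstract_algebra` and the split name `test` come from the dataset card metadata.

```python
import os

# Assumption: the mirror host is the one in the download link above.
MIRROR_ENDPOINT = "https://hf-mirror.com"


def configure_mirror(endpoint: str = MIRROR_ENDPOINT) -> str:
    """Point Hugging Face client libraries at the mirror via HF_ENDPOINT."""
    os.environ["HF_ENDPOINT"] = endpoint
    return os.environ["HF_ENDPOINT"]


def load_mmlu_split(config_name: str = "abstract_algebra", split: str = "test"):
    """Load one config/split of cais/mmlu (needs `datasets` and network access)."""
    configure_mirror()
    # Imported lazily: the endpoint is read when the library is first imported,
    # so this only takes effect if `datasets` was not imported earlier.
    from datasets import load_dataset

    return load_dataset("cais/mmlu", config_name, split=split)
```

Calling `load_mmlu_split("anatomy", "dev")` would be expected to return the five-example dev split listed for that config, provided the mirror serves the same files as the original repository.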

Resource description:

--- annotations_creators: - no-annotation language_creators: - expert-generated language: - en license: - mit multilinguality: - monolingual size_categories: - 10K<n<100K source_datasets: - original task_categories: - question-answering task_ids: - multiple-choice-qa paperswithcode_id: mmlu pretty_name: Measuring Massive Multitask Language Understanding language_bcp47: - en-US dataset_info: - config_name: abstract_algebra features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 17143 dataset_size: 57303.3562203159 - config_name: all features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 6967453 num_examples: 14042 - name: validation num_bytes: 763484 num_examples: 1531 - name: dev num_bytes: 125353 num_examples: 285 - name: auxiliary_train num_bytes: 161000625 num_examples: 99842 download_size: 51503402 dataset_size: 168856915 - config_name: anatomy features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 66985.19833357072 num_examples: 135 - name: validation num_bytes: 6981.5649902024825 num_examples: 14 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 28864 dataset_size: 76165.9387623697 - config_name: astronomy features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 
75420.3714570574 num_examples: 152 - name: validation num_bytes: 7978.931417374265 num_examples: 16 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 39316 dataset_size: 85598.47831302814 - config_name: auxiliary_train features: - name: train struct: - name: answer dtype: int64 - name: choices sequence: string - name: question dtype: string - name: subject dtype: string splits: - name: train num_bytes: 161000625 num_examples: 99842 download_size: 47518592 dataset_size: 161000625 - config_name: business_ethics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 31619 dataset_size: 57303.3562203159 - config_name: clinical_knowledge features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 131489.4633955277 num_examples: 265 - name: validation num_bytes: 14461.813193990856 num_examples: 29 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 51655 dataset_size: 148150.45202811505 - config_name: college_biology features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 71450.87822247542 num_examples: 144 - name: validation num_bytes: 7978.931417374265 num_examples: 16 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 43017 dataset_size: 81628.98507844617 - config_name: college_chemistry features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - 
name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 3989.4657086871325 num_examples: 8 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 26781 dataset_size: 55807.30657955822 - config_name: college_computer_science features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 41132 dataset_size: 57303.3562203159 - config_name: college_mathematics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 26779 dataset_size: 57303.3562203159 - config_name: college_medicine features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 85840.29119783506 num_examples: 173 - name: validation num_bytes: 10971.030698889615 num_examples: 22 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 56303 dataset_size: 99010.49733532117 - config_name: college_physics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 50611.0387409201 num_examples: 102 - name: validation num_bytes: 5485.515349444808 
num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 29539 dataset_size: 58295.7295289614 - config_name: computer_security features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 30150 dataset_size: 57303.3562203159 - config_name: conceptual_physics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 116603.86376584532 num_examples: 235 - name: validation num_bytes: 12965.76355323318 num_examples: 26 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 34968 dataset_size: 131768.802757675 - config_name: econometrics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 56565.27859279305 num_examples: 114 - name: validation num_bytes: 5984.198563030699 num_examples: 12 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 36040 dataset_size: 64748.652594420244 - config_name: electrical_engineering features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 71947.06487679818 num_examples: 145 - name: validation num_bytes: 7978.931417374265 num_examples: 16 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 26746 dataset_size: 82125.17173276893 - config_name: elementary_mathematics features: - name: question 
dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 187558.555333998 num_examples: 378 - name: validation num_bytes: 20446.011757021555 num_examples: 41 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 54987 dataset_size: 210203.74252961605 - config_name: formal_logic features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 62519.518444666 num_examples: 126 - name: validation num_bytes: 6981.5649902024825 num_examples: 14 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 32884 dataset_size: 71700.25887346498 - config_name: global_facts features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 4986.8321358589155 num_examples: 10 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 19258 dataset_size: 56804.67300673001 - config_name: high_school_biology features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 153817.86284005127 num_examples: 310 - name: validation num_bytes: 15957.86283474853 num_examples: 32 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 78216 dataset_size: 171974.90111339628 - config_name: high_school_chemistry features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 100725.89082751745 
num_examples: 203 - name: validation num_bytes: 10971.030698889615 num_examples: 22 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 45799 dataset_size: 113896.09696500355 - config_name: high_school_computer_science features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 4488.148922273024 num_examples: 9 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 39072 dataset_size: 56305.989793144116 - config_name: high_school_european_history features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 81870.79796325309 num_examples: 165 - name: validation num_bytes: 8976.297844546049 num_examples: 18 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 196270 dataset_size: 93046.27124639563 - config_name: high_school_geography features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 98244.95755590372 num_examples: 198 - name: validation num_bytes: 10971.030698889615 num_examples: 22 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 38255 dataset_size: 111415.16369338983 - config_name: high_school_government_and_politics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 95764.02428428999 num_examples: 193 - name: validation num_bytes: 10472.347485303722 num_examples: 21 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 
download_size: 52963 dataset_size: 108435.5472081902 - config_name: high_school_macroeconomics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 193512.79518587096 num_examples: 390 - name: validation num_bytes: 21443.378184193338 num_examples: 43 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 68758 dataset_size: 217155.34880866078 - config_name: high_school_mathematics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 133970.39666714144 num_examples: 270 - name: validation num_bytes: 14461.813193990856 num_examples: 29 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 45210 dataset_size: 150631.38529972878 - config_name: high_school_microeconomics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 118092.42372881356 num_examples: 238 - name: validation num_bytes: 12965.76355323318 num_examples: 26 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 49885 dataset_size: 133257.36272064323 - config_name: high_school_physics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 74924.18480273466 num_examples: 151 - name: validation num_bytes: 8477.614630960157 num_examples: 17 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 45483 dataset_size: 85600.9748722913 - config_name: high_school_psychology features: - name: question dtype: string - name: subject dtype: string 
- name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 270421.7266058966 num_examples: 545 - name: validation num_bytes: 29920.992815153495 num_examples: 60 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 113158 dataset_size: 302541.8948596466 - config_name: high_school_statistics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 107176.31733371314 num_examples: 216 - name: validation num_bytes: 11469.713912475507 num_examples: 23 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 74924 dataset_size: 120845.20668478514 - config_name: high_school_us_history features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 101222.0774818402 num_examples: 204 - name: validation num_bytes: 10971.030698889615 num_examples: 22 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 200043 dataset_size: 114392.2836193263 - config_name: high_school_world_history features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 117596.23707449081 num_examples: 237 - name: validation num_bytes: 12965.76355323318 num_examples: 26 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 250302 dataset_size: 132761.17606632048 - config_name: human_aging features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 110649.62391397236 num_examples: 223 - 
name: validation num_bytes: 11469.713912475507 num_examples: 23 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 41196 dataset_size: 124318.51326504436 - config_name: human_sexuality features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 65000.451716279735 num_examples: 131 - name: validation num_bytes: 5984.198563030699 num_examples: 12 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 32533 dataset_size: 73183.82571790692 - config_name: international_law features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 60038.58517305227 num_examples: 121 - name: validation num_bytes: 6482.88177661659 num_examples: 13 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 41592 dataset_size: 68720.64238826535 - config_name: jurisprudence features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 53588.15866685657 num_examples: 108 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 33578 dataset_size: 61272.84945489787 - config_name: logical_fallacies features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 80878.4246546076 num_examples: 163 - name: validation num_bytes: 8976.297844546049 num_examples: 18 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 33669 dataset_size: 92053.89793775014 - config_name: 
machine_learning features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 55572.90528414756 num_examples: 112 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 31121 dataset_size: 63257.596072188855 - config_name: management features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 51107.225395242844 num_examples: 103 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 22828 dataset_size: 58791.91618328414 - config_name: marketing features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 116107.67711152257 num_examples: 234 - name: validation num_bytes: 12467.08033964729 num_examples: 25 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 49747 dataset_size: 130773.93288976635 - config_name: medical_genetics features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 25775 dataset_size: 57303.3562203159 - config_name: miscellaneous features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test 
num_bytes: 388514.15033471014 num_examples: 783 - name: validation num_bytes: 42886.756368386676 num_examples: 86 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 115097 dataset_size: 433600.08214169333 - config_name: moral_disputes features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 171680.58239567012 num_examples: 346 - name: validation num_bytes: 18949.96211626388 num_examples: 38 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 76043 dataset_size: 192829.71995053047 - config_name: moral_scenarios features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 444087.05561885773 num_examples: 895 - name: validation num_bytes: 49868.32135858916 num_examples: 100 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 109869 dataset_size: 496154.5524160434 - config_name: nutrition features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 151833.1162227603 num_examples: 306 - name: validation num_bytes: 16456.54604833442 num_examples: 33 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 69050 dataset_size: 170488.8377096912 - config_name: philosophy features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 154314.04949437402 num_examples: 311 - name: validation num_bytes: 16955.229261920314 num_examples: 34 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 61912 dataset_size: 
173468.45419489083 - config_name: prehistory features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 160764.47600056973 num_examples: 324 - name: validation num_bytes: 17453.912475506204 num_examples: 35 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 68826 dataset_size: 180417.5639146724 - config_name: professional_accounting features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 139924.6365190144 num_examples: 282 - name: validation num_bytes: 15459.179621162639 num_examples: 31 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 87297 dataset_size: 157582.99157877354 - config_name: professional_law features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 761150.3277310925 num_examples: 1534 - name: validation num_bytes: 84776.14630960157 num_examples: 170 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 1167828 dataset_size: 848125.6494792906 - config_name: professional_medicine features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 134962.7699757869 num_examples: 272 - name: validation num_bytes: 15459.179621162639 num_examples: 31 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 153242 dataset_size: 152621.12503554605 - config_name: professional_psychology features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer 
dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 303666.2324455206 num_examples: 612 - name: validation num_bytes: 34409.14173742652 num_examples: 69 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 159357 dataset_size: 340274.5496215436 - config_name: public_relations features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 54580.53197550207 num_examples: 110 - name: validation num_bytes: 5984.198563030699 num_examples: 12 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 31500 dataset_size: 62763.90597712925 - config_name: security_studies features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 121565.73030907278 num_examples: 245 - name: validation num_bytes: 13464.446766819072 num_examples: 27 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 140258 dataset_size: 137229.35251448833 - config_name: sociology features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 99733.51751887196 num_examples: 201 - name: validation num_bytes: 10971.030698889615 num_examples: 22 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 56480 dataset_size: 112903.72365635807 - config_name: us_foreign_policy features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 49618.6654322746 num_examples: 100 - name: validation num_bytes: 5485.515349444808 num_examples: 11 - name: dev 
num_bytes: 2199.1754385964914 num_examples: 5 download_size: 29027 dataset_size: 57303.3562203159 - config_name: virology features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 82366.98461757584 num_examples: 166 - name: validation num_bytes: 8976.297844546049 num_examples: 18 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 38229 dataset_size: 93542.45790071838 - config_name: world_religions features: - name: question dtype: string - name: subject dtype: string - name: choices sequence: string - name: answer dtype: class_label: names: '0': A '1': B '2': C '3': D splits: - name: test num_bytes: 84847.91788918957 num_examples: 171 - name: validation num_bytes: 9474.98105813194 num_examples: 19 - name: dev num_bytes: 2199.1754385964914 num_examples: 5 download_size: 27165 dataset_size: 96522.07438591801 configs: - config_name: abstract_algebra data_files: - split: test path: abstract_algebra/test-* - split: validation path: abstract_algebra/validation-* - split: dev path: abstract_algebra/dev-* - config_name: all data_files: - split: test path: all/test-* - split: validation path: all/validation-* - split: dev path: all/dev-* - split: auxiliary_train path: all/auxiliary_train-* - config_name: anatomy data_files: - split: test path: anatomy/test-* - split: validation path: anatomy/validation-* - split: dev path: anatomy/dev-* - config_name: astronomy data_files: - split: test path: astronomy/test-* - split: validation path: astronomy/validation-* - split: dev path: astronomy/dev-* - config_name: auxiliary_train data_files: - split: train path: auxiliary_train/train-* - config_name: business_ethics data_files: - split: test path: business_ethics/test-* - split: validation path: business_ethics/validation-* - split: dev path: business_ethics/dev-* - config_name: clinical_knowledge data_files: - 
split: test path: clinical_knowledge/test-* - split: validation path: clinical_knowledge/validation-* - split: dev path: clinical_knowledge/dev-* - config_name: college_biology data_files: - split: test path: college_biology/test-* - split: validation path: college_biology/validation-* - split: dev path: college_biology/dev-* - config_name: college_chemistry data_files: - split: test path: college_chemistry/test-* - split: validation path: college_chemistry/validation-* - split: dev path: college_chemistry/dev-* - config_name: college_computer_science data_files: - split: test path: college_computer_science/test-* - split: validation path: college_computer_science/validation-* - split: dev path: college_computer_science/dev-* - config_name: college_mathematics data_files: - split: test path: college_mathematics/test-* - split: validation path: college_mathematics/validation-* - split: dev path: college_mathematics/dev-* - config_name: college_medicine data_files: - split: test path: college_medicine/test-* - split: validation path: college_medicine/validation-* - split: dev path: college_medicine/dev-* - config_name: college_physics data_files: - split: test path: college_physics/test-* - split: validation path: college_physics/validation-* - split: dev path: college_physics/dev-* - config_name: computer_security data_files: - split: test path: computer_security/test-* - split: validation path: computer_security/validation-* - split: dev path: computer_security/dev-* - config_name: conceptual_physics data_files: - split: test path: conceptual_physics/test-* - split: validation path: conceptual_physics/validation-* - split: dev path: conceptual_physics/dev-* - config_name: econometrics data_files: - split: test path: econometrics/test-* - split: validation path: econometrics/validation-* - split: dev path: econometrics/dev-* - config_name: electrical_engineering data_files: - split: test path: electrical_engineering/test-* - split: validation path: 
electrical_engineering/validation-* - split: dev path: electrical_engineering/dev-* - config_name: elementary_mathematics data_files: - split: test path: elementary_mathematics/test-* - split: validation path: elementary_mathematics/validation-* - split: dev path: elementary_mathematics/dev-* - config_name: formal_logic data_files: - split: test path: formal_logic/test-* - split: validation path: formal_logic/validation-* - split: dev path: formal_logic/dev-* - config_name: global_facts data_files: - split: test path: global_facts/test-* - split: validation path: global_facts/validation-* - split: dev path: global_facts/dev-* - config_name: high_school_biology data_files: - split: test path: high_school_biology/test-* - split: validation path: high_school_biology/validation-* - split: dev path: high_school_biology/dev-* - config_name: high_school_chemistry data_files: - split: test path: high_school_chemistry/test-* - split: validation path: high_school_chemistry/validation-* - split: dev path: high_school_chemistry/dev-* - config_name: high_school_computer_science data_files: - split: test path: high_school_computer_science/test-* - split: validation path: high_school_computer_science/validation-* - split: dev path: high_school_computer_science/dev-* - config_name: high_school_european_history data_files: - split: test path: high_school_european_history/test-* - split: validation path: high_school_european_history/validation-* - split: dev path: high_school_european_history/dev-* - config_name: high_school_geography data_files: - split: test path: high_school_geography/test-* - split: validation path: high_school_geography/validation-* - split: dev path: high_school_geography/dev-* - config_name: high_school_government_and_politics data_files: - split: test path: high_school_government_and_politics/test-* - split: validation path: high_school_government_and_politics/validation-* - split: dev path: high_school_government_and_politics/dev-* - config_name: 
high_school_macroeconomics data_files: - split: test path: high_school_macroeconomics/test-* - split: validation path: high_school_macroeconomics/validation-* - split: dev path: high_school_macroeconomics/dev-* - config_name: high_school_mathematics data_files: - split: test path: high_school_mathematics/test-* - split: validation path: high_school_mathematics/validation-* - split: dev path: high_school_mathematics/dev-* - config_name: high_school_microeconomics data_files: - split: test path: high_school_microeconomics/test-* - split: validation path: high_school_microeconomics/validation-* - split: dev path: high_school_microeconomics/dev-* - config_name: high_school_physics data_files: - split: test path: high_school_physics/test-* - split: validation path: high_school_physics/validation-* - split: dev path: high_school_physics/dev-* - config_name: high_school_psychology data_files: - split: test path: high_school_psychology/test-* - split: validation path: high_school_psychology/validation-* - split: dev path: high_school_psychology/dev-* - config_name: high_school_statistics data_files: - split: test path: high_school_statistics/test-* - split: validation path: high_school_statistics/validation-* - split: dev path: high_school_statistics/dev-* - config_name: high_school_us_history data_files: - split: test path: high_school_us_history/test-* - split: validation path: high_school_us_history/validation-* - split: dev path: high_school_us_history/dev-* - config_name: high_school_world_history data_files: - split: test path: high_school_world_history/test-* - split: validation path: high_school_world_history/validation-* - split: dev path: high_school_world_history/dev-* - config_name: human_aging data_files: - split: test path: human_aging/test-* - split: validation path: human_aging/validation-* - split: dev path: human_aging/dev-* - config_name: human_sexuality data_files: - split: test path: human_sexuality/test-* - split: validation path: 
human_sexuality/validation-* - split: dev path: human_sexuality/dev-* - config_name: international_law data_files: - split: test path: international_law/test-* - split: validation path: international_law/validation-* - split: dev path: international_law/dev-* - config_name: jurisprudence data_files: - split: test path: jurisprudence/test-* - split: validation path: jurisprudence/validation-* - split: dev path: jurisprudence/dev-* - config_name: logical_fallacies data_files: - split: test path: logical_fallacies/test-* - split: validation path: logical_fallacies/validation-* - split: dev path: logical_fallacies/dev-* - config_name: machine_learning data_files: - split: test path: machine_learning/test-* - split: validation path: machine_learning/validation-* - split: dev path: machine_learning/dev-* - config_name: management data_files: - split: test path: management/test-* - split: validation path: management/validation-* - split: dev path: management/dev-* - config_name: marketing data_files: - split: test path: marketing/test-* - split: validation path: marketing/validation-* - split: dev path: marketing/dev-* - config_name: medical_genetics data_files: - split: test path: medical_genetics/test-* - split: validation path: medical_genetics/validation-* - split: dev path: medical_genetics/dev-* - config_name: miscellaneous data_files: - split: test path: miscellaneous/test-* - split: validation path: miscellaneous/validation-* - split: dev path: miscellaneous/dev-* - config_name: moral_disputes data_files: - split: test path: moral_disputes/test-* - split: validation path: moral_disputes/validation-* - split: dev path: moral_disputes/dev-* - config_name: moral_scenarios data_files: - split: test path: moral_scenarios/test-* - split: validation path: moral_scenarios/validation-* - split: dev path: moral_scenarios/dev-* - config_name: nutrition data_files: - split: test path: nutrition/test-* - split: validation path: nutrition/validation-* - split: dev path: 
nutrition/dev-* - config_name: philosophy data_files: - split: test path: philosophy/test-* - split: validation path: philosophy/validation-* - split: dev path: philosophy/dev-* - config_name: prehistory data_files: - split: test path: prehistory/test-* - split: validation path: prehistory/validation-* - split: dev path: prehistory/dev-* - config_name: professional_accounting data_files: - split: test path: professional_accounting/test-* - split: validation path: professional_accounting/validation-* - split: dev path: professional_accounting/dev-* - config_name: professional_law data_files: - split: test path: professional_law/test-* - split: validation path: professional_law/validation-* - split: dev path: professional_law/dev-* - config_name: professional_medicine data_files: - split: test path: professional_medicine/test-* - split: validation path: professional_medicine/validation-* - split: dev path: professional_medicine/dev-* - config_name: professional_psychology data_files: - split: test path: professional_psychology/test-* - split: validation path: professional_psychology/validation-* - split: dev path: professional_psychology/dev-* - config_name: public_relations data_files: - split: test path: public_relations/test-* - split: validation path: public_relations/validation-* - split: dev path: public_relations/dev-* - config_name: security_studies data_files: - split: test path: security_studies/test-* - split: validation path: security_studies/validation-* - split: dev path: security_studies/dev-* - config_name: sociology data_files: - split: test path: sociology/test-* - split: validation path: sociology/validation-* - split: dev path: sociology/dev-* - config_name: us_foreign_policy data_files: - split: test path: us_foreign_policy/test-* - split: validation path: us_foreign_policy/validation-* - split: dev path: us_foreign_policy/dev-* - config_name: virology data_files: - split: test path: virology/test-* - split: validation path: virology/validation-* 
- split: dev path: virology/dev-* - config_name: world_religions data_files: - split: test path: world_religions/test-* - split: validation path: world_religions/validation-* - split: dev path: world_religions/dev-*
---

# Dataset Card for MMLU

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository**: https://github.com/hendrycks/test
- **Paper**: https://arxiv.org/abs/2009.03300

### Dataset Summary

[Measuring Massive Multitask Language Understanding](https://arxiv.org/pdf/2009.03300) by [Dan Hendrycks](https://people.eecs.berkeley.edu/~hendrycks/), [Collin Burns](http://collinpburns.com), [Steven Basart](https://stevenbas.art), Andy Zou, Mantas Mazeika, [Dawn Song](https://people.eecs.berkeley.edu/~dawnsong/), and [Jacob Steinhardt](https://www.stat.berkeley.edu/~jsteinhardt/) (ICLR 2021).

This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. It covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability.

A complete list of tasks: ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']

### Supported Tasks and Leaderboards

| Model | Authors | Humanities | Social Science | STEM | Other | Average |
|------------------------------------|----------|:-------:|:-------:|:-------:|:-------:|:-------:|
| [UnifiedQA](https://arxiv.org/abs/2005.00700) | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
| [GPT-3](https://arxiv.org/abs/2005.14165) (few-shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
| [GPT-2](https://arxiv.org/abs/2005.14165) | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
| Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |

### Languages

English

## Dataset Structure

### Data Instances

An example from the anatomy subtask looks as follows:

```
{
  "question": "What is the embryological origin of the hyoid bone?",
  "choices": ["The first pharyngeal arch", "The first and second pharyngeal arches", "The second pharyngeal arch", "The second and third pharyngeal arches"],
  "answer": "D"
}
```

### Data Fields

- `question`: a string feature
- `subject`: a string feature
- `choices`: a list of 4 string features
- `answer`: a ClassLabel feature

### Data Splits

- `auxiliary_train`: auxiliary multiple-choice training questions from ARC, MC_TEST, OBQA, RACE, etc.
- `dev`: 5 examples per subtask, meant for the few-shot setting
- `test`: at least 100 examples per subtask

| | auxiliary_train | dev | val | test |
| ----- | :------: | :-----: | :-----: | :-----: |
| TOTAL | 99842 | 285 | 1531 | 14042 |

## Dataset Creation

### Curation Rationale

Transformer models have driven recent progress in NLP by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and numerous websites. These models consequently see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. To bridge the gap between the wide-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[MIT License](https://github.com/hendrycks/test/blob/master/LICENSE)

### Citation Information

If you find this useful in your research, please consider citing the test and also the [ETHICS](https://arxiv.org/abs/2008.02275) dataset it draws from:

```
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
```

### Contributions

Thanks to [@andyzoujm](https://github.com/andyzoujm) for adding this dataset.

### Dataset Metadata

- Annotation creators: no-annotation
- Language creators: expert-generated
- Language: English (en)
- License: MIT (mit)
- Multilinguality: monolingual
- Size category: 10K < n < 100K
- Source datasets: original
- Task category: question-answering
- Task ID: multiple-choice-qa
- PapersWithCode ID: mmlu
- Pretty name: Measuring Massive Multitask Language Understanding
- Language BCP 47 tag: en-US

## Dataset Configuration Details

The dataset contains multiple subject-specific task configurations that share the following structure:

- Configuration name: the subject's config name
- Features:
  - `question`: string; the question text
  - `subject`: string; the subject the question belongs to
  - `choices`: sequence of strings; the 4 candidate options
  - `answer`: class label with the mapping '0': A, '1': B, '2': C, '3': D
- Splits:
  - `test`: evaluation examples for the subject
  - `validation`: examples for model validation
  - `dev`: 5 examples per subtask, for the few-shot setting
- Download size: the download size of the configuration
- Dataset size: the total data size of the configuration

The full list of configurations corresponds to the 57 subject tasks listed above.

Provider:

cais

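Summary of Original Information

Dataset Overview

Basic Information

Language: English (en)
License: MIT
Multilinguality: monolingual
Size category: 10K<n<100K
Source datasets: original
Task category: question answering
Task ID: multiple-choice question answering (multiple-choice-qa)
PapersWithCode ID: mmlu
Pretty name: Measuring Massive Multitask Language Understanding

Dataset Structure

Features

question: string
subject: string
choices: sequence of strings
answer: class label with the names A, B, C, D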
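
As a minimal sketch of the record layout above, the following Python models one example and converts the `answer` class label between its integer index and its letter name. The `MMLUExample` class and the `letter_to_index` helper are hypothetical names introduced only for illustration; the field names and the 0-3 to A-D mapping come from the feature list, and the sample values reuse the anatomy example from the dataset card.

```python
from dataclasses import dataclass
from typing import List

# Class-label names in index order, per the feature list ('0': A ... '3': D).
ANSWER_NAMES = ["A", "B", "C", "D"]


@dataclass
class MMLUExample:
    """Hypothetical container mirroring the four listed features."""

    question: str
    subject: str
    choices: List[str]
    answer: int  # class-label index into ANSWER_NAMES

    def __post_init__(self) -> None:
        if len(self.choices) != len(ANSWER_NAMES):
            raise ValueError("expected exactly 4 choices")
        if not 0 <= self.answer < len(ANSWER_NAMES):
            raise ValueError("answer index must be in 0..3")

    @property
    def answer_letter(self) -> str:
        """Letter name of the correct choice."""
        return ANSWER_NAMES[self.answer]

    @property
    def answer_text(self) -> str:
        """Text of the correct choice."""
        return self.choices[self.answer]


def letter_to_index(letter: str) -> int:
    """Map 'A'..'D' (any case, surrounding spaces ignored) to 0..3."""
    return ANSWER_NAMES.index(letter.strip().upper())


example = MMLUExample(
    question="What is the embryological origin of the hyoid bone?",
    subject="anatomy",
    choices=[
        "The first pharyngeal arch",
        "The first and second pharyngeal arches",
        "The second pharyngeal arch",
        "The second and third pharyngeal arches",
    ],
    answer=letter_to_index("D"),
)
```

Storing the index and deriving the letter keeps a single source of truth for the label, which is one reasonable design choice among several.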

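Splits

Test set (test): example and byte counts vary by configuration
Validation set (validation): example and byte counts vary by configuration
Dev set (dev): example and byte counts vary by configuration
Auxiliary training set (auxiliary_train): example and byte counts vary by configuration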
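
The dataset card describes the dev split as 5 examples per subtask meant for the few-shot setting. A minimal sketch of how such examples might be turned into a k-shot prompt follows. The prompt wording and the function names `format_example` and `build_few_shot_prompt` are assumptions for illustration, not a format prescribed by the dataset, and the records are toy data.

```python
from typing import Dict, Sequence

LETTERS = "ABCD"


def format_example(record: Dict, include_answer: bool) -> str:
    """Render one record as a question, lettered choices, and an answer slot."""
    lines = [record["question"]]
    for letter, choice in zip(LETTERS, record["choices"]):
        lines.append(f"{letter}. {choice}")
    answer = f" {LETTERS[record['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_few_shot_prompt(dev: Sequence[Dict], target: Dict, k: int = 5) -> str:
    """Prefix the unanswered target question with up to k solved dev examples."""
    shots = [format_example(r, include_answer=True) for r in dev[:k]]
    return "\n\n".join(shots + [format_example(target, include_answer=False)])


# Toy records (made up for this sketch, not taken from the dataset).
dev_records = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "3 * 3 = ?", "choices": ["6", "7", "8", "9"], "answer": 3},
]
target_record = {"question": "5 - 2 = ?", "choices": ["1", "2", "3", "4"], "answer": 2}

prompt = build_few_shot_prompt(dev_records, target_record, k=2)
```

The prompt ends with an empty `Answer:` slot, so a model's next token can be compared against the gold letter.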

Dataset Size

Download size: varies by configuration
Dataset size: varies by configuration

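Configuration Details

Configuration: abstract_algebra

Test set: 100 examples, 49618.6654322746 bytes
Validation set: 11 examples, 5485.515349444808 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 17143 bytes
Dataset size: 57303.3562203159 bytes

Configuration: all

Test set: 14042 examples, 6967453 bytes
Validation set: 1531 examples, 763484 bytes
Dev set: 285 examples, 125353 bytes
Auxiliary training set: 99842 examples, 161000625 bytes
Download size: 51503402 bytes
Dataset size: 168856915 bytes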
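
The figures above can be cross-checked with simple arithmetic. The sketch below verifies that a configuration's dataset size equals the sum of its split sizes. It also illustrates an observation that is an inference from the numbers, not something the card states: the fractional per-subject byte counts match a proportional share of the corresponding split in the all configuration (examples in the subject × total bytes / total examples). The dictionary layout and function names are hypothetical.

```python
import math

# (num_examples, num_bytes) per split, copied from the listing above.
ALL = {
    "test": (14042, 6967453),
    "validation": (1531, 763484),
    "dev": (285, 125353),
    "auxiliary_train": (99842, 161000625),
}
ABSTRACT_ALGEBRA = {
    "test": (100, 49618.6654322746),
    "validation": (11, 5485.515349444808),
    "dev": (5, 2199.1754385964914),
}


def total_bytes(splits):
    """Sum the byte counts of every split in a configuration."""
    return sum(num_bytes for _, num_bytes in splits.values())


def proportional_bytes(split, num_examples):
    """Estimate a subject's bytes as its share of the `all` split by example count."""
    all_examples, all_bytes = ALL[split]
    return num_examples * all_bytes / all_examples


all_size = total_bytes(ALL)                   # 168856915, the listed dataset size
algebra_size = total_bytes(ABSTRACT_ALGEBRA)  # about 57303.356

# True when every listed abstract_algebra byte count matches the proportional estimate.
matches = all(
    math.isclose(proportional_bytes(split, n), listed, rel_tol=1e-9)
    for split, (n, listed) in ABSTRACT_ALGEBRA.items()
)
```

If the inference holds, the per-subject byte counts are estimates rather than measured file sizes, which would explain why they are not whole numbers.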

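Configuration: anatomy

Test set: 135 examples, 66985.19833357072 bytes
Validation set: 14 examples, 6981.5649902024825 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 28864 bytes
Dataset size: 76165.9387623697 bytes

Configuration: astronomy

Test set: 152 examples, 75420.3714570574 bytes
Validation set: 16 examples, 7978.931417374265 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 39316 bytes
Dataset size: 85598.47831302814 bytes

Configuration: auxiliary_train

Training set: 99842 examples, 161000625 bytes
Download size: 47518592 bytes
Dataset size: 161000625 bytes

Configuration: business_ethics

Test set: 100 examples, 49618.6654322746 bytes
Validation set: 11 examples, 5485.515349444808 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 31619 bytes
Dataset size: 57303.3562203159 bytes

Configuration: clinical_knowledge

Test set: 265 examples, 131489.4633955277 bytes
Validation set: 29 examples, 14461.813193990856 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 51655 bytes
Dataset size: 148150.45202811505 bytes

Configuration: college_biology

Test set: 144 examples, 71450.87822247542 bytes
Validation set: 16 examples, 7978.931417374265 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 43017 bytes
Dataset size: 81628.98507844617 bytes

Configuration: college_chemistry

Test set: 100 examples, 49618.6654322746 bytes
Validation set: 8 examples, 3989.4657086871325 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 26781 bytes
Dataset size: 55807.30657955822 bytes

Configuration: college_computer_science

Test set: 100 examples, 49618.6654322746 bytes
Validation set: 11 examples, 5485.515349444808 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 41132 bytes
Dataset size: 57303.3562203159 bytes

Configuration: college_mathematics

Test set: 100 examples, 49618.6654322746 bytes
Validation set: 11 examples, 5485.515349444808 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 26779 bytes
Dataset size: 57303.3562203159 bytes

Configuration: college_medicine

Test set: 173 examples, 85840.29119783506 bytes
Validation set: 22 examples, 10971.030698889615 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 56303 bytes
Dataset size: 99010.49733532117 bytes

Configuration: college_physics

Test set: 102 examples, 50611.0387409201 bytes
Validation set: 11 examples, 5485.515349444808 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 29539 bytes
Dataset size: 58295.7295289614 bytes

Configuration: computer_security

Test set: 100 examples, 49618.6654322746 bytes
Validation set: 11 examples, 5485.515349444808 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 30150 bytes
Dataset size: 57303.3562203159 bytes

Configuration: conceptual_physics

Test set: 235 examples, 116603.86376584532 bytes
Validation set: 26 examples, 12965.76355323318 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 34968 bytes
Dataset size: 131768.802757675 bytes

Configuration: econometrics

Test set: 114 examples, 56565.27859279305 bytes
Validation set: 12 examples, 5984.198563030699 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 36040 bytes
Dataset size: 64748.652594420244 bytes

Configuration: electrical_engineering

Test set: 145 examples, 71947.06487679818 bytes
Validation set: 16 examples, 7978.931417374265 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 26746 bytes
Dataset size: 82125.17173276893 bytes

Configuration: elementary_mathematics

Test set: 378 examples, 187558.555333998 bytes
Validation set: 41 examples, 20446.011757021555 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 54987 bytes
Dataset size: 210203.74252961605 bytes

Configuration: formal_logic

Test set: 126 examples, 62519.518444666 bytes
Validation set: 14 examples, 6981.5649902024825 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 32884 bytes
Dataset size: 71700.25887346498 bytes

Configuration: global_facts

Test set: 100 examples, 49618.6654322746 bytes
Validation set: 10 examples, 4986.8321358589155 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 19258 bytes
Dataset size: 56804.67300673001 bytes

Configuration: high_school_biology

Test set: 310 examples, 153817.86284005127 bytes
Validation set: 32 examples, 15957.86283474853 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 78216 bytes
Dataset size: 171974.90111339628 bytes

Configuration: high_school_chemistry

Test set: 203 examples, 100725.89082751745 bytes
Validation set: 22 examples, 10971.030698889615 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 45799 bytes
Dataset size: 113896.09696500355 bytes

Configuration: high_school_computer_science

Test set: 100 examples, 49618.6654322746 bytes
Validation set: 9 examples, 4488.148922273024 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 39072 bytes
Dataset size: 56305.989793144116 bytes

Configuration: high_school_european_history

Test set: 165 examples, 81870.79796325309 bytes
Validation set: 18 examples, 8976.297844546049 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 196270 bytes
Dataset size: 93046.27124639563 bytes

Configuration: high_school_geography

Test set: 198 examples, 98244.95755590372 bytes
Validation set: 22 examples, 10971.030698889615 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 38255 bytes
Dataset size: 111415.16369338983 bytes

Configuration: high_school_government_and_politics

Test set: 193 examples, 95764.02428428999 bytes
Validation set: 21 examples, 10472.347485303722 bytes
Dev set: 5 examples, 2199.1754385964914 bytes
Download size: 52963 bytes
Dataset size: 108435.5472081902 bytes

Configuration: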

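Collected Summary

Dataset Introduction

Construction

MMLU was built to evaluate massive multitask language understanding, and its questions are expert-generated. It spans many fields, such as mathematics, biology, and physics. Each field consists of multiple-choice questions, and each question carries a subject, four choices, and one correct answer. The data is divided into an auxiliary training split of 99842 examples, a validation split of 1531 examples, a test split of 14042 examples, and a dev split of 285 examples. The questions are intended to be diverse and difficult enough to assess a language model's multitask understanding comprehensively.

Characteristics

MMLU's distinguishing feature is its broad subject coverage, which gives multitask language understanding a rich set of test scenarios. Each question consists of question text and four candidate answers, so a model must understand both the question and the options. The dataset is moderate in size: it includes a large auxiliary training set alongside enough test data to train and evaluate models across many tasks. It is released under the MIT License, which permits free use and modification.

Usage

Using MMLU is relatively straightforward. Users can download the dataset and use its Python interface for data processing and model training. It provides several splits, including training, validation, and test sets, for evaluating and debugging models. Each question also carries a subject, so users can partition tasks and train models by subject. Note that the download and on-disk sizes are not negligible: the all configuration lists a download size of 51503402 bytes and a dataset size of 168856915 bytes, so make sure enough storage and compute are available.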
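
As one hedged illustration of the workflow described above: the dataset is hosted on Hugging Face as `cais/mmlu`, so a common way to load it is the `load_dataset` function from the `datasets` library. That library must be installed separately and the call needs network access, so here it is wrapped in a function and not executed. The `load_mmlu_split` and `group_by_subject` helpers are hypothetical names, and the grouping is demonstrated on in-memory toy records.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_mmlu_split(config: str = "all", split: str = "test"):
    """Load one split of cais/mmlu; requires the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the rest runs without it

    return load_dataset("cais/mmlu", config, split=split)


def group_by_subject(records: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Partition records into per-subject lists keyed by the `subject` field."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for record in records:
        groups[record["subject"]].append(record)
    return dict(groups)


# Toy records standing in for rows of the `all` configuration (contents made up).
toy_records = [
    {"question": "q1", "subject": "anatomy", "choices": ["a", "b", "c", "d"], "answer": 0},
    {"question": "q2", "subject": "virology", "choices": ["a", "b", "c", "d"], "answer": 2},
    {"question": "q3", "subject": "anatomy", "choices": ["a", "b", "c", "d"], "answer": 3},
]
by_subject = group_by_subject(toy_records)
```

With the library available, `group_by_subject(load_mmlu_split("all", "test"))` would apply the same grouping to the real test split, since iterating a loaded split yields one dictionary per row.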

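Background and Challenges

Background Overview

Language understanding has long been a focus of research in artificial intelligence and natural language processing. As machine learning has advanced, multitask language understanding (MTLU) has become a new research direction. The MMLU dataset (Measuring Massive Multitask Language Understanding) was created against this background to evaluate and improve machine performance across many language understanding tasks. The dataset is provided by CAIS (Center for AI Safety), was introduced in the paper by Dan Hendrycks and colleagues (ICLR 2021), and covers a wide range of subjects such as mathematics, science, and history. MMLU gives researchers a broad platform for evaluating multitask language understanding and has had a substantial influence on the development of natural language processing.

Current Challenges

Building MMLU posed several challenges. First, ensuring diversity and coverage is a key problem: because the dataset spans many subjects, collecting and curating high-quality, representative questions and answers is especially difficult. Second, the dataset must be balanced so that models are trained and evaluated fairly on every task. Finally, there is the question of how to measure multitask language understanding effectively. Conventional metrics may not fully capture model performance on complex tasks, so new evaluation methods need to be explored and developed.

Common Scenarios

Classic Use Cases

In natural language processing, the Massive Multitask Language Understanding (MMLU) dataset has become an important benchmark for measuring how well models generalize. It covers a wide range of subjects, including mathematics, physics, chemistry, history, and economics, and each subject contains many questions with answers in multiple-choice form. The classic use of MMLU is as a benchmark for a language model's understanding and reasoning within specific subjects, helping researchers assess how well a model has mastered knowledge in each one.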
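
A minimal sketch of the benchmark use described above: score predictions per subject, then compare each score against the 25% random baseline from the leaderboard table (one correct choice out of four). The function name, the unweighted averaging across subjects, and the toy data are assumptions for illustration; the card does not specify how its category averages were aggregated.

```python
from collections import defaultdict
from typing import Dict, Sequence

RANDOM_BASELINE = 1 / 4  # four choices per question, so chance accuracy is 25%


def per_subject_accuracy(
    subjects: Sequence[str], gold: Sequence[int], predicted: Sequence[int]
) -> Dict[str, float]:
    """Return the fraction of correct predictions for each subject."""
    if not (len(subjects) == len(gold) == len(predicted)):
        raise ValueError("inputs must have equal length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for subject, g, p in zip(subjects, gold, predicted):
        total[subject] += 1
        correct[subject] += int(g == p)
    return {s: correct[s] / total[s] for s in total}


# Toy labels and predictions (made up), as class-label indices 0..3.
subjects = ["anatomy", "anatomy", "virology", "virology", "virology", "virology"]
gold = [3, 1, 0, 2, 2, 1]
predicted = [3, 0, 0, 2, 1, 1]

accuracy = per_subject_accuracy(subjects, gold, predicted)
# Unweighted mean over subjects; a question-weighted mean would differ
# whenever subjects have different numbers of questions.
macro_average = sum(accuracy.values()) / len(accuracy)
beats_chance = {s: a > RANDOM_BASELINE for s, a in accuracy.items()}
```

Reporting per-subject scores alongside an average makes it visible when a model is near chance on some subjects while strong on others.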

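Practical Applications

In practice, MMLU can help educational institutions assess the knowledge level of students or machine learning models across subjects. Comparing a model's MMLU results with those of human experts shows how well the model understands and reasons within a given subject, which can give institutions useful teaching feedback and suggestions for improvement.

Derived and Related Work

Researchers have built a range of work on MMLU, such as developing language models for specific subjects and studying how well models transfer knowledge across subjects. This work has advanced multitask language models and opened new applications in areas such as education and scientific research.

The content above was collected and summarized by 遇见数据集.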
