open-llm-leaderboard-old/details_microsoft__phi-2
Hugging Face · Updated 2024-04-15 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_microsoft__phi-2
Resource description:
---
pretty_name: Evaluation run of microsoft/phi-2
dataset_summary: "Dataset automatically created during the evaluation run of model\
\ [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\
\nThe dataset is composed of 63 configurations, each one corresponding to one of the\
\ evaluated tasks.\n\nThe dataset has been created from 2 run(s). Each run can be\
\ found as a specific split in each configuration, the split being named using the\
\ timestamp of the run. The \"latest\" split is always pointing to the latest results.\n\
\nAn additional configuration \"results\" stores all the aggregated results of the\
\ run (and is used to compute and display the aggregated metrics on the [Open LLM\
\ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\
\nTo load the details from a run, you can for instance do the following:\n```python\n\
from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_microsoft__phi-2\"\
,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\
These are the [latest results from run 2024-04-15T16:12:26.100927](https://huggingface.co/datasets/open-llm-leaderboard/details_microsoft__phi-2/blob/main/results_2024-04-15T16-12-26.100927.json) (note\
\ that there might be results for other tasks in the repo if successive evals didn't\
\ cover the same tasks. You can find each in the results and the \"latest\" split for\
\ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.5810413550660194,\n\
\ \"acc_stderr\": 0.033772948029595365,\n \"acc_norm\": 0.5826063358809028,\n\
\ \"acc_norm_stderr\": 0.03446197582267999,\n \"mc1\": 0.30966952264381886,\n\
\ \"mc1_stderr\": 0.016185744355144912,\n \"mc2\": 0.4423687837225679,\n\
\ \"mc2_stderr\": 0.015079580060665993\n },\n \"harness|arc:challenge|25\"\
: {\n \"acc\": 0.5827645051194539,\n \"acc_stderr\": 0.01440982551840308,\n\
\ \"acc_norm\": 0.6100682593856656,\n \"acc_norm_stderr\": 0.014252959848892896\n\
\ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.5617406891057558,\n\
\ \"acc_stderr\": 0.004951594063272055,\n \"acc_norm\": 0.7491535550687114,\n\
\ \"acc_norm_stderr\": 0.004326143430360092\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\
: {\n \"acc\": 0.29,\n \"acc_stderr\": 0.045604802157206845,\n \
\ \"acc_norm\": 0.29,\n \"acc_norm_stderr\": 0.045604802157206845\n \
\ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.4444444444444444,\n\
\ \"acc_stderr\": 0.04292596718256981,\n \"acc_norm\": 0.4444444444444444,\n\
\ \"acc_norm_stderr\": 0.04292596718256981\n },\n \"harness|hendrycksTest-astronomy|5\"\
: {\n \"acc\": 0.5855263157894737,\n \"acc_stderr\": 0.040089737857792046,\n\
\ \"acc_norm\": 0.5855263157894737,\n \"acc_norm_stderr\": 0.040089737857792046\n\
\ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.56,\n\
\ \"acc_stderr\": 0.04988876515698589,\n \"acc_norm\": 0.56,\n \
\ \"acc_norm_stderr\": 0.04988876515698589\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\
: {\n \"acc\": 0.6037735849056604,\n \"acc_stderr\": 0.030102793781791197,\n\
\ \"acc_norm\": 0.6037735849056604,\n \"acc_norm_stderr\": 0.030102793781791197\n\
\ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.6666666666666666,\n\
\ \"acc_stderr\": 0.03942082639927213,\n \"acc_norm\": 0.6666666666666666,\n\
\ \"acc_norm_stderr\": 0.03942082639927213\n },\n \"harness|hendrycksTest-college_chemistry|5\"\
: {\n \"acc\": 0.4,\n \"acc_stderr\": 0.04923659639173309,\n \
\ \"acc_norm\": 0.4,\n \"acc_norm_stderr\": 0.04923659639173309\n },\n\
\ \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\": 0.41,\n\
\ \"acc_stderr\": 0.049431107042371025,\n \"acc_norm\": 0.41,\n \
\ \"acc_norm_stderr\": 0.049431107042371025\n },\n \"harness|hendrycksTest-college_mathematics|5\"\
: {\n \"acc\": 0.38,\n \"acc_stderr\": 0.048783173121456344,\n \
\ \"acc_norm\": 0.38,\n \"acc_norm_stderr\": 0.048783173121456344\n \
\ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.5895953757225434,\n\
\ \"acc_stderr\": 0.03750757044895537,\n \"acc_norm\": 0.5895953757225434,\n\
\ \"acc_norm_stderr\": 0.03750757044895537\n },\n \"harness|hendrycksTest-college_physics|5\"\
: {\n \"acc\": 0.37254901960784315,\n \"acc_stderr\": 0.048108401480826346,\n\
\ \"acc_norm\": 0.37254901960784315,\n \"acc_norm_stderr\": 0.048108401480826346\n\
\ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\
\ 0.74,\n \"acc_stderr\": 0.04408440022768078,\n \"acc_norm\": 0.74,\n\
\ \"acc_norm_stderr\": 0.04408440022768078\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\
: {\n \"acc\": 0.5234042553191489,\n \"acc_stderr\": 0.032650194750335815,\n\
\ \"acc_norm\": 0.5234042553191489,\n \"acc_norm_stderr\": 0.032650194750335815\n\
\ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.38596491228070173,\n\
\ \"acc_stderr\": 0.04579639422070434,\n \"acc_norm\": 0.38596491228070173,\n\
\ \"acc_norm_stderr\": 0.04579639422070434\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\
: {\n \"acc\": 0.5517241379310345,\n \"acc_stderr\": 0.04144311810878152,\n\
\ \"acc_norm\": 0.5517241379310345,\n \"acc_norm_stderr\": 0.04144311810878152\n\
\ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\
: 0.4603174603174603,\n \"acc_stderr\": 0.025670080636909186,\n \"\
acc_norm\": 0.4603174603174603,\n \"acc_norm_stderr\": 0.025670080636909186\n\
\ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.35714285714285715,\n\
\ \"acc_stderr\": 0.04285714285714281,\n \"acc_norm\": 0.35714285714285715,\n\
\ \"acc_norm_stderr\": 0.04285714285714281\n },\n \"harness|hendrycksTest-global_facts|5\"\
: {\n \"acc\": 0.38,\n \"acc_stderr\": 0.048783173121456316,\n \
\ \"acc_norm\": 0.38,\n \"acc_norm_stderr\": 0.048783173121456316\n \
\ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\
: 0.6709677419354839,\n \"acc_stderr\": 0.026729499068349958,\n \"\
acc_norm\": 0.6709677419354839,\n \"acc_norm_stderr\": 0.026729499068349958\n\
\ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\
: 0.4729064039408867,\n \"acc_stderr\": 0.03512819077876106,\n \"\
acc_norm\": 0.4729064039408867,\n \"acc_norm_stderr\": 0.03512819077876106\n\
\ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \
\ \"acc\": 0.64,\n \"acc_stderr\": 0.048241815132442176,\n \"acc_norm\"\
: 0.64,\n \"acc_norm_stderr\": 0.048241815132442176\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\
: {\n \"acc\": 0.6484848484848484,\n \"acc_stderr\": 0.0372820699868265,\n\
\ \"acc_norm\": 0.6484848484848484,\n \"acc_norm_stderr\": 0.0372820699868265\n\
\ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\
: 0.7525252525252525,\n \"acc_stderr\": 0.030746300742124498,\n \"\
acc_norm\": 0.7525252525252525,\n \"acc_norm_stderr\": 0.030746300742124498\n\
\ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\
\ \"acc\": 0.8082901554404145,\n \"acc_stderr\": 0.028408953626245282,\n\
\ \"acc_norm\": 0.8082901554404145,\n \"acc_norm_stderr\": 0.028408953626245282\n\
\ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \
\ \"acc\": 0.5820512820512821,\n \"acc_stderr\": 0.02500732988246122,\n \
\ \"acc_norm\": 0.5820512820512821,\n \"acc_norm_stderr\": 0.02500732988246122\n\
\ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\
acc\": 0.3296296296296296,\n \"acc_stderr\": 0.02866120111652458,\n \
\ \"acc_norm\": 0.3296296296296296,\n \"acc_norm_stderr\": 0.02866120111652458\n\
\ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \
\ \"acc\": 0.6218487394957983,\n \"acc_stderr\": 0.031499305777849054,\n\
\ \"acc_norm\": 0.6218487394957983,\n \"acc_norm_stderr\": 0.031499305777849054\n\
\ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\
: 0.3841059602649007,\n \"acc_stderr\": 0.03971301814719197,\n \"\
acc_norm\": 0.3841059602649007,\n \"acc_norm_stderr\": 0.03971301814719197\n\
\ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\
: 0.7926605504587156,\n \"acc_stderr\": 0.017381415563608674,\n \"\
acc_norm\": 0.7926605504587156,\n \"acc_norm_stderr\": 0.017381415563608674\n\
\ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\
: 0.4722222222222222,\n \"acc_stderr\": 0.0340470532865388,\n \"acc_norm\"\
: 0.4722222222222222,\n \"acc_norm_stderr\": 0.0340470532865388\n },\n\
\ \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\": 0.6666666666666666,\n\
\ \"acc_stderr\": 0.03308611113236436,\n \"acc_norm\": 0.6666666666666666,\n\
\ \"acc_norm_stderr\": 0.03308611113236436\n },\n \"harness|hendrycksTest-high_school_world_history|5\"\
: {\n \"acc\": 0.7426160337552743,\n \"acc_stderr\": 0.02845882099146029,\n\
\ \"acc_norm\": 0.7426160337552743,\n \"acc_norm_stderr\": 0.02845882099146029\n\
\ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6591928251121076,\n\
\ \"acc_stderr\": 0.03181149747055359,\n \"acc_norm\": 0.6591928251121076,\n\
\ \"acc_norm_stderr\": 0.03181149747055359\n },\n \"harness|hendrycksTest-human_sexuality|5\"\
: {\n \"acc\": 0.7022900763358778,\n \"acc_stderr\": 0.040103589424622034,\n\
\ \"acc_norm\": 0.7022900763358778,\n \"acc_norm_stderr\": 0.040103589424622034\n\
\ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\
\ 0.7272727272727273,\n \"acc_stderr\": 0.04065578140908705,\n \"\
acc_norm\": 0.7272727272727273,\n \"acc_norm_stderr\": 0.04065578140908705\n\
\ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7314814814814815,\n\
\ \"acc_stderr\": 0.042844679680521934,\n \"acc_norm\": 0.7314814814814815,\n\
\ \"acc_norm_stderr\": 0.042844679680521934\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\
: {\n \"acc\": 0.7300613496932515,\n \"acc_stderr\": 0.034878251684978906,\n\
\ \"acc_norm\": 0.7300613496932515,\n \"acc_norm_stderr\": 0.034878251684978906\n\
\ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.48214285714285715,\n\
\ \"acc_stderr\": 0.047427623612430116,\n \"acc_norm\": 0.48214285714285715,\n\
\ \"acc_norm_stderr\": 0.047427623612430116\n },\n \"harness|hendrycksTest-management|5\"\
: {\n \"acc\": 0.7087378640776699,\n \"acc_stderr\": 0.044986763205729224,\n\
\ \"acc_norm\": 0.7087378640776699,\n \"acc_norm_stderr\": 0.044986763205729224\n\
\ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8247863247863247,\n\
\ \"acc_stderr\": 0.02490443909891824,\n \"acc_norm\": 0.8247863247863247,\n\
\ \"acc_norm_stderr\": 0.02490443909891824\n },\n \"harness|hendrycksTest-medical_genetics|5\"\
: {\n \"acc\": 0.64,\n \"acc_stderr\": 0.04824181513244218,\n \
\ \"acc_norm\": 0.64,\n \"acc_norm_stderr\": 0.04824181513244218\n \
\ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.6922094508301405,\n\
\ \"acc_stderr\": 0.016506045045155637,\n \"acc_norm\": 0.6922094508301405,\n\
\ \"acc_norm_stderr\": 0.016506045045155637\n },\n \"harness|hendrycksTest-moral_disputes|5\"\
: {\n \"acc\": 0.6647398843930635,\n \"acc_stderr\": 0.02541600377316554,\n\
\ \"acc_norm\": 0.6647398843930635,\n \"acc_norm_stderr\": 0.02541600377316554\n\
\ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.2994413407821229,\n\
\ \"acc_stderr\": 0.015318257745976706,\n \"acc_norm\": 0.2994413407821229,\n\
\ \"acc_norm_stderr\": 0.015318257745976706\n },\n \"harness|hendrycksTest-nutrition|5\"\
: {\n \"acc\": 0.6209150326797386,\n \"acc_stderr\": 0.02778014120702334,\n\
\ \"acc_norm\": 0.6209150326797386,\n \"acc_norm_stderr\": 0.02778014120702334\n\
\ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.6205787781350482,\n\
\ \"acc_stderr\": 0.02755994980234782,\n \"acc_norm\": 0.6205787781350482,\n\
\ \"acc_norm_stderr\": 0.02755994980234782\n },\n \"harness|hendrycksTest-prehistory|5\"\
: {\n \"acc\": 0.6203703703703703,\n \"acc_stderr\": 0.02700252103451646,\n\
\ \"acc_norm\": 0.6203703703703703,\n \"acc_norm_stderr\": 0.02700252103451646\n\
\ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\
acc\": 0.4397163120567376,\n \"acc_stderr\": 0.02960991207559411,\n \
\ \"acc_norm\": 0.4397163120567376,\n \"acc_norm_stderr\": 0.02960991207559411\n\
\ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4367666232073012,\n\
\ \"acc_stderr\": 0.012667701919603664,\n \"acc_norm\": 0.4367666232073012,\n\
\ \"acc_norm_stderr\": 0.012667701919603664\n },\n \"harness|hendrycksTest-professional_medicine|5\"\
: {\n \"acc\": 0.48161764705882354,\n \"acc_stderr\": 0.03035230339535196,\n\
\ \"acc_norm\": 0.48161764705882354,\n \"acc_norm_stderr\": 0.03035230339535196\n\
\ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\
acc\": 0.5637254901960784,\n \"acc_stderr\": 0.02006287424353913,\n \
\ \"acc_norm\": 0.5637254901960784,\n \"acc_norm_stderr\": 0.02006287424353913\n\
\ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6272727272727273,\n\
\ \"acc_stderr\": 0.04631381319425465,\n \"acc_norm\": 0.6272727272727273,\n\
\ \"acc_norm_stderr\": 0.04631381319425465\n },\n \"harness|hendrycksTest-security_studies|5\"\
: {\n \"acc\": 0.7183673469387755,\n \"acc_stderr\": 0.02879518557429129,\n\
\ \"acc_norm\": 0.7183673469387755,\n \"acc_norm_stderr\": 0.02879518557429129\n\
\ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8009950248756219,\n\
\ \"acc_stderr\": 0.028231365092758406,\n \"acc_norm\": 0.8009950248756219,\n\
\ \"acc_norm_stderr\": 0.028231365092758406\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\
: {\n \"acc\": 0.77,\n \"acc_stderr\": 0.042295258468165065,\n \
\ \"acc_norm\": 0.77,\n \"acc_norm_stderr\": 0.042295258468165065\n \
\ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.46987951807228917,\n\
\ \"acc_stderr\": 0.03885425420866767,\n \"acc_norm\": 0.46987951807228917,\n\
\ \"acc_norm_stderr\": 0.03885425420866767\n },\n \"harness|hendrycksTest-world_religions|5\"\
: {\n \"acc\": 0.695906432748538,\n \"acc_stderr\": 0.03528211258245231,\n\
\ \"acc_norm\": 0.695906432748538,\n \"acc_norm_stderr\": 0.03528211258245231\n\
\ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.30966952264381886,\n\
\ \"mc1_stderr\": 0.016185744355144912,\n \"mc2\": 0.4423687837225679,\n\
\ \"mc2_stderr\": 0.015079580060665993\n },\n \"harness|winogrande|5\"\
: {\n \"acc\": 0.7348066298342542,\n \"acc_stderr\": 0.012406549466192858\n\
\ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.5496588324488249,\n \
\ \"acc_stderr\": 0.013704390498582809\n }\n}\n```"
repo_url: https://huggingface.co/microsoft/phi-2
leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
point_of_contact: clementine@hf.co
configs:
- config_name: harness_arc_challenge_25
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|arc:challenge|25_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|arc:challenge|25_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|arc:challenge|25_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_gsm8k_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|gsm8k|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|gsm8k|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|gsm8k|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hellaswag_10
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hellaswag|10_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hellaswag|10_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hellaswag|10_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-management|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-virology|5_2023-12-14T09-31-24.484620.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-04-15T16-12-26.100927.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_abstract_algebra_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_anatomy_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-anatomy|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_astronomy_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-astronomy|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_business_ethics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_clinical_knowledge_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_college_biology_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-college_biology|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_college_chemistry_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_college_computer_science_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_college_mathematics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_college_medicine_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_college_physics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-college_physics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_computer_security_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-computer_security|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_conceptual_physics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_econometrics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-econometrics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_electrical_engineering_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_elementary_mathematics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_formal_logic_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_global_facts_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-global_facts|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_biology_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_chemistry_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_computer_science_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_european_history_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_geography_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_government_and_politics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_macroeconomics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_mathematics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_microeconomics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_physics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_psychology_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_statistics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_us_history_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_high_school_world_history_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_human_aging_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-human_aging|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_human_sexuality_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_international_law_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-international_law|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_jurisprudence_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_logical_fallacies_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_machine_learning_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_management_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-management|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-management|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-management|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_marketing_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-marketing|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_medical_genetics_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_miscellaneous_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_moral_disputes_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_moral_scenarios_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_nutrition_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-nutrition|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_philosophy_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-philosophy|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_prehistory_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-prehistory|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_professional_accounting_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_professional_law_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-professional_law|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_professional_medicine_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_professional_psychology_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_public_relations_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-public_relations|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_security_studies_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-security_studies|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_sociology_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-sociology|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_us_foreign_policy_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_virology_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-virology|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-virology|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-virology|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_hendrycksTest_world_religions_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|hendrycksTest-world_religions|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_truthfulqa_mc_0
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|truthfulqa:mc|0_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|truthfulqa:mc|0_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|truthfulqa:mc|0_2024-04-15T16-12-26.100927.parquet'
- config_name: harness_winogrande_5
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- '**/details_harness|winogrande|5_2023-12-14T09-31-24.484620.parquet'
- split: 2024_04_15T16_12_26.100927
path:
- '**/details_harness|winogrande|5_2024-04-15T16-12-26.100927.parquet'
- split: latest
path:
- '**/details_harness|winogrande|5_2024-04-15T16-12-26.100927.parquet'
- config_name: results
data_files:
- split: 2023_12_14T09_31_24.484620
path:
- results_2023-12-14T09-31-24.484620.parquet
- split: 2024_04_15T16_12_26.100927
path:
- results_2024-04-15T16-12-26.100927.parquet
- split: latest
path:
- results_2024-04-15T16-12-26.100927.parquet
---
# Dataset Card for Evaluation run of microsoft/phi-2
<!-- Provide a quick summary of the dataset. -->
Dataset automatically created during the evaluation run of model [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.
The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.
An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).
To load the details from a run, you can for instance do the following:
```python
from datasets import load_dataset

# The splits declared in this card are the run timestamps plus "latest".
data = load_dataset(
    "open-llm-leaderboard/details_microsoft__phi-2",
    "harness_winogrande_5",
    split="latest",
)
```
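The configuration and split names listed in the metadata of this card appear to follow a regular pattern: a task key such as `harness|truthfulqa:mc|0` becomes the configuration `harness_truthfulqa_mc_0`, and a run timestamp such as `2024-04-15T16:12:26.100927` becomes the split `2024_04_15T16_12_26.100927`. The sketch below reproduces that mapping. The helper names `config_name` and `split_name` are hypothetical, and the rule is inferred from the names shown in this card, not taken from the leaderboard's own tooling.

```python
import re


def config_name(task_key: str) -> str:
    """Map a task key such as 'harness|winogrande|5' to a configuration name.

    Assumption inferred from this card: every character other than a letter,
    digit or underscore is replaced by an underscore.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp such as '2024-04-15T16:12:26.100927' to a split name.

    Assumption inferred from this card: '-' and ':' become '_', while the dot
    before the fractional seconds is kept.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


if __name__ == "__main__":
    print(config_name("harness|hendrycksTest-abstract_algebra|5"))
    print(split_name("2023-12-14T09:31:24.484620"))
```

The returned strings can be passed to `load_dataset` as the configuration name and the `split` argument, respectively. The same pattern applies to the aggregated metrics: pass `"results"` as the configuration name to load them.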
## Latest results
These are the [latest results from run 2024-04-15T16:12:26.100927](https://huggingface.co/datasets/open-llm-leaderboard/details_microsoft__phi-2/blob/main/results_2024-04-15T16-12-26.100927.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each of them in the "results" configuration and in the "latest" split for each eval):
```python
{
"all": {
"acc": 0.5810413550660194,
"acc_stderr": 0.033772948029595365,
"acc_norm": 0.5826063358809028,
"acc_norm_stderr": 0.03446197582267999,
"mc1": 0.30966952264381886,
"mc1_stderr": 0.016185744355144912,
"mc2": 0.4423687837225679,
"mc2_stderr": 0.015079580060665993
},
"harness|arc:challenge|25": {
"acc": 0.5827645051194539,
"acc_stderr": 0.01440982551840308,
"acc_norm": 0.6100682593856656,
"acc_norm_stderr": 0.014252959848892896
},
"harness|hellaswag|10": {
"acc": 0.5617406891057558,
"acc_stderr": 0.004951594063272055,
"acc_norm": 0.7491535550687114,
"acc_norm_stderr": 0.004326143430360092
},
"harness|hendrycksTest-abstract_algebra|5": {
"acc": 0.29,
"acc_stderr": 0.045604802157206845,
"acc_norm": 0.29,
"acc_norm_stderr": 0.045604802157206845
},
"harness|hendrycksTest-anatomy|5": {
"acc": 0.4444444444444444,
"acc_stderr": 0.04292596718256981,
"acc_norm": 0.4444444444444444,
"acc_norm_stderr": 0.04292596718256981
},
"harness|hendrycksTest-astronomy|5": {
"acc": 0.5855263157894737,
"acc_stderr": 0.040089737857792046,
"acc_norm": 0.5855263157894737,
"acc_norm_stderr": 0.040089737857792046
},
"harness|hendrycksTest-business_ethics|5": {
"acc": 0.56,
"acc_stderr": 0.04988876515698589,
"acc_norm": 0.56,
"acc_norm_stderr": 0.04988876515698589
},
"harness|hendrycksTest-clinical_knowledge|5": {
"acc": 0.6037735849056604,
"acc_stderr": 0.030102793781791197,
"acc_norm": 0.6037735849056604,
"acc_norm_stderr": 0.030102793781791197
},
"harness|hendrycksTest-college_biology|5": {
"acc": 0.6666666666666666,
"acc_stderr": 0.03942082639927213,
"acc_norm": 0.6666666666666666,
"acc_norm_stderr": 0.03942082639927213
},
"harness|hendrycksTest-college_chemistry|5": {
"acc": 0.4,
"acc_stderr": 0.04923659639173309,
"acc_norm": 0.4,
"acc_norm_stderr": 0.04923659639173309
},
"harness|hendrycksTest-college_computer_science|5": {
"acc": 0.41,
"acc_stderr": 0.049431107042371025,
"acc_norm": 0.41,
"acc_norm_stderr": 0.049431107042371025
},
"harness|hendrycksTest-college_mathematics|5": {
"acc": 0.38,
"acc_stderr": 0.048783173121456344,
"acc_norm": 0.38,
"acc_norm_stderr": 0.048783173121456344
},
"harness|hendrycksTest-college_medicine|5": {
"acc": 0.5895953757225434,
"acc_stderr": 0.03750757044895537,
"acc_norm": 0.5895953757225434,
"acc_norm_stderr": 0.03750757044895537
},
"harness|hendrycksTest-college_physics|5": {
"acc": 0.37254901960784315,
"acc_stderr": 0.048108401480826346,
"acc_norm": 0.37254901960784315,
"acc_norm_stderr": 0.048108401480826346
},
"harness|hendrycksTest-computer_security|5": {
"acc": 0.74,
"acc_stderr": 0.04408440022768078,
"acc_norm": 0.74,
"acc_norm_stderr": 0.04408440022768078
},
"harness|hendrycksTest-conceptual_physics|5": {
"acc": 0.5234042553191489,
"acc_stderr": 0.032650194750335815,
"acc_norm": 0.5234042553191489,
"acc_norm_stderr": 0.032650194750335815
},
"harness|hendrycksTest-econometrics|5": {
"acc": 0.38596491228070173,
"acc_stderr": 0.04579639422070434,
"acc_norm": 0.38596491228070173,
"acc_norm_stderr": 0.04579639422070434
},
"harness|hendrycksTest-electrical_engineering|5": {
"acc": 0.5517241379310345,
"acc_stderr": 0.04144311810878152,
"acc_norm": 0.5517241379310345,
"acc_norm_stderr": 0.04144311810878152
},
"harness|hendrycksTest-elementary_mathematics|5": {
"acc": 0.4603174603174603,
"acc_stderr": 0.025670080636909186,
"acc_norm": 0.4603174603174603,
"acc_norm_stderr": 0.025670080636909186
},
"harness|hendrycksTest-formal_logic|5": {
"acc": 0.35714285714285715,
"acc_stderr": 0.04285714285714281,
"acc_norm": 0.35714285714285715,
"acc_norm_stderr": 0.04285714285714281
},
"harness|hendrycksTest-global_facts|5": {
"acc": 0.38,
"acc_stderr": 0.048783173121456316,
"acc_norm": 0.38,
"acc_norm_stderr": 0.048783173121456316
},
"harness|hendrycksTest-high_school_biology|5": {
"acc": 0.6709677419354839,
"acc_stderr": 0.026729499068349958,
"acc_norm": 0.6709677419354839,
"acc_norm_stderr": 0.026729499068349958
},
"harness|hendrycksTest-high_school_chemistry|5": {
"acc": 0.4729064039408867,
"acc_stderr": 0.03512819077876106,
"acc_norm": 0.4729064039408867,
"acc_norm_stderr": 0.03512819077876106
},
"harness|hendrycksTest-high_school_computer_science|5": {
"acc": 0.64,
"acc_stderr": 0.048241815132442176,
"acc_norm": 0.64,
"acc_norm_stderr": 0.048241815132442176
},
"harness|hendrycksTest-high_school_european_history|5": {
"acc": 0.6484848484848484,
"acc_stderr": 0.0372820699868265,
"acc_norm": 0.6484848484848484,
"acc_norm_stderr": 0.0372820699868265
},
"harness|hendrycksTest-high_school_geography|5": {
"acc": 0.7525252525252525,
"acc_stderr": 0.030746300742124498,
"acc_norm": 0.7525252525252525,
"acc_norm_stderr": 0.030746300742124498
},
"harness|hendrycksTest-high_school_government_and_politics|5": {
"acc": 0.8082901554404145,
"acc_stderr": 0.028408953626245282,
"acc_norm": 0.8082901554404145,
"acc_norm_stderr": 0.028408953626245282
},
"harness|hendrycksTest-high_school_macroeconomics|5": {
"acc": 0.5820512820512821,
"acc_stderr": 0.02500732988246122,
"acc_norm": 0.5820512820512821,
"acc_norm_stderr": 0.02500732988246122
},
"harness|hendrycksTest-high_school_mathematics|5": {
"acc": 0.3296296296296296,
"acc_stderr": 0.02866120111652458,
"acc_norm": 0.3296296296296296,
"acc_norm_stderr": 0.02866120111652458
},
"harness|hendrycksTest-high_school_microeconomics|5": {
"acc": 0.6218487394957983,
"acc_stderr": 0.031499305777849054,
"acc_norm": 0.6218487394957983,
"acc_norm_stderr": 0.031499305777849054
},
"harness|hendrycksTest-high_school_physics|5": {
"acc": 0.3841059602649007,
"acc_stderr": 0.03971301814719197,
"acc_norm": 0.3841059602649007,
"acc_norm_stderr": 0.03971301814719197
},
"harness|hendrycksTest-high_school_psychology|5": {
"acc": 0.7926605504587156,
"acc_stderr": 0.017381415563608674,
"acc_norm": 0.7926605504587156,
"acc_norm_stderr": 0.017381415563608674
},
"harness|hendrycksTest-high_school_statistics|5": {
"acc": 0.4722222222222222,
"acc_stderr": 0.0340470532865388,
"acc_norm": 0.4722222222222222,
"acc_norm_stderr": 0.0340470532865388
},
"harness|hendrycksTest-high_school_us_history|5": {
"acc": 0.6666666666666666,
"acc_stderr": 0.03308611113236436,
"acc_norm": 0.6666666666666666,
"acc_norm_stderr": 0.03308611113236436
},
"harness|hendrycksTest-high_school_world_history|5": {
"acc": 0.7426160337552743,
"acc_stderr": 0.02845882099146029,
"acc_norm": 0.7426160337552743,
"acc_norm_stderr": 0.02845882099146029
},
"harness|hendrycksTest-human_aging|5": {
"acc": 0.6591928251121076,
"acc_stderr": 0.03181149747055359,
"acc_norm": 0.6591928251121076,
"acc_norm_stderr": 0.03181149747055359
},
"harness|hendrycksTest-human_sexuality|5": {
"acc": 0.7022900763358778,
"acc_stderr": 0.040103589424622034,
"acc_norm": 0.7022900763358778,
"acc_norm_stderr": 0.040103589424622034
},
"harness|hendrycksTest-international_law|5": {
"acc": 0.7272727272727273,
"acc_stderr": 0.04065578140908705,
"acc_norm": 0.7272727272727273,
"acc_norm_stderr": 0.04065578140908705
},
"harness|hendrycksTest-jurisprudence|5": {
"acc": 0.7314814814814815,
"acc_stderr": 0.042844679680521934,
"acc_norm": 0.7314814814814815,
"acc_norm_stderr": 0.042844679680521934
},
"harness|hendrycksTest-logical_fallacies|5": {
"acc": 0.7300613496932515,
"acc_stderr": 0.034878251684978906,
"acc_norm": 0.7300613496932515,
"acc_norm_stderr": 0.034878251684978906
},
"harness|hendrycksTest-machine_learning|5": {
"acc": 0.48214285714285715,
"acc_stderr": 0.047427623612430116,
"acc_norm": 0.48214285714285715,
"acc_norm_stderr": 0.047427623612430116
},
"harness|hendrycksTest-management|5": {
"acc": 0.7087378640776699,
"acc_stderr": 0.044986763205729224,
"acc_norm": 0.7087378640776699,
"acc_norm_stderr": 0.044986763205729224
},
"harness|hendrycksTest-marketing|5": {
"acc": 0.8247863247863247,
"acc_stderr": 0.02490443909891824,
"acc_norm": 0.8247863247863247,
"acc_norm_stderr": 0.02490443909891824
},
"harness|hendrycksTest-medical_genetics|5": {
"acc": 0.64,
"acc_stderr": 0.04824181513244218,
"acc_norm": 0.64,
"acc_norm_stderr": 0.04824181513244218
},
"harness|hendrycksTest-miscellaneous|5": {
"acc": 0.6922094508301405,
"acc_stderr": 0.016506045045155637,
"acc_norm": 0.6922094508301405,
"acc_norm_stderr": 0.016506045045155637
},
"harness|hendrycksTest-moral_disputes|5": {
"acc": 0.6647398843930635,
"acc_stderr": 0.02541600377316554,
"acc_norm": 0.6647398843930635,
"acc_norm_stderr": 0.02541600377316554
},
"harness|hendrycksTest-moral_scenarios|5": {
"acc": 0.2994413407821229,
"acc_stderr": 0.015318257745976706,
"acc_norm": 0.2994413407821229,
"acc_norm_stderr": 0.015318257745976706
},
"harness|hendrycksTest-nutrition|5": {
"acc": 0.6209150326797386,
"acc_stderr": 0.02778014120702334,
"acc_norm": 0.6209150326797386,
"acc_norm_stderr": 0.02778014120702334
},
"harness|hendrycksTest-philosophy|5": {
"acc": 0.6205787781350482,
"acc_stderr": 0.02755994980234782,
"acc_norm": 0.6205787781350482,
"acc_norm_stderr": 0.02755994980234782
},
"harness|hendrycksTest-prehistory|5": {
"acc": 0.6203703703703703,
"acc_stderr": 0.02700252103451646,
"acc_norm": 0.6203703703703703,
"acc_norm_stderr": 0.02700252103451646
},
"harness|hendrycksTest-professional_accounting|5": {
"acc": 0.4397163120567376,
"acc_stderr": 0.02960991207559411,
"acc_norm": 0.4397163120567376,
"acc_norm_stderr": 0.02960991207559411
},
"harness|hendrycksTest-professional_law|5": {
"acc": 0.4367666232073012,
"acc_stderr": 0.012667701919603664,
"acc_norm": 0.4367666232073012,
"acc_norm_stderr": 0.012667701919603664
},
"harness|hendrycksTest-professional_medicine|5": {
"acc": 0.48161764705882354,
"acc_stderr": 0.03035230339535196,
"acc_norm": 0.48161764705882354,
"acc_norm_stderr": 0.03035230339535196
},
"harness|hendrycksTest-professional_psychology|5": {
"acc": 0.5637254901960784,
"acc_stderr": 0.02006287424353913,
"acc_norm": 0.5637254901960784,
"acc_norm_stderr": 0.02006287424353913
},
"harness|hendrycksTest-public_relations|5": {
"acc": 0.6272727272727273,
"acc_stderr": 0.04631381319425465,
"acc_norm": 0.6272727272727273,
"acc_norm_stderr": 0.04631381319425465
},
"harness|hendrycksTest-security_studies|5": {
"acc": 0.7183673469387755,
"acc_stderr": 0.02879518557429129,
"acc_norm": 0.7183673469387755,
"acc_norm_stderr": 0.02879518557429129
},
"harness|hendrycksTest-sociology|5": {
"acc": 0.8009950248756219,
"acc_stderr": 0.028231365092758406,
"acc_norm": 0.8009950248756219,
"acc_norm_stderr": 0.028231365092758406
},
"harness|hendrycksTest-us_foreign_policy|5": {
"acc": 0.77,
"acc_stderr": 0.042295258468165065,
"acc_norm": 0.77,
"acc_norm_stderr": 0.042295258468165065
},
"harness|hendrycksTest-virology|5": {
"acc": 0.46987951807228917,
"acc_stderr": 0.03885425420866767,
"acc_norm": 0.46987951807228917,
"acc_norm_stderr": 0.03885425420866767
},
"harness|hendrycksTest-world_religions|5": {
"acc": 0.695906432748538,
"acc_stderr": 0.03528211258245231,
"acc_norm": 0.695906432748538,
"acc_norm_stderr": 0.03528211258245231
},
"harness|truthfulqa:mc|0": {
"mc1": 0.30966952264381886,
"mc1_stderr": 0.016185744355144912,
"mc2": 0.4423687837225679,
"mc2_stderr": 0.015079580060665993
},
"harness|winogrande|5": {
"acc": 0.7348066298342542,
"acc_stderr": 0.012406549466192858
},
"harness|gsm8k|5": {
"acc": 0.5496588324488249,
"acc_stderr": 0.013704390498582809
}
}
```
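The keys in the results above follow the pattern `harness|<task>|<n_shot>`, and all MMLU subtasks share the `hendrycksTest-` prefix. As a minimal sketch, the excerpt below (three subtask values copied from the table, not the full set) shows how those keys can be split and how an unweighted mean over the subtasks present could be computed. The helper names `parse_key` and `mmlu_mean_acc` are hypothetical, and whether the leaderboard aggregates in exactly this way is an assumption not confirmed by the card.

```python
from statistics import mean

# Excerpt of the results above (values copied verbatim; not the full table).
results = {
    "harness|hendrycksTest-college_physics|5": {"acc": 0.37254901960784315},
    "harness|hendrycksTest-computer_security|5": {"acc": 0.74},
    "harness|hendrycksTest-marketing|5": {"acc": 0.8247863247863247},
    "harness|winogrande|5": {"acc": 0.7348066298342542},
    "harness|gsm8k|5": {"acc": 0.5496588324488249},
}


def parse_key(key):
    """Split 'harness|<task>|<n_shot>' into (task, n_shot)."""
    _, task, n_shot = key.split("|")
    return task, int(n_shot)


def mmlu_mean_acc(results):
    """Unweighted mean accuracy over the hendrycksTest-* subtasks present."""
    accs = [
        metrics["acc"]
        for key, metrics in results.items()
        if key.startswith("harness|hendrycksTest-")
    ]
    return mean(accs)


print(parse_key("harness|gsm8k|5"))
print(round(mmlu_mean_acc(results), 4))
```

Filtering on the key prefix also skips the top-level `"all"` entry of the full results file, which does not follow the three-part key pattern.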
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
[More Information Needed]
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
[More Information Needed]
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
[More Information Needed]
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
[More Information Needed]
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
[More Information Needed]
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
[More Information Needed]
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
[More Information Needed]
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
[More Information Needed]
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Citation [optional]
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Dataset Card Authors [optional]
[More Information Needed]
## Dataset Card Contact
[More Information Needed]
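Provided by:
open-llm-leaderboard-old
Summary of Original Information
Dataset Overview
Basic Dataset Information
- Name: Evaluation run of microsoft/phi-2
- Source: created automatically during the evaluation run of the model microsoft/phi-2
- Purpose: evaluation for the Open LLM Leaderboard
Dataset Structure
- Number of configurations: 63
- Each configuration corresponds to: one evaluated task
- Number of runs: 2
- Split naming: the timestamp of the run
- Latest-results split: the "train" split points to the latest results
Additional Configuration
- "results" configuration: stores the aggregated results of all runs; used to compute and display the aggregated metrics
Data Loading Example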
```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_microsoft__phi-2",
    "harness_winogrande_5",
    split="train",
)
```
Latest Results
- Latest run timestamp: 2024-04-15T16:12:26.100927
- Result details: per-task metrics including accuracy (acc), normalized accuracy (acc_norm), and their standard errors (stderr)
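Each accuracy is reported together with a standard error, so one rough way to read a result is as an interval rather than a single number. The sketch below assumes a normal approximation (z = 1.96 for about 95% coverage), which is a common convention and not something the card prescribes; `normal_ci` is a hypothetical helper name. The input values are the `harness|winogrande|5` figures from the results.

```python
def normal_ci(acc, stderr, z=1.96):
    """Approximate confidence interval for an accuracy, clipped to [0, 1].

    Assumes the sampling distribution is roughly normal, which is an
    illustrative choice and not part of the dataset card.
    """
    low = max(0.0, acc - z * stderr)
    high = min(1.0, acc + z * stderr)
    return low, high


# Values taken from the "harness|winogrande|5" entry in the results.
low, high = normal_ci(0.7348066298342542, 0.012406549466192858)
print(f"winogrande acc is roughly in [{low:.3f}, {high:.3f}]")
```

Tasks with few questions (several MMLU subtasks have standard errors near 0.05) give much wider intervals, so small differences between them should be read with caution.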
Configuration Details
- Configuration name: harness_arc_challenge_25
  - Splits: 2023_12_14T09_31_24.484620, 2024_04_15T16_12_26.100927, latest
  - Path: path of the corresponding parquet file
- Configuration name: harness_gsm8k_5
  - Splits: 2023_12_14T09_31_24.484620, 2024_04_15T16_12_26.100927, latest
  - Path: path of the corresponding parquet file
- Configuration name: harness_hellaswag_10
  - Splits: 2023_12_14T09_31_24.484620, 2024_04_15T16_12_26.100927, latest
  - Path: path of the corresponding parquet file
- Configuration name: harness_hendrycksTest_5
  - Splits: 2023_12_14T09_31_24.484620, 2024_04_15T16_12_26.100927, latest
  - Path: path of the corresponding parquet file
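The timestamped split names listed above can be parsed back into datetimes, for example to sort runs or to check which run an alias split should point to. This is a minimal sketch; the format string is inferred from the two split names shown here, and `parse_split` and `newest_split` are hypothetical helper names.

```python
from datetime import datetime

# Format inferred from names such as "2023_12_14T09_31_24.484620".
SPLIT_FORMAT = "%Y_%m_%dT%H_%M_%S.%f"


def parse_split(name):
    """Convert a timestamped split name into a datetime."""
    return datetime.strptime(name, SPLIT_FORMAT)


def newest_split(names):
    """Return the most recent timestamped split, ignoring the 'latest' alias."""
    timestamped = [n for n in names if n != "latest"]
    return max(timestamped, key=parse_split)


splits = ["2023_12_14T09_31_24.484620", "2024_04_15T16_12_26.100927", "latest"]
print(newest_split(splits))
```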
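Collected Summary
Dataset Introduction
Construction
In the field of large language model evaluation, this dataset is an automated by-product of the Open LLM Leaderboard evaluation pipeline. It is built by systematically testing the microsoft/phi-2 model on standard benchmark tasks. The dataset was generated from two independent evaluation runs, each stored as a split named after the run's timestamp, and is organized into 63 configurations, one per evaluation task. A special configuration named "results" aggregates the metrics from all runs and provides structured data for an overall measure of model capability.
Characteristics
The dataset is highly structured and versioned. Data is organized by configuration, each corresponding to a specific evaluation task such as the ARC Challenge or HellaSwag, which gives a fine-grained picture of the model across several dimensions. The "latest" split always points to the most recent evaluation results, keeping the information current, while the retained historical run splits support comparison of model performance over time. This design makes the dataset a traceable evaluation archive rather than a static performance record.
Usage
Researchers can load the dataset with the `datasets` library for further analysis. Specifying the dataset name, the target configuration (for example "harness_winogrande_5"), and the desired split (such as "train" for the latest results) returns the detailed evaluation records for that task. These records provide the raw data for analyzing the model's performance across knowledge domains and reasoning tasks, its error patterns, and how its performance changes between runs. They are a useful resource for model comparison, weakness diagnosis, and research on evaluation methods.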
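To go from a key in the results table to the configuration name passed to `load_dataset`, one option is to replace every run of non-alphanumeric characters with an underscore. This mapping is an assumption inferred from the names that appear in this card (`harness|winogrande|5` and `harness_winogrande_5`, for instance), and `config_name` and `load_task_details` are hypothetical helpers. The loader is defined but not called, since it needs network access and the `datasets` package.

```python
import re

# Repository id as written in the card's loading example.
REPO_ID = "open-llm-leaderboard/details_microsoft__phi-2"


def config_name(results_key):
    """Map a results key to a configuration name (inferred convention)."""
    return re.sub(r"[^0-9A-Za-z]+", "_", results_key)


def load_task_details(results_key, split="train"):
    """Load the per-sample details for one task; requires network access."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(REPO_ID, config_name(results_key), split=split)


print(config_name("harness|winogrande|5"))
print(config_name("harness|truthfulqa:mc|0"))
```

Passing one of the timestamped split names instead of `"train"` would select a specific run, following the split naming described in the card.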
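Background and Challenges
Background Overview
Amid the rapid development of large language models (LLMs), systematic and standardized evaluation of model performance has become key to progress in the field. The Open LLM Leaderboard, an evaluation framework hosted on the Hugging Face platform, aims to measure the overall capability of different models objectively through multi-dimensional benchmarks. This dataset records the detailed evaluation results of Microsoft's Phi-2 model on the leaderboard between 2023 and 2024, covering 63 task configurations including the ARC Challenge, HellaSwag, MMLU, and TruthfulQA. It arose from the research community's need for transparent, reproducible evaluation. The structured data produced by the automated pipeline gives an empirical basis for comparing and improving models and has supported the open-source model ecosystem.
Current Challenges
The core challenge this dataset addresses is how to evaluate large language models comprehensively and fairly on complex cognitive tasks. Evaluation has to cover commonsense reasoning, domain knowledge, mathematical computation, and ethical judgment, and the tasks must be able to distinguish small differences in model capability. On the construction side, the main challenges are automating and standardizing the evaluation pipeline: integrating evaluation frameworks such as lm-evaluation-harness, unifying the data formats and metrics of different tasks, and keeping the results of repeated runs traceable and consistent. In addition, as models are iterated and benchmarks are updated, maintaining the dataset requires adapting to new task configurations and evaluation standards, which places high demands on data versioning and extensibility.
Common Scenarios
Typical Use Case
As the output of an Open LLM Leaderboard evaluation run, the dataset is typically used to give researchers detailed performance data for the Phi-2 model across diverse benchmarks. With 63 task configurations covering the ARC Challenge, HellaSwag, MMLU, TruthfulQA, and others, it allows side-by-side analysis of core capabilities such as commonsense reasoning, knowledge question answering, mathematical problem solving, and truthfulness, and provides a structured baseline for mapping model capability.
Academic Problems Addressed
The dataset addresses standardization and reproducibility, two key academic problems in LLM evaluation. With fine-grained, multi-task results produced under a single evaluation framework, researchers can examine the limits and biases of a model across knowledge domains, for example by identifying performance differences between STEM subjects and the humanities and social sciences. This supports a shift in model evaluation from a single metric toward multi-dimensional capability analysis and provides an empirical basis for model optimization and capability alignment.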
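A difference between two subject areas can be given a rough scale by dividing the accuracy gap by the combined standard error. The sketch below compares `high_school_us_history` with `high_school_mathematics` using the values from the results table. It assumes the two estimates are independent and approximately normal, which is an illustrative simplification; `z_score` is a hypothetical helper name.

```python
from math import sqrt


def z_score(acc_a, se_a, acc_b, se_b):
    """Gap between two accuracies in units of their combined standard error.

    Assumes independent, roughly normal estimates (an illustrative choice).
    """
    return (acc_a - acc_b) / sqrt(se_a**2 + se_b**2)


# Values copied from the hendrycksTest entries in the results table.
history = (0.6666666666666666, 0.03308611113236436)  # high_school_us_history
maths = (0.3296296296296296, 0.02866120111652458)    # high_school_mathematics

z = z_score(*history, *maths)
print(f"history vs. mathematics gap: z = {z:.2f}")
```

A gap of several combined standard errors, as here, is unlikely to be sampling noise alone, whereas gaps between subtasks with standard errors near 0.05 often are not distinguishable.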
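Derived Work
Work derived from this dataset includes studies that trace model capability from multi-task evaluation results, for example by analyzing correlations between tasks to examine how models generalize. Other studies use its fine-grained error samples for targeted fine-tuning to improve performance in weak areas. This work has deepened the understanding of evaluation methodology and has led to directions such as dynamic evaluation benchmarks and domain-adaptive evaluation frameworks.
The content above was collected and summarized by 遇见数据集.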



