
open-llm-leaderboard/details_tricktreat__Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj

Hugging Face, updated 2024-04-16, indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_tricktreat__Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj
Resource description:
--- pretty_name: Evaluation run of tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj](https://huggingface.co/tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_tricktreat__Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-04-16T04:02:45.165972](https://huggingface.co/datasets/open-llm-leaderboard/details_tricktreat__Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj/blob/main/results_2024-04-16T04-02-45.165972.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.47119617088878807,\n\ \ \"acc_stderr\": 0.03446747326117775,\n \"acc_norm\": 0.47711994927683626,\n\ \ \"acc_norm_stderr\": 0.035254004485849776,\n \"mc1\": 0.29008567931456547,\n\ \ \"mc1_stderr\": 0.01588623687420952,\n \"mc2\": 0.4315330091327328,\n\ \ \"mc2_stderr\": 0.015209088465437729\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.48208191126279865,\n \"acc_stderr\": 0.014602005585490978,\n\ \ \"acc_norm\": 0.514505119453925,\n \"acc_norm_stderr\": 0.014605241081370056\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.5862378012348137,\n\ \ \"acc_stderr\": 0.00491500349951783,\n \"acc_norm\": 0.7698665604461262,\n\ \ \"acc_norm_stderr\": 0.004200578535056529\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.37777777777777777,\n\ \ \"acc_stderr\": 0.04188307537595853,\n \"acc_norm\": 0.37777777777777777,\n\ \ \"acc_norm_stderr\": 0.04188307537595853\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.46710526315789475,\n \"acc_stderr\": 0.040601270352363966,\n\ \ \"acc_norm\": 0.46710526315789475,\n \"acc_norm_stderr\": 0.040601270352363966\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.45,\n\ \ \"acc_stderr\": 0.049999999999999996,\n \"acc_norm\": 0.45,\n \ \ \"acc_norm_stderr\": 0.049999999999999996\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.4490566037735849,\n \"acc_stderr\": 0.030612730713641092,\n\ \ \"acc_norm\": 0.4490566037735849,\n \"acc_norm_stderr\": 0.030612730713641092\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.5,\n\ \ \"acc_stderr\": 0.04181210050035455,\n \"acc_norm\": 0.5,\n \ \ \"acc_norm_stderr\": 0.04181210050035455\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.33,\n \"acc_stderr\": 0.04725815626252604,\n \ \ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.04725815626252604\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.38,\n \"acc_stderr\": 0.04878317312145633,\n \"acc_norm\": 0.38,\n\ \ \"acc_norm_stderr\": 0.04878317312145633\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.28,\n \"acc_stderr\": 0.04512608598542128,\n \ \ \"acc_norm\": 0.28,\n \"acc_norm_stderr\": 0.04512608598542128\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.4161849710982659,\n\ \ \"acc_stderr\": 0.03758517775404947,\n \"acc_norm\": 0.4161849710982659,\n\ \ \"acc_norm_stderr\": 0.03758517775404947\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.27450980392156865,\n \"acc_stderr\": 0.04440521906179327,\n\ \ \"acc_norm\": 0.27450980392156865,\n \"acc_norm_stderr\": 0.04440521906179327\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.52,\n \"acc_stderr\": 0.050211673156867795,\n \"acc_norm\": 0.52,\n\ \ \"acc_norm_stderr\": 0.050211673156867795\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.43829787234042555,\n \"acc_stderr\": 0.03243618636108101,\n\ \ \"acc_norm\": 0.43829787234042555,\n \"acc_norm_stderr\": 0.03243618636108101\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.30701754385964913,\n\ \ \"acc_stderr\": 0.043391383225798615,\n \"acc_norm\": 0.30701754385964913,\n\ \ \"acc_norm_stderr\": 0.043391383225798615\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.4206896551724138,\n \"acc_stderr\": 0.0411391498118926,\n\ \ \"acc_norm\": 0.4206896551724138,\n \"acc_norm_stderr\": 0.0411391498118926\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.28835978835978837,\n \"acc_stderr\": 0.023330654054535892,\n \"\ acc_norm\": 0.28835978835978837,\n 
\"acc_norm_stderr\": 0.023330654054535892\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.23809523809523808,\n\ \ \"acc_stderr\": 0.03809523809523812,\n \"acc_norm\": 0.23809523809523808,\n\ \ \"acc_norm_stderr\": 0.03809523809523812\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.35,\n \"acc_stderr\": 0.047937248544110196,\n \ \ \"acc_norm\": 0.35,\n \"acc_norm_stderr\": 0.047937248544110196\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\ : 0.5225806451612903,\n \"acc_stderr\": 0.028414985019707868,\n \"\ acc_norm\": 0.5225806451612903,\n \"acc_norm_stderr\": 0.028414985019707868\n\ \ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\ : 0.3793103448275862,\n \"acc_stderr\": 0.034139638059062345,\n \"\ acc_norm\": 0.3793103448275862,\n \"acc_norm_stderr\": 0.034139638059062345\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.44,\n \"acc_stderr\": 0.049888765156985884,\n \"acc_norm\"\ : 0.44,\n \"acc_norm_stderr\": 0.049888765156985884\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.5757575757575758,\n \"acc_stderr\": 0.03859268142070264,\n\ \ \"acc_norm\": 0.5757575757575758,\n \"acc_norm_stderr\": 0.03859268142070264\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.5808080808080808,\n \"acc_stderr\": 0.03515520728670417,\n \"\ acc_norm\": 0.5808080808080808,\n \"acc_norm_stderr\": 0.03515520728670417\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.7150259067357513,\n \"acc_stderr\": 0.032577140777096614,\n\ \ \"acc_norm\": 0.7150259067357513,\n \"acc_norm_stderr\": 0.032577140777096614\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.4641025641025641,\n \"acc_stderr\": 0.025285585990017838,\n\ \ \"acc_norm\": 0.4641025641025641,\n \"acc_norm_stderr\": 0.025285585990017838\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.26666666666666666,\n \"acc_stderr\": 0.026962424325073828,\n \ \ \"acc_norm\": 0.26666666666666666,\n \"acc_norm_stderr\": 0.026962424325073828\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.41596638655462187,\n \"acc_stderr\": 0.03201650100739615,\n\ \ \"acc_norm\": 0.41596638655462187,\n \"acc_norm_stderr\": 0.03201650100739615\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.36423841059602646,\n \"acc_stderr\": 0.03929111781242741,\n \"\ acc_norm\": 0.36423841059602646,\n \"acc_norm_stderr\": 0.03929111781242741\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.655045871559633,\n \"acc_stderr\": 0.020380605405066952,\n \"\ acc_norm\": 0.655045871559633,\n \"acc_norm_stderr\": 0.020380605405066952\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.36574074074074076,\n \"acc_stderr\": 0.03284738857647207,\n \"\ acc_norm\": 0.36574074074074076,\n \"acc_norm_stderr\": 0.03284738857647207\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.5882352941176471,\n \"acc_stderr\": 0.03454236585380609,\n \"\ acc_norm\": 0.5882352941176471,\n \"acc_norm_stderr\": 0.03454236585380609\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.5907172995780591,\n \"acc_stderr\": 0.03200704183359592,\n \ \ \"acc_norm\": 0.5907172995780591,\n \"acc_norm_stderr\": 0.03200704183359592\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.5381165919282511,\n\ \ \"acc_stderr\": 0.033460150119732274,\n \"acc_norm\": 0.5381165919282511,\n\ \ \"acc_norm_stderr\": 0.033460150119732274\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.5648854961832062,\n \"acc_stderr\": 0.04348208051644858,\n\ \ \"acc_norm\": 0.5648854961832062,\n \"acc_norm_stderr\": 0.04348208051644858\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.6528925619834711,\n \"acc_stderr\": 0.043457245702925335,\n \"\ acc_norm\": 0.6528925619834711,\n \"acc_norm_stderr\": 0.043457245702925335\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.5462962962962963,\n\ \ \"acc_stderr\": 0.04812917324536823,\n \"acc_norm\": 0.5462962962962963,\n\ \ \"acc_norm_stderr\": 0.04812917324536823\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.5460122699386503,\n \"acc_stderr\": 0.0391170190467718,\n\ \ \"acc_norm\": 0.5460122699386503,\n \"acc_norm_stderr\": 0.0391170190467718\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.30357142857142855,\n\ \ \"acc_stderr\": 0.04364226155841044,\n \"acc_norm\": 0.30357142857142855,\n\ \ \"acc_norm_stderr\": 0.04364226155841044\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.6990291262135923,\n \"acc_stderr\": 0.045416094465039476,\n\ \ \"acc_norm\": 0.6990291262135923,\n \"acc_norm_stderr\": 0.045416094465039476\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.6196581196581197,\n\ \ \"acc_stderr\": 0.03180425204384099,\n \"acc_norm\": 0.6196581196581197,\n\ \ \"acc_norm_stderr\": 0.03180425204384099\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.54,\n \"acc_stderr\": 0.05009082659620332,\n \ \ \"acc_norm\": 0.54,\n \"acc_norm_stderr\": 0.05009082659620332\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.6538952745849298,\n\ \ \"acc_stderr\": 0.017011965266412073,\n \"acc_norm\": 0.6538952745849298,\n\ \ \"acc_norm_stderr\": 0.017011965266412073\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.5404624277456648,\n \"acc_stderr\": 0.026830805998952257,\n\ \ \"acc_norm\": 0.5404624277456648,\n \"acc_norm_stderr\": 0.026830805998952257\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.3340782122905028,\n\ \ \"acc_stderr\": 0.01577491142238163,\n \"acc_norm\": 
0.3340782122905028,\n\ \ \"acc_norm_stderr\": 0.01577491142238163\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.5424836601307189,\n \"acc_stderr\": 0.02852638345214263,\n\ \ \"acc_norm\": 0.5424836601307189,\n \"acc_norm_stderr\": 0.02852638345214263\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.5080385852090032,\n\ \ \"acc_stderr\": 0.028394421370984545,\n \"acc_norm\": 0.5080385852090032,\n\ \ \"acc_norm_stderr\": 0.028394421370984545\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.5339506172839507,\n \"acc_stderr\": 0.027756535257347663,\n\ \ \"acc_norm\": 0.5339506172839507,\n \"acc_norm_stderr\": 0.027756535257347663\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.29432624113475175,\n \"acc_stderr\": 0.027187127011503793,\n \ \ \"acc_norm\": 0.29432624113475175,\n \"acc_norm_stderr\": 0.027187127011503793\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.333116036505867,\n\ \ \"acc_stderr\": 0.012037930451512052,\n \"acc_norm\": 0.333116036505867,\n\ \ \"acc_norm_stderr\": 0.012037930451512052\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.48161764705882354,\n \"acc_stderr\": 0.03035230339535196,\n\ \ \"acc_norm\": 0.48161764705882354,\n \"acc_norm_stderr\": 0.03035230339535196\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.4117647058823529,\n \"acc_stderr\": 0.019910377463105935,\n \ \ \"acc_norm\": 0.4117647058823529,\n \"acc_norm_stderr\": 0.019910377463105935\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.5272727272727272,\n\ \ \"acc_stderr\": 0.04782001791380061,\n \"acc_norm\": 0.5272727272727272,\n\ \ \"acc_norm_stderr\": 0.04782001791380061\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.5224489795918368,\n \"acc_stderr\": 0.031976941187136725,\n\ \ \"acc_norm\": 0.5224489795918368,\n \"acc_norm_stderr\": 0.031976941187136725\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.6368159203980099,\n\ \ \"acc_stderr\": 0.034005985055990146,\n \"acc_norm\": 0.6368159203980099,\n\ \ \"acc_norm_stderr\": 0.034005985055990146\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.64,\n \"acc_stderr\": 0.04824181513244218,\n \ \ \"acc_norm\": 0.64,\n \"acc_norm_stderr\": 0.04824181513244218\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.4819277108433735,\n\ \ \"acc_stderr\": 0.038899512528272166,\n \"acc_norm\": 0.4819277108433735,\n\ \ \"acc_norm_stderr\": 0.038899512528272166\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.7017543859649122,\n \"acc_stderr\": 0.03508771929824562,\n\ \ \"acc_norm\": 0.7017543859649122,\n \"acc_norm_stderr\": 0.03508771929824562\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.29008567931456547,\n\ \ \"mc1_stderr\": 0.01588623687420952,\n \"mc2\": 0.4315330091327328,\n\ \ \"mc2_stderr\": 0.015209088465437729\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.6921862667719021,\n \"acc_stderr\": 0.01297294666120502\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.11675511751326763,\n \ \ \"acc_stderr\": 0.008845468136919098\n }\n}\n```" repo_url: https://huggingface.co/tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|arc:challenge|25_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-04-16T04-02-45.165972.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|gsm8k|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hellaswag_10 data_files: - split: 
2024_04_16T04_02_45.165972 path: - '**/details_harness|hellaswag|10_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-16T04-02-45.165972.parquet' - 
'**/details_harness|hendrycksTest-high_school_biology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-16T04-02-45.165972.parquet' - 
'**/details_harness|hendrycksTest-management|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-16T04-02-45.165972.parquet' - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-16T04-02-45.165972.parquet' - 
'**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-16T04-02-45.165972.parquet' - 
'**/details_harness|hendrycksTest-philosophy|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-16T04-02-45.165972.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-16T04-02-45.165972.parquet' - config_name: 
harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 
2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - 
'**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-16T04-02-45.165972.parquet' - config_name: 
harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 
data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-management|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-16T04-02-45.165972.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - 
split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-virology|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-16T04-02-45.165972.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|truthfulqa:mc|0_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-04-16T04-02-45.165972.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_04_16T04_02_45.165972 path: - '**/details_harness|winogrande|5_2024-04-16T04-02-45.165972.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-04-16T04-02-45.165972.parquet' - config_name: results data_files: - split: 
2024_04_16T04_02_45.165972 path: - results_2024-04-16T04-02-45.165972.parquet - split: latest path: - results_2024-04-16T04-02-45.165972.parquet ---

# Dataset Card for Evaluation run of tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj](https://huggingface.co/tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_tricktreat__Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj",
    "harness_winogrande_5",
    split="train")
```

## Latest results

These are the [latest results from run 2024-04-16T04:02:45.165972](https://huggingface.co/datasets/open-llm-leaderboard/details_tricktreat__Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj/blob/main/results_2024-04-16T04-02-45.165972.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.47119617088878807, "acc_stderr": 0.03446747326117775, "acc_norm": 0.47711994927683626, "acc_norm_stderr": 0.035254004485849776, "mc1": 0.29008567931456547, "mc1_stderr": 0.01588623687420952, "mc2": 0.4315330091327328, "mc2_stderr": 0.015209088465437729 }, "harness|arc:challenge|25": { "acc": 0.48208191126279865, "acc_stderr": 0.014602005585490978, "acc_norm": 0.514505119453925, "acc_norm_stderr": 0.014605241081370056 }, "harness|hellaswag|10": { "acc": 0.5862378012348137, "acc_stderr": 0.00491500349951783, "acc_norm": 0.7698665604461262, "acc_norm_stderr": 0.004200578535056529 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.37777777777777777, "acc_stderr": 0.04188307537595853, "acc_norm": 0.37777777777777777, "acc_norm_stderr": 0.04188307537595853 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.46710526315789475, "acc_stderr": 0.040601270352363966, "acc_norm": 0.46710526315789475, "acc_norm_stderr": 0.040601270352363966 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.45, "acc_stderr": 0.049999999999999996, "acc_norm": 0.45, "acc_norm_stderr": 0.049999999999999996 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.4490566037735849, "acc_stderr": 0.030612730713641092, "acc_norm": 0.4490566037735849, "acc_norm_stderr": 0.030612730713641092 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.5, "acc_stderr": 0.04181210050035455, "acc_norm": 0.5, "acc_norm_stderr": 0.04181210050035455 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.33, "acc_stderr": 0.04725815626252604, "acc_norm": 0.33, "acc_norm_stderr": 0.04725815626252604 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.38, "acc_stderr": 0.04878317312145633, "acc_norm": 0.38, "acc_norm_stderr": 0.04878317312145633 
}, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.28, "acc_stderr": 0.04512608598542128, "acc_norm": 0.28, "acc_norm_stderr": 0.04512608598542128 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.4161849710982659, "acc_stderr": 0.03758517775404947, "acc_norm": 0.4161849710982659, "acc_norm_stderr": 0.03758517775404947 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.27450980392156865, "acc_stderr": 0.04440521906179327, "acc_norm": 0.27450980392156865, "acc_norm_stderr": 0.04440521906179327 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.52, "acc_stderr": 0.050211673156867795, "acc_norm": 0.52, "acc_norm_stderr": 0.050211673156867795 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.43829787234042555, "acc_stderr": 0.03243618636108101, "acc_norm": 0.43829787234042555, "acc_norm_stderr": 0.03243618636108101 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.30701754385964913, "acc_stderr": 0.043391383225798615, "acc_norm": 0.30701754385964913, "acc_norm_stderr": 0.043391383225798615 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.4206896551724138, "acc_stderr": 0.0411391498118926, "acc_norm": 0.4206896551724138, "acc_norm_stderr": 0.0411391498118926 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.28835978835978837, "acc_stderr": 0.023330654054535892, "acc_norm": 0.28835978835978837, "acc_norm_stderr": 0.023330654054535892 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.23809523809523808, "acc_stderr": 0.03809523809523812, "acc_norm": 0.23809523809523808, "acc_norm_stderr": 0.03809523809523812 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.35, "acc_stderr": 0.047937248544110196, "acc_norm": 0.35, "acc_norm_stderr": 0.047937248544110196 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.5225806451612903, "acc_stderr": 0.028414985019707868, "acc_norm": 0.5225806451612903, "acc_norm_stderr": 0.028414985019707868 }, "harness|hendrycksTest-high_school_chemistry|5": { 
"acc": 0.3793103448275862, "acc_stderr": 0.034139638059062345, "acc_norm": 0.3793103448275862, "acc_norm_stderr": 0.034139638059062345 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.44, "acc_stderr": 0.049888765156985884, "acc_norm": 0.44, "acc_norm_stderr": 0.049888765156985884 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.5757575757575758, "acc_stderr": 0.03859268142070264, "acc_norm": 0.5757575757575758, "acc_norm_stderr": 0.03859268142070264 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.5808080808080808, "acc_stderr": 0.03515520728670417, "acc_norm": 0.5808080808080808, "acc_norm_stderr": 0.03515520728670417 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.7150259067357513, "acc_stderr": 0.032577140777096614, "acc_norm": 0.7150259067357513, "acc_norm_stderr": 0.032577140777096614 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.4641025641025641, "acc_stderr": 0.025285585990017838, "acc_norm": 0.4641025641025641, "acc_norm_stderr": 0.025285585990017838 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.26666666666666666, "acc_stderr": 0.026962424325073828, "acc_norm": 0.26666666666666666, "acc_norm_stderr": 0.026962424325073828 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.41596638655462187, "acc_stderr": 0.03201650100739615, "acc_norm": 0.41596638655462187, "acc_norm_stderr": 0.03201650100739615 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.36423841059602646, "acc_stderr": 0.03929111781242741, "acc_norm": 0.36423841059602646, "acc_norm_stderr": 0.03929111781242741 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.655045871559633, "acc_stderr": 0.020380605405066952, "acc_norm": 0.655045871559633, "acc_norm_stderr": 0.020380605405066952 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.36574074074074076, "acc_stderr": 0.03284738857647207, "acc_norm": 0.36574074074074076, 
"acc_norm_stderr": 0.03284738857647207 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.5882352941176471, "acc_stderr": 0.03454236585380609, "acc_norm": 0.5882352941176471, "acc_norm_stderr": 0.03454236585380609 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.5907172995780591, "acc_stderr": 0.03200704183359592, "acc_norm": 0.5907172995780591, "acc_norm_stderr": 0.03200704183359592 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.5381165919282511, "acc_stderr": 0.033460150119732274, "acc_norm": 0.5381165919282511, "acc_norm_stderr": 0.033460150119732274 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.5648854961832062, "acc_stderr": 0.04348208051644858, "acc_norm": 0.5648854961832062, "acc_norm_stderr": 0.04348208051644858 }, "harness|hendrycksTest-international_law|5": { "acc": 0.6528925619834711, "acc_stderr": 0.043457245702925335, "acc_norm": 0.6528925619834711, "acc_norm_stderr": 0.043457245702925335 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.5462962962962963, "acc_stderr": 0.04812917324536823, "acc_norm": 0.5462962962962963, "acc_norm_stderr": 0.04812917324536823 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.5460122699386503, "acc_stderr": 0.0391170190467718, "acc_norm": 0.5460122699386503, "acc_norm_stderr": 0.0391170190467718 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.30357142857142855, "acc_stderr": 0.04364226155841044, "acc_norm": 0.30357142857142855, "acc_norm_stderr": 0.04364226155841044 }, "harness|hendrycksTest-management|5": { "acc": 0.6990291262135923, "acc_stderr": 0.045416094465039476, "acc_norm": 0.6990291262135923, "acc_norm_stderr": 0.045416094465039476 }, "harness|hendrycksTest-marketing|5": { "acc": 0.6196581196581197, "acc_stderr": 0.03180425204384099, "acc_norm": 0.6196581196581197, "acc_norm_stderr": 0.03180425204384099 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.54, "acc_stderr": 0.05009082659620332, "acc_norm": 0.54, "acc_norm_stderr": 
0.05009082659620332 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.6538952745849298, "acc_stderr": 0.017011965266412073, "acc_norm": 0.6538952745849298, "acc_norm_stderr": 0.017011965266412073 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.5404624277456648, "acc_stderr": 0.026830805998952257, "acc_norm": 0.5404624277456648, "acc_norm_stderr": 0.026830805998952257 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.3340782122905028, "acc_stderr": 0.01577491142238163, "acc_norm": 0.3340782122905028, "acc_norm_stderr": 0.01577491142238163 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.5424836601307189, "acc_stderr": 0.02852638345214263, "acc_norm": 0.5424836601307189, "acc_norm_stderr": 0.02852638345214263 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.5080385852090032, "acc_stderr": 0.028394421370984545, "acc_norm": 0.5080385852090032, "acc_norm_stderr": 0.028394421370984545 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.5339506172839507, "acc_stderr": 0.027756535257347663, "acc_norm": 0.5339506172839507, "acc_norm_stderr": 0.027756535257347663 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.29432624113475175, "acc_stderr": 0.027187127011503793, "acc_norm": 0.29432624113475175, "acc_norm_stderr": 0.027187127011503793 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.333116036505867, "acc_stderr": 0.012037930451512052, "acc_norm": 0.333116036505867, "acc_norm_stderr": 0.012037930451512052 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.48161764705882354, "acc_stderr": 0.03035230339535196, "acc_norm": 0.48161764705882354, "acc_norm_stderr": 0.03035230339535196 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.4117647058823529, "acc_stderr": 0.019910377463105935, "acc_norm": 0.4117647058823529, "acc_norm_stderr": 0.019910377463105935 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.5272727272727272, "acc_stderr": 0.04782001791380061, "acc_norm": 0.5272727272727272, 
"acc_norm_stderr": 0.04782001791380061 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.5224489795918368, "acc_stderr": 0.031976941187136725, "acc_norm": 0.5224489795918368, "acc_norm_stderr": 0.031976941187136725 }, "harness|hendrycksTest-sociology|5": { "acc": 0.6368159203980099, "acc_stderr": 0.034005985055990146, "acc_norm": 0.6368159203980099, "acc_norm_stderr": 0.034005985055990146 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.64, "acc_stderr": 0.04824181513244218, "acc_norm": 0.64, "acc_norm_stderr": 0.04824181513244218 }, "harness|hendrycksTest-virology|5": { "acc": 0.4819277108433735, "acc_stderr": 0.038899512528272166, "acc_norm": 0.4819277108433735, "acc_norm_stderr": 0.038899512528272166 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.7017543859649122, "acc_stderr": 0.03508771929824562, "acc_norm": 0.7017543859649122, "acc_norm_stderr": 0.03508771929824562 }, "harness|truthfulqa:mc|0": { "mc1": 0.29008567931456547, "mc1_stderr": 0.01588623687420952, "mc2": 0.4315330091327328, "mc2_stderr": 0.015209088465437729 }, "harness|winogrande|5": { "acc": 0.6921862667719021, "acc_stderr": 0.01297294666120502 }, "harness|gsm8k|5": { "acc": 0.11675511751326763, "acc_stderr": 0.008845468136919098 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
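Provided by:
open-llm-leaderboard
Summary of original information

Dataset overview

Dataset name

  • pretty_name: Evaluation run of tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj

Dataset source

Dataset composition

  • Number of configurations: 63
  • Task per configuration: one evaluation task
  • Created from: 1 run
  • Dataset splits: each configuration has a split named after the run timestamp; the "train" split points to the latest results.
  • Additional configuration: "results", which stores the aggregated results of all runs and is used to compute and display the aggregated metrics.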

Loading example

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_tricktreat__Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj",
    "harness_winogrande_5",
    split="train")
```

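Latest results

  • The latest results come from run 2024-04-16T04:02:45.165972
  • They cover multiple evaluation tasks; each task can be found in "results" and in the corresponding "latest" split.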
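
The run timestamp is spelled three ways in this card: with colons in the results link, with underscores in the split name, and with hyphens in the parquet and JSON file names. The sketch below reproduces those spellings. The conversion rules are inferred only from the names visible in this card, and the helper names `split_name` and `file_stamp` are hypothetical.

```python
RUN_TIMESTAMP = "2024-04-16T04:02:45.165972"  # as written in the results link


def split_name(timestamp: str) -> str:
    """Split name: '-' and ':' become '_' (rule inferred from this card)."""
    return timestamp.replace("-", "_").replace(":", "_")


def file_stamp(timestamp: str) -> str:
    """File-name stamp: ':' becomes '-' (rule inferred from this card)."""
    return timestamp.replace(":", "-")


print(split_name(RUN_TIMESTAMP))
print("results_" + file_stamp(RUN_TIMESTAMP) + ".parquet")
```

The two printed strings can be compared against the `split:` and `path:` entries in the configuration metadata.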

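Detailed dataset configuration

Configuration list

  • harness_arc_challenge_25
  • harness_gsm8k_5
  • harness_hellaswag_10
  • harness_hendrycksTest_5
    • Covers multiple subtasks, such as abstract_algebra, anatomy, and astronomy; each subtask is also exposed as its own configuration (for example harness_hendrycksTest_college_physics_5).

Each configuration contains a timestamp-based split and a "latest" split that points to the latest results.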
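
The configuration names appear to follow from the task keys used in the results JSON: every character that is not a letter or digit becomes an underscore. The sketch below applies that rule. It is inferred from the pairs visible in this card (for example `harness|truthfulqa:mc|0` and `harness_truthfulqa_mc_0`), and the function name `config_name` is hypothetical.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key to a configuration name.

    Assumption: any character outside [0-9A-Za-z] is replaced by '_',
    which matches the key/config pairs shown in this card.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


for key in (
    "harness|arc:challenge|25",
    "harness|hendrycksTest-college_physics|5",
    "harness|truthfulqa:mc|0",
    "harness|winogrande|5",
):
    print(key, "->", config_name(key))
```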

Collected summary
Dataset introduction
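Construction
In large language model evaluation, the Open LLM Leaderboard provides a standardized evaluation framework. This dataset was generated automatically while evaluating the model tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj. It consists of 63 configurations, each corresponding to one evaluated task. The configurations come from a single complete evaluation run; the results of each run are stored as a separate split named after the run's timestamp. An additional configuration named "results" collects all aggregated results and is used to compute and display the overall metrics on the Leaderboard.
Usage
Users can load the dataset with the Hugging Face Datasets library. For example, calling `load_dataset` with the dataset name and a task's configuration name, such as `harness_winogrande_5`, returns the detailed evaluation results for that task. Passing `split="train"` gives direct access to the most recent evaluation data. Users can also specify a particular timestamp split to go back to earlier evaluation results, which makes longitudinal comparison of model performance easier.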
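
A minimal loading sketch follows. The configuration metadata in this card declares two splits per configuration, one named after the run timestamp and one named `latest`, so the sketch defaults to `latest`; whether `train` also resolves is not verified here. The helper names `load_args` and `load_details` are hypothetical, and `datasets` is imported lazily because the call needs the library installed and network access.

```python
REPO_ID = (
    "open-llm-leaderboard/"
    "details_tricktreat__Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj"
)


def load_args(config: str, split: str = "latest") -> dict:
    """Build the keyword arguments for datasets.load_dataset."""
    return {"path": REPO_ID, "name": config, "split": split}


def load_details(config: str, split: str = "latest"):
    """Download one configuration (requires `datasets` and network access)."""
    from datasets import load_dataset  # lazy import: only needed when called

    return load_dataset(**load_args(config, split))


# Inspect the arguments without downloading anything.
print(load_args("harness_winogrande_5"))
print(load_args("results", split="2024_04_16T04_02_45.165972"))
```

Calling `load_details("harness_winogrande_5")` would then fetch the per-example details for Winogrande, and `load_details("results")` the aggregated metrics.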
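Background and challenges
Background overview
In the field of large language model (LLM) evaluation, the Open LLM Leaderboard was launched by the Hugging Face team in 2023 to give the community a standardized, transparent platform for model evaluation. This dataset records the evaluation results of the tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj model from April 16, 2024. It covers 63 configurations, including ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K, and measures the model's commonsense reasoning, knowledge understanding, factual consistency, and mathematical ability. With fine-grained metrics (such as acc_norm and mc2) and their standard errors, the dataset provides empirical evidence for studying how parameter freezing and embedding fine-tuning strategies affect the Llama-2-7B model, and it supports reproducibility research on efficient fine-tuning methods for conversational LLMs.
Current challenges
The challenges facing this dataset fall into two levels. First, at the domain level, LLM evaluation must unify metrics across heterogeneous tasks: mathematical reasoning on GSM8K (acc of only 0.117) versus pronoun resolution on Winogrande (acc 0.692) shows severely uneven model capability, which calls for more robust cross-task generalization benchmarks. Second, at the construction level, the automatically generated dataset relies on the results of a single timestamped run and lacks repeated runs to guard against randomness (for example, acc on MMLU formal_logic is only 0.238). In addition, the fixed number of few-shot examples per task (such as 5-shot) may not fully reflect the model's true limits in few-shot settings, which limits the leaderboard's authority as a fair comparison tool.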
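
The point about randomness can be made concrete with the reported standard errors. The sketch below builds approximate 95% intervals as acc ± 1.96 × acc_stderr (a normal approximation, which is an assumption) and checks whether the formal_logic score can be told apart from the 25% chance level of a four-option multiple-choice task. The helper names `ci95` and `covers` are hypothetical; the numbers are copied from the results above.

```python
def ci95(acc: float, stderr: float) -> tuple:
    """Approximate 95% interval under a normal approximation."""
    half_width = 1.96 * stderr
    return (acc - half_width, acc + half_width)


def covers(interval: tuple, value: float) -> bool:
    """True if value lies inside the closed interval."""
    return interval[0] <= value <= interval[1]


formal_logic = ci95(0.23809523809523808, 0.03809523809523812)
gsm8k = ci95(0.11675511751326763, 0.008845468136919098)
winogrande = ci95(0.6921862667719021, 0.01297294666120502)

CHANCE_4WAY = 0.25  # MMLU questions offer four answer options

print("formal_logic interval:", formal_logic)
print("contains chance level:", covers(formal_logic, CHANCE_4WAY))
print("gsm8k and winogrande intervals overlap:", gsm8k[1] >= winogrande[0])
```

Under these assumptions the formal_logic interval includes 0.25, so that score alone does not separate the model from guessing, while the GSM8K and Winogrande intervals are far apart.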
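Common scenarios
Classic use cases
Amid the rapid development of large language models, systematic performance evaluation has become a key part of technical progress. The evaluation datasets on the Open LLM Leaderboard provide a standardized platform for this, and this dataset records detailed evaluation results for the tricktreat/Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj model across 63 configurations. In its classic use, researchers reproduce the model's performance on benchmarks such as ARC-Challenge, HellaSwag, the multi-subject MMLU, TruthfulQA, Winogrande, and GSM8K, and thereby quantify its ability in reasoning, commonsense understanding, factuality, and mathematical problem solving.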
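
When working with the results JSON, a common first step is to group task keys by benchmark and average a metric. The sketch below does this for a small excerpt copied from the results above. It computes a plain unweighted mean over three MMLU subtasks only, so the printed value illustrates the mechanics and is not the leaderboard's MMLU score; the function name `group_mean` is hypothetical.

```python
from statistics import mean

# Excerpt of the results JSON (values copied from this card).
results_excerpt = {
    "harness|arc:challenge|25": {"acc": 0.48208191126279865, "acc_norm": 0.514505119453925},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.3},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.37777777777777777},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.46710526315789475},
    "harness|gsm8k|5": {"acc": 0.11675511751326763},
}


def group_mean(results: dict, prefix: str, metric: str = "acc") -> float:
    """Unweighted mean of `metric` over task keys starting with `prefix`."""
    values = [
        metrics[metric]
        for key, metrics in results.items()
        if key.startswith(prefix) and metric in metrics
    ]
    return mean(values)


mmlu_excerpt_mean = group_mean(results_excerpt, "harness|hendrycksTest-")
print(round(mmlu_excerpt_mean, 4))
```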
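Academic problems addressed
The dataset addresses a core challenge in current research: how to measure, objectively and reproducibly, the actual effect of different fine-tuning strategies on LLM performance. With fine-grained task-level accuracy and standard errors, researchers can analyze how optimizations such as parameter freezing and embedding-token projection perform in specific domains (such as formal logic, medical genetics, and advanced mathematics). This gives an empirical basis for studying the tension between generalization and domain adaptation, and it informs theoretical discussion of the relationship between model architecture and training paradigm.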
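
The reported standard errors can be sanity-checked against the accuracies. The sketch below uses the standard error of a mean of 0/1 scores with the sample (n − 1) variance, sqrt(acc × (1 − acc) / (n − 1)). The question counts (100 for abstract_algebra, 126 for formal_logic) are assumptions that are not stated in this card, and the check is an observation about these numbers rather than a description of how the evaluation harness computes them.

```python
import math


def sample_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary scores, using n - 1."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# (accuracy, reported acc_stderr, assumed number of questions)
checks = {
    "abstract_algebra": (0.3, 0.046056618647183814, 100),
    "formal_logic": (0.23809523809523808, 0.03809523809523812, 126),
}

for task, (acc, reported, n) in checks.items():
    print(task, sample_stderr(acc, n), reported)
```

With these assumed counts the recomputed values agree with the reported `acc_stderr` entries to many decimal places, which also shows how strongly the uncertainty depends on the number of questions in a subtask.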
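Practical applications
In practice, the dataset gives industry a basis for deployment decisions. For example, when a company integrates the model into a customer service system or an educational tutoring tool, it can consult the GSM8K math reasoning score (11.7%) to judge how reliably the model handles quantitative problems, or the TruthfulQA MC2 score (43.2%) to judge the accuracy of its factual statements. These metrics directly guide screening for suitability in high-risk settings such as financial analysis and legal consulting.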
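Recent research on the dataset
Latest research directions
At the frontier of large language model (LLM) evaluation, the Open LLM Leaderboard has become an important benchmark platform for measuring overall model capability. The evaluation data for the Llama-2-7b-chat-hf-guanaco-freeze-embed-tokens-q-v-proj model reflects current interest in fine-grained analysis of commonsense reasoning, knowledge question answering, and mathematical problem solving. Through standardized evaluation over 63 configurations, including ARC, HellaSwag, GSM8K, and MMLU, the dataset shows how the model's performance differs across complex reasoning and domain knowledge transfer. The limitations it exposes in advanced mathematics and formal logic tasks in particular provide empirical evidence for later work on instruction tuning and parameter-freezing strategies. This direction matches the community's recent demand for lightweight, reusable evaluation systems. Its significance lies in moving LLM evaluation from single metrics toward a structured, multi-task overview, which gives a data-driven basis for improving model interpretability and robustness.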
The content above was collected and summarized by 遇见数据集.