
open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7

Hugging Face | Updated 2024-04-23 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7
Resource description:
--- pretty_name: Evaluation run of BFauber/opt125m_10e5_lr2e-7 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [BFauber/opt125m_10e5_lr2e-7](https://huggingface.co/BFauber/opt125m_10e5_lr2e-7)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-04-23T19:20:56.142177](https://huggingface.co/datasets/open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7/blob/main/results_2024-04-23T19-20-56.142177.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.26889222255083106,\n\ \ \"acc_stderr\": 0.031147433258079334,\n \"acc_norm\": 0.27007906116095587,\n\ \ \"acc_norm_stderr\": 0.03194963875396083,\n \"mc1\": 0.23133414932680538,\n\ \ \"mc1_stderr\": 0.014761945174862671,\n \"mc2\": 0.43041471274187126,\n\ \ \"mc2_stderr\": 0.01495839156133983\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.20819112627986347,\n \"acc_stderr\": 0.011864866118448064,\n\ \ \"acc_norm\": 0.23378839590443687,\n \"acc_norm_stderr\": 0.012368225378507123\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.29028082055367455,\n\ \ \"acc_stderr\": 0.004529642828546409,\n \"acc_norm\": 0.3121888070105557,\n\ \ \"acc_norm_stderr\": 0.004624393690966894\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.26,\n \"acc_stderr\": 0.04408440022768078,\n \ \ \"acc_norm\": 0.26,\n \"acc_norm_stderr\": 0.04408440022768078\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.22962962962962963,\n\ \ \"acc_stderr\": 0.03633384414073461,\n \"acc_norm\": 0.22962962962962963,\n\ \ \"acc_norm_stderr\": 0.03633384414073461\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.29605263157894735,\n \"acc_stderr\": 0.03715062154998905,\n\ \ \"acc_norm\": 0.29605263157894735,\n \"acc_norm_stderr\": 0.03715062154998905\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.21,\n\ \ \"acc_stderr\": 0.040936018074033256,\n \"acc_norm\": 0.21,\n \ \ \"acc_norm_stderr\": 0.040936018074033256\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.28679245283018867,\n \"acc_stderr\": 0.027834912527544078,\n\ \ \"acc_norm\": 0.28679245283018867,\n \"acc_norm_stderr\": 0.027834912527544078\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.22916666666666666,\n\ \ \"acc_stderr\": 0.03514697467862388,\n \"acc_norm\": 0.22916666666666666,\n\ \ \"acc_norm_stderr\": 
0.03514697467862388\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.37,\n \"acc_stderr\": 0.04852365870939099,\n \ \ \"acc_norm\": 0.37,\n \"acc_norm_stderr\": 0.04852365870939099\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.33,\n \"acc_stderr\": 0.04725815626252604,\n \"acc_norm\": 0.33,\n\ \ \"acc_norm_stderr\": 0.04725815626252604\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.31213872832369943,\n\ \ \"acc_stderr\": 0.03533133389323657,\n \"acc_norm\": 0.31213872832369943,\n\ \ \"acc_norm_stderr\": 0.03533133389323657\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.37254901960784315,\n \"acc_stderr\": 0.04810840148082633,\n\ \ \"acc_norm\": 0.37254901960784315,\n \"acc_norm_stderr\": 0.04810840148082633\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.18,\n \"acc_stderr\": 0.038612291966536955,\n \"acc_norm\": 0.18,\n\ \ \"acc_norm_stderr\": 0.038612291966536955\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.2936170212765957,\n \"acc_stderr\": 0.029771642712491227,\n\ \ \"acc_norm\": 0.2936170212765957,\n \"acc_norm_stderr\": 0.029771642712491227\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.2543859649122807,\n\ \ \"acc_stderr\": 0.04096985139843671,\n \"acc_norm\": 0.2543859649122807,\n\ \ \"acc_norm_stderr\": 0.04096985139843671\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.296551724137931,\n \"acc_stderr\": 0.03806142687309994,\n\ \ \"acc_norm\": 0.296551724137931,\n \"acc_norm_stderr\": 0.03806142687309994\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.2566137566137566,\n \"acc_stderr\": 0.022494510767503154,\n \"\ acc_norm\": 
0.2566137566137566,\n \"acc_norm_stderr\": 0.022494510767503154\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.2777777777777778,\n\ \ \"acc_stderr\": 0.040061680838488795,\n \"acc_norm\": 0.2777777777777778,\n\ \ \"acc_norm_stderr\": 0.040061680838488795\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.19,\n \"acc_stderr\": 0.03942772444036624,\n \ \ \"acc_norm\": 0.19,\n \"acc_norm_stderr\": 0.03942772444036624\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.3161290322580645,\n\ \ \"acc_stderr\": 0.02645087448904277,\n \"acc_norm\": 0.3161290322580645,\n\ \ \"acc_norm_stderr\": 0.02645087448904277\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.2955665024630542,\n \"acc_stderr\": 0.032104944337514575,\n\ \ \"acc_norm\": 0.2955665024630542,\n \"acc_norm_stderr\": 0.032104944337514575\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.19,\n \"acc_stderr\": 0.039427724440366234,\n \"acc_norm\"\ : 0.19,\n \"acc_norm_stderr\": 0.039427724440366234\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.21818181818181817,\n \"acc_stderr\": 0.03225078108306289,\n\ \ \"acc_norm\": 0.21818181818181817,\n \"acc_norm_stderr\": 0.03225078108306289\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.30808080808080807,\n \"acc_stderr\": 0.032894773300986155,\n \"\ acc_norm\": 0.30808080808080807,\n \"acc_norm_stderr\": 0.032894773300986155\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.36787564766839376,\n \"acc_stderr\": 0.03480175668466036,\n\ \ \"acc_norm\": 0.36787564766839376,\n \"acc_norm_stderr\": 0.03480175668466036\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.36153846153846153,\n \"acc_stderr\": 0.024359581465396983,\n\ \ \"acc_norm\": 0.36153846153846153,\n \"acc_norm_stderr\": 0.024359581465396983\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.26296296296296295,\n \"acc_stderr\": 0.026842057873833706,\n \ \ \"acc_norm\": 0.26296296296296295,\n \"acc_norm_stderr\": 0.026842057873833706\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.3445378151260504,\n \"acc_stderr\": 0.030868682604121633,\n\ \ \"acc_norm\": 0.3445378151260504,\n \"acc_norm_stderr\": 0.030868682604121633\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.33112582781456956,\n \"acc_stderr\": 0.038425817186598696,\n \"\ acc_norm\": 0.33112582781456956,\n \"acc_norm_stderr\": 0.038425817186598696\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.3155963302752294,\n \"acc_stderr\": 0.019926117513869666,\n \"\ acc_norm\": 0.3155963302752294,\n \"acc_norm_stderr\": 0.019926117513869666\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.4722222222222222,\n \"acc_stderr\": 0.0340470532865388,\n \"acc_norm\"\ : 0.4722222222222222,\n \"acc_norm_stderr\": 0.0340470532865388\n },\n\ \ \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\": 0.24019607843137256,\n\ \ \"acc_stderr\": 0.02998373305591361,\n \"acc_norm\": 0.24019607843137256,\n\ \ \"acc_norm_stderr\": 0.02998373305591361\n },\n \"harness|hendrycksTest-high_school_world_history|5\"\ : {\n \"acc\": 0.22362869198312235,\n \"acc_stderr\": 0.027123298205229972,\n\ \ \"acc_norm\": 0.22362869198312235,\n \"acc_norm_stderr\": 0.027123298205229972\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.11659192825112108,\n\ \ \"acc_stderr\": 0.02153963981624447,\n \"acc_norm\": 0.11659192825112108,\n\ \ \"acc_norm_stderr\": 0.02153963981624447\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.2595419847328244,\n \"acc_stderr\": 0.03844876139785271,\n\ \ \"acc_norm\": 0.2595419847328244,\n \"acc_norm_stderr\": 0.03844876139785271\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.371900826446281,\n \"acc_stderr\": 0.04412015806624505,\n \"acc_norm\"\ : 0.371900826446281,\n \"acc_norm_stderr\": 0.04412015806624505\n },\n\ \ \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.21296296296296297,\n\ \ \"acc_stderr\": 0.0395783547198098,\n \"acc_norm\": 0.21296296296296297,\n\ \ \"acc_norm_stderr\": 0.0395783547198098\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.2331288343558282,\n \"acc_stderr\": 0.033220157957767414,\n\ \ \"acc_norm\": 0.2331288343558282,\n \"acc_norm_stderr\": 0.033220157957767414\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.16964285714285715,\n\ \ \"acc_stderr\": 0.0356236785009539,\n \"acc_norm\": 0.16964285714285715,\n\ \ \"acc_norm_stderr\": 0.0356236785009539\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.36893203883495146,\n \"acc_stderr\": 0.047776151811567386,\n\ \ \"acc_norm\": 0.36893203883495146,\n \"acc_norm_stderr\": 0.047776151811567386\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.19658119658119658,\n\ \ \"acc_stderr\": 0.02603538609895129,\n \"acc_norm\": 0.19658119658119658,\n\ \ \"acc_norm_stderr\": 0.02603538609895129\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.23,\n \"acc_stderr\": 0.04229525846816507,\n \ \ \"acc_norm\": 0.23,\n \"acc_norm_stderr\": 0.04229525846816507\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.20561941251596424,\n\ \ \"acc_stderr\": 0.014452500456785825,\n \"acc_norm\": 0.20561941251596424,\n\ \ \"acc_norm_stderr\": 0.014452500456785825\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.2254335260115607,\n \"acc_stderr\": 0.022497230190967547,\n\ \ \"acc_norm\": 0.2254335260115607,\n \"acc_norm_stderr\": 0.022497230190967547\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.27262569832402234,\n\ \ \"acc_stderr\": 0.014893391735249588,\n 
\"acc_norm\": 0.27262569832402234,\n\ \ \"acc_norm_stderr\": 0.014893391735249588\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.2908496732026144,\n \"acc_stderr\": 0.026004800363952113,\n\ \ \"acc_norm\": 0.2908496732026144,\n \"acc_norm_stderr\": 0.026004800363952113\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.24115755627009647,\n\ \ \"acc_stderr\": 0.024296594034763426,\n \"acc_norm\": 0.24115755627009647,\n\ \ \"acc_norm_stderr\": 0.024296594034763426\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.2222222222222222,\n \"acc_stderr\": 0.023132376234543343,\n\ \ \"acc_norm\": 0.2222222222222222,\n \"acc_norm_stderr\": 0.023132376234543343\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.2872340425531915,\n \"acc_stderr\": 0.026992199173064356,\n \ \ \"acc_norm\": 0.2872340425531915,\n \"acc_norm_stderr\": 0.026992199173064356\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.24967405475880053,\n\ \ \"acc_stderr\": 0.011054538377832329,\n \"acc_norm\": 0.24967405475880053,\n\ \ \"acc_norm_stderr\": 0.011054538377832329\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.4485294117647059,\n \"acc_stderr\": 0.030211479609121593,\n\ \ \"acc_norm\": 0.4485294117647059,\n \"acc_norm_stderr\": 0.030211479609121593\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.21568627450980393,\n \"acc_stderr\": 0.01663931935031326,\n \ \ \"acc_norm\": 0.21568627450980393,\n \"acc_norm_stderr\": 0.01663931935031326\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.22727272727272727,\n\ \ \"acc_stderr\": 0.04013964554072774,\n \"acc_norm\": 0.22727272727272727,\n\ \ \"acc_norm_stderr\": 0.04013964554072774\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.2571428571428571,\n \"acc_stderr\": 0.027979823538744546,\n\ \ \"acc_norm\": 0.2571428571428571,\n \"acc_norm_stderr\": 
0.027979823538744546\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.23880597014925373,\n\ \ \"acc_stderr\": 0.030147775935409224,\n \"acc_norm\": 0.23880597014925373,\n\ \ \"acc_norm_stderr\": 0.030147775935409224\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.27,\n \"acc_stderr\": 0.04461960433384739,\n \ \ \"acc_norm\": 0.27,\n \"acc_norm_stderr\": 0.04461960433384739\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.19879518072289157,\n\ \ \"acc_stderr\": 0.03106939026078943,\n \"acc_norm\": 0.19879518072289157,\n\ \ \"acc_norm_stderr\": 0.03106939026078943\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.17543859649122806,\n \"acc_stderr\": 0.029170885500727654,\n\ \ \"acc_norm\": 0.17543859649122806,\n \"acc_norm_stderr\": 0.029170885500727654\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.23133414932680538,\n\ \ \"mc1_stderr\": 0.014761945174862671,\n \"mc2\": 0.43041471274187126,\n\ \ \"mc2_stderr\": 0.01495839156133983\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.5122336227308603,\n \"acc_stderr\": 0.01404827882040562\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.003032600454890068,\n \ \ \"acc_stderr\": 0.0015145735612245494\n }\n}\n```" repo_url: https://huggingface.co/BFauber/opt125m_10e5_lr2e-7 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|arc:challenge|25_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-04-23T19-20-56.142177.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|gsm8k|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hellaswag_10 data_files: - 
split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hellaswag|10_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-23T19-20-56.142177.parquet' - 
'**/details_harness|hendrycksTest-high_school_biology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-23T19-20-56.142177.parquet' - 
'**/details_harness|hendrycksTest-management|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-23T19-20-56.142177.parquet' - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-23T19-20-56.142177.parquet' - 
'**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-23T19-20-56.142177.parquet' - 
'**/details_harness|hendrycksTest-philosophy|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-23T19-20-56.142177.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-23T19-20-56.142177.parquet' - config_name: 
harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 
2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - 
'**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-23T19-20-56.142177.parquet' - config_name: 
harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 
data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-management|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-23T19-20-56.142177.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - 
split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-virology|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-23T19-20-56.142177.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|truthfulqa:mc|0_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-04-23T19-20-56.142177.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_04_23T19_20_56.142177 path: - '**/details_harness|winogrande|5_2024-04-23T19-20-56.142177.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-04-23T19-20-56.142177.parquet' - config_name: results data_files: - split: 
2024_04_23T19_20_56.142177 path: - results_2024-04-23T19-20-56.142177.parquet - split: latest path: - results_2024-04-23T19-20-56.142177.parquet ---

# Dataset Card for Evaluation run of BFauber/opt125m_10e5_lr2e-7

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [BFauber/opt125m_10e5_lr2e-7](https://huggingface.co/BFauber/opt125m_10e5_lr2e-7) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# Load the per-sample details of the Winogrande (5-shot) configuration.
data = load_dataset(
    "open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7",
    "harness_winogrande_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2024-04-23T19:20:56.142177](https://huggingface.co/datasets/open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7/blob/main/results_2024-04-23T19-20-56.142177.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.26889222255083106, "acc_stderr": 0.031147433258079334, "acc_norm": 0.27007906116095587, "acc_norm_stderr": 0.03194963875396083, "mc1": 0.23133414932680538, "mc1_stderr": 0.014761945174862671, "mc2": 0.43041471274187126, "mc2_stderr": 0.01495839156133983 }, "harness|arc:challenge|25": { "acc": 0.20819112627986347, "acc_stderr": 0.011864866118448064, "acc_norm": 0.23378839590443687, "acc_norm_stderr": 0.012368225378507123 }, "harness|hellaswag|10": { "acc": 0.29028082055367455, "acc_stderr": 0.004529642828546409, "acc_norm": 0.3121888070105557, "acc_norm_stderr": 0.004624393690966894 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.26, "acc_stderr": 0.04408440022768078, "acc_norm": 0.26, "acc_norm_stderr": 0.04408440022768078 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.22962962962962963, "acc_stderr": 0.03633384414073461, "acc_norm": 0.22962962962962963, "acc_norm_stderr": 0.03633384414073461 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.29605263157894735, "acc_stderr": 0.03715062154998905, "acc_norm": 0.29605263157894735, "acc_norm_stderr": 0.03715062154998905 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.21, "acc_stderr": 0.040936018074033256, "acc_norm": 0.21, "acc_norm_stderr": 0.040936018074033256 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.28679245283018867, "acc_stderr": 0.027834912527544078, "acc_norm": 0.28679245283018867, "acc_norm_stderr": 0.027834912527544078 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.22916666666666666, "acc_stderr": 0.03514697467862388, "acc_norm": 0.22916666666666666, "acc_norm_stderr": 0.03514697467862388 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.37, "acc_stderr": 0.04852365870939099, "acc_norm": 0.37, "acc_norm_stderr": 0.04852365870939099 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.33, "acc_stderr": 0.04725815626252604, "acc_norm": 0.33, 
"acc_norm_stderr": 0.04725815626252604 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.31213872832369943, "acc_stderr": 0.03533133389323657, "acc_norm": 0.31213872832369943, "acc_norm_stderr": 0.03533133389323657 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.37254901960784315, "acc_stderr": 0.04810840148082633, "acc_norm": 0.37254901960784315, "acc_norm_stderr": 0.04810840148082633 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.18, "acc_stderr": 0.038612291966536955, "acc_norm": 0.18, "acc_norm_stderr": 0.038612291966536955 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.2936170212765957, "acc_stderr": 0.029771642712491227, "acc_norm": 0.2936170212765957, "acc_norm_stderr": 0.029771642712491227 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.2543859649122807, "acc_stderr": 0.04096985139843671, "acc_norm": 0.2543859649122807, "acc_norm_stderr": 0.04096985139843671 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.296551724137931, "acc_stderr": 0.03806142687309994, "acc_norm": 0.296551724137931, "acc_norm_stderr": 0.03806142687309994 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.2566137566137566, "acc_stderr": 0.022494510767503154, "acc_norm": 0.2566137566137566, "acc_norm_stderr": 0.022494510767503154 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.2777777777777778, "acc_stderr": 0.040061680838488795, "acc_norm": 0.2777777777777778, "acc_norm_stderr": 0.040061680838488795 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.19, "acc_stderr": 0.03942772444036624, "acc_norm": 0.19, "acc_norm_stderr": 0.03942772444036624 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.3161290322580645, "acc_stderr": 0.02645087448904277, "acc_norm": 0.3161290322580645, "acc_norm_stderr": 0.02645087448904277 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.2955665024630542, "acc_stderr": 0.032104944337514575, "acc_norm": 0.2955665024630542, "acc_norm_stderr": 0.032104944337514575 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.19, "acc_stderr": 0.039427724440366234, "acc_norm": 0.19, "acc_norm_stderr": 0.039427724440366234 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.21818181818181817, "acc_stderr": 0.03225078108306289, "acc_norm": 0.21818181818181817, "acc_norm_stderr": 0.03225078108306289 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.30808080808080807, "acc_stderr": 0.032894773300986155, "acc_norm": 0.30808080808080807, "acc_norm_stderr": 0.032894773300986155 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.36787564766839376, "acc_stderr": 0.03480175668466036, "acc_norm": 0.36787564766839376, "acc_norm_stderr": 0.03480175668466036 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.36153846153846153, "acc_stderr": 0.024359581465396983, "acc_norm": 0.36153846153846153, "acc_norm_stderr": 0.024359581465396983 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.26296296296296295, "acc_stderr": 0.026842057873833706, "acc_norm": 0.26296296296296295, "acc_norm_stderr": 0.026842057873833706 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.3445378151260504, "acc_stderr": 0.030868682604121633, "acc_norm": 0.3445378151260504, "acc_norm_stderr": 0.030868682604121633 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.33112582781456956, "acc_stderr": 0.038425817186598696, "acc_norm": 0.33112582781456956, "acc_norm_stderr": 0.038425817186598696 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.3155963302752294, "acc_stderr": 0.019926117513869666, "acc_norm": 0.3155963302752294, "acc_norm_stderr": 0.019926117513869666 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4722222222222222, 
"acc_stderr": 0.0340470532865388, "acc_norm": 0.4722222222222222, "acc_norm_stderr": 0.0340470532865388 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.24019607843137256, "acc_stderr": 0.02998373305591361, "acc_norm": 0.24019607843137256, "acc_norm_stderr": 0.02998373305591361 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.22362869198312235, "acc_stderr": 0.027123298205229972, "acc_norm": 0.22362869198312235, "acc_norm_stderr": 0.027123298205229972 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.11659192825112108, "acc_stderr": 0.02153963981624447, "acc_norm": 0.11659192825112108, "acc_norm_stderr": 0.02153963981624447 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.2595419847328244, "acc_stderr": 0.03844876139785271, "acc_norm": 0.2595419847328244, "acc_norm_stderr": 0.03844876139785271 }, "harness|hendrycksTest-international_law|5": { "acc": 0.371900826446281, "acc_stderr": 0.04412015806624505, "acc_norm": 0.371900826446281, "acc_norm_stderr": 0.04412015806624505 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.21296296296296297, "acc_stderr": 0.0395783547198098, "acc_norm": 0.21296296296296297, "acc_norm_stderr": 0.0395783547198098 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.2331288343558282, "acc_stderr": 0.033220157957767414, "acc_norm": 0.2331288343558282, "acc_norm_stderr": 0.033220157957767414 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.16964285714285715, "acc_stderr": 0.0356236785009539, "acc_norm": 0.16964285714285715, "acc_norm_stderr": 0.0356236785009539 }, "harness|hendrycksTest-management|5": { "acc": 0.36893203883495146, "acc_stderr": 0.047776151811567386, "acc_norm": 0.36893203883495146, "acc_norm_stderr": 0.047776151811567386 }, "harness|hendrycksTest-marketing|5": { "acc": 0.19658119658119658, "acc_stderr": 0.02603538609895129, "acc_norm": 0.19658119658119658, "acc_norm_stderr": 0.02603538609895129 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.23, 
"acc_stderr": 0.04229525846816507, "acc_norm": 0.23, "acc_norm_stderr": 0.04229525846816507 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.20561941251596424, "acc_stderr": 0.014452500456785825, "acc_norm": 0.20561941251596424, "acc_norm_stderr": 0.014452500456785825 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.2254335260115607, "acc_stderr": 0.022497230190967547, "acc_norm": 0.2254335260115607, "acc_norm_stderr": 0.022497230190967547 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.27262569832402234, "acc_stderr": 0.014893391735249588, "acc_norm": 0.27262569832402234, "acc_norm_stderr": 0.014893391735249588 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.2908496732026144, "acc_stderr": 0.026004800363952113, "acc_norm": 0.2908496732026144, "acc_norm_stderr": 0.026004800363952113 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.24115755627009647, "acc_stderr": 0.024296594034763426, "acc_norm": 0.24115755627009647, "acc_norm_stderr": 0.024296594034763426 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.2222222222222222, "acc_stderr": 0.023132376234543343, "acc_norm": 0.2222222222222222, "acc_norm_stderr": 0.023132376234543343 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.2872340425531915, "acc_stderr": 0.026992199173064356, "acc_norm": 0.2872340425531915, "acc_norm_stderr": 0.026992199173064356 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.24967405475880053, "acc_stderr": 0.011054538377832329, "acc_norm": 0.24967405475880053, "acc_norm_stderr": 0.011054538377832329 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.4485294117647059, "acc_stderr": 0.030211479609121593, "acc_norm": 0.4485294117647059, "acc_norm_stderr": 0.030211479609121593 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.21568627450980393, "acc_stderr": 0.01663931935031326, "acc_norm": 0.21568627450980393, "acc_norm_stderr": 0.01663931935031326 }, "harness|hendrycksTest-public_relations|5": { "acc": 
0.22727272727272727, "acc_stderr": 0.04013964554072774, "acc_norm": 0.22727272727272727, "acc_norm_stderr": 0.04013964554072774 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.2571428571428571, "acc_stderr": 0.027979823538744546, "acc_norm": 0.2571428571428571, "acc_norm_stderr": 0.027979823538744546 }, "harness|hendrycksTest-sociology|5": { "acc": 0.23880597014925373, "acc_stderr": 0.030147775935409224, "acc_norm": 0.23880597014925373, "acc_norm_stderr": 0.030147775935409224 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.27, "acc_stderr": 0.04461960433384739, "acc_norm": 0.27, "acc_norm_stderr": 0.04461960433384739 }, "harness|hendrycksTest-virology|5": { "acc": 0.19879518072289157, "acc_stderr": 0.03106939026078943, "acc_norm": 0.19879518072289157, "acc_norm_stderr": 0.03106939026078943 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.17543859649122806, "acc_stderr": 0.029170885500727654, "acc_norm": 0.17543859649122806, "acc_norm_stderr": 0.029170885500727654 }, "harness|truthfulqa:mc|0": { "mc1": 0.23133414932680538, "mc1_stderr": 0.014761945174862671, "mc2": 0.43041471274187126, "mc2_stderr": 0.01495839156133983 }, "harness|winogrande|5": { "acc": 0.5122336227308603, "acc_stderr": 0.01404827882040562 }, "harness|gsm8k|5": { "acc": 0.003032600454890068, "acc_stderr": 0.0015145735612245494 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. 
-->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]

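This dataset was created automatically during the evaluation run of the model BFauber/opt125m_10e5_lr2e-7 on the Open LLM Leaderboard. It consists of 63 configurations, each corresponding to one evaluated task. It was created from 1 run; each run appears as a separate split in every configuration, and the split is named after the run's timestamp. The dataset also contains a configuration named "results", which stores the aggregated results of all runs and is used to display the aggregated metrics on the Open LLM Leaderboard.
Provider:
open-llm-leaderboard
Summary of the original information

Dataset overview

Dataset name

  • pretty_name: Evaluation run of BFauber/opt125m_10e5_lr2e-7

Dataset description

  • dataset_summary: This dataset was created automatically while evaluating the model BFauber/opt125m_10e5_lr2e-7 on the Open LLM Leaderboard.

Dataset composition

  • Contains 63 configurations, each corresponding to one evaluated task.
  • Created from 1 run; each run exists as a separate split in every configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregated metrics.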

Dataset loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7",
    "harness_winogrande_5",
    split="train",
)
```

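Latest results

  • The latest results come from the run at 2024-04-23T19:20:56.142177 and contain evaluation data for multiple tasks.

Dataset configuration details

Configuration list

  • harness_arc_challenge_25
  • harness_gsm8k_5
  • harness_hellaswag_10
  • harness_hendrycksTest_5
    • Comprises multiple subtasks, such as abstract_algebra, anatomy, and astronomy; each subtask has its own configuration (for example, harness_hendrycksTest_college_physics_5).

The data files of each configuration are organized into a timestamped split and a "latest" split, so the most recent evaluation data is always accessible.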
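
The naming pattern visible in the configuration metadata can be sketched in a few lines. This is an illustration inferred from the entries listed in the card, not the Leaderboard's own code; the helper names `config_name`, `split_name`, and `details_glob` are hypothetical.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|truthfulqa:mc|0' to a configuration name."""
    # Assumption inferred from the card: every character that is not a letter
    # or digit ('|', ':', '-') becomes an underscore.
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the timestamped split name used in the metadata."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def details_glob(task_key: str, run_timestamp: str) -> str:
    """Build the Parquet glob pattern listed under data_files for one task and run."""
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


run = "2024-04-23T19:20:56.142177"
print(config_name("harness|hendrycksTest-college_physics|5"))
print(split_name(run))
print(details_glob("harness|winogrande|5", run))
```

The three helpers reproduce the names that appear in the card for this run; other runs or future Leaderboard versions may follow a different convention.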

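Collected summary
Dataset introduction
Construction
In large language model evaluation, the Open LLM Leaderboard provides a standardized evaluation framework. This dataset was generated automatically to record the evaluation results of the model BFauber/opt125m_10e5_lr2e-7. It contains 63 configurations, each corresponding to one evaluation task, and was created from a single run. The results of each run are stored as a separate split named after the run's timestamp, while the 'train' split always points to the latest results. In addition, the 'results' configuration stores all aggregated metrics, which are used to compute and display overall performance on the Leaderboard.
Usage
The data can be loaded with the Hugging Face Datasets library. For example, load_dataset("open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7", "harness_winogrande_5", split="train") retrieves the latest evaluation details for the Winogrande task. The Parquet files of each configuration are organized by task and run time, so users can select a specific split or iterate over all configurations for a combined analysis. This interface supports secondary development and comparative studies based on the model's evaluation results.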
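
Selecting a split or iterating over all configurations might look like the sketch below. It assumes the `datasets` library is installed and the repository is reachable. The helper names are hypothetical, and the default split is an assumption: the card text mentions `train`, while the configuration metadata lists a timestamped split and `latest`, so the split is left as a parameter.

```python
REPO_ID = "open-llm-leaderboard/details_BFauber__opt125m_10e5_lr2e-7"


def load_args(config: str, split: str = "latest") -> dict:
    """Collect the keyword arguments passed to load_dataset for one configuration."""
    return {"path": REPO_ID, "name": config, "split": split}


def load_details(config: str, split: str = "latest"):
    """Load one configuration; needs network access to the Hugging Face Hub."""
    # Imported lazily so the helpers above can be used without the library.
    from datasets import load_dataset

    return load_dataset(**load_args(config, split))


def iter_all_details(split: str = "latest"):
    """Yield (config name, dataset) pairs for every configuration in the repository."""
    from datasets import get_dataset_config_names

    for config in get_dataset_config_names(REPO_ID):
        yield config, load_details(config, split)


# Example (not executed here because it downloads data):
# winogrande = load_details("harness_winogrande_5")
```

Note that `iter_all_details` also yields the aggregated "results" configuration, whose columns differ from those of the per-task detail configurations.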
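Background and challenges
Background overview
In the evaluation of large language model (LLM) capabilities, standardized benchmarks have become an important foundation for model development. The Open LLM Leaderboard was launched by the Hugging Face team in 2023 to compare open-source LLMs fairly and transparently on a unified set of multi-task benchmarks. This dataset is a by-product of the Leaderboard: it records the evaluation details of BFauber/opt125m_10e5_lr2e-7, a model fine-tuned from OPT-125M, across representative tasks including ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. Its core research question is where the performance limits of a small-parameter model lie across different cognitive dimensions (such as commonsense reasoning, mathematical problem solving, and factual consistency), which provides empirical evidence for model selection in resource-constrained settings. By storing fine-grained results for 63 task configurations in a structured form, the dataset improves the reproducibility and transparency of the evaluation and serves as a reference for later analyses of small-model capabilities.
Current challenges
The domain problem this dataset addresses is the fragmentation and poor reproducibility of LLM performance evaluation. Traditionally, research teams used differing datasets and evaluation protocols, which undermined the credibility of comparisons between models. The Open LLM Leaderboard brings model performance on diverse tasks into one framework through a standardized evaluation pipeline, but building it involves several challenges. First, breadth and balance of task coverage are hard to achieve together: MMLU covers 57 subjects whereas GSM8K focuses only on mathematical reasoning, and differences in task difficulty can lead to a one-sided reading of a model's abilities. Second, the choice of metrics is contested: TruthfulQA reports both MC1 and MC2, while Winogrande reports only accuracy, and this heterogeneity complicates an overall ranking. Third, automatic dataset creation must ensure the integrity and version consistency of the data files (for example, Parquet files) across large-scale evaluations, and managing the separate split generated by each run puts pressure on storage and retrieval efficiency.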
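Common scenarios
Typical use cases
This dataset originates from the Open LLM Leaderboard's automatic evaluation of the model BFauber/opt125m_10e5_lr2e-7. Its main purpose is to record and present, systematically, the fine-grained performance of a large language model on diverse benchmark tasks. Specifically, the dataset covers ARC-Challenge, HellaSwag, MMLU (with its 57 subject subsets), TruthfulQA, Winogrande, and GSM8K. Each task is stored as a separate configuration, with metrics such as accuracy and its standard error. By loading a specific configuration and timestamped split, researchers can trace the model's outputs on a given task and analyze or compare its abilities along several dimensions.
Academic problems addressed
This dataset responds to two common problems in current LLM evaluation: limited reproducibility and a lack of fine-grained analysis. By storing per-task results on several established benchmarks in a standard form, it lets researchers diagnose a model's strengths and weaknesses in commonsense reasoning, mathematical computation, knowledge understanding, and adversarial question answering. For example, the data show near-zero accuracy on GSM8K (acc 0.003) and accuracy close to chance on Winogrande (acc about 0.512, against a 50% baseline for a two-option task). Such quantitative evidence helps characterize the limits of the model's reasoning ability and supports research on improving its weak points.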
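
The comparison with chance can be made concrete with a small calculation. The sketch below uses the accuracy and standard error reported in the results for Winogrande and assumes a 50% chance level for its two-option format; the function name `z_vs_chance` is hypothetical, and the normal approximation is a simplification.

```python
def z_vs_chance(acc: float, stderr: float, chance: float) -> float:
    """Number of standard errors separating an accuracy from the chance level."""
    return (acc - chance) / stderr


# Values copied from the "harness|winogrande|5" entry of the results.
winogrande_acc = 0.5122336227308603
winogrande_stderr = 0.01404827882040562

z = z_vs_chance(winogrande_acc, winogrande_stderr, chance=0.5)

# |z| below 1.96 means the score is not distinguishable from chance
# at the usual 5% level under a normal approximation.
print(f"z = {z:.2f}, distinguishable from chance: {abs(z) > 1.96}")
```

GSM8K has no comparable chance baseline because its answers are free-form, so its accuracy of about 0.003 is best read directly as a near-zero solve rate.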
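Practical applications
In practice, the dataset offers a reference for model selection and performance benchmarking. Developers and companies can use the stored evaluation results to judge quickly whether a given model suits a target scenario, such as educational question answering, medical knowledge retrieval, or logical reasoning. For example, the model reaches 44.85% accuracy on the MMLU professional_medicine subset, above the 25% chance level of four-option questions, which hints at some basic medical knowledge but is far from sufficient to support clinical decisions. Its 47.22% on high_school_statistics is likewise comparatively strong for this model. A fine-grained capability profile of this kind can guide fine-tuning and deployment decisions in vertical domains.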
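
Because each MMLU subset is small, it helps to read these scores together with their standard errors. The sketch below builds approximate 95% intervals from values in the results and compares them with a 25% chance level, assuming four answer options per question; `normal_ci` is a hypothetical helper and the normal approximation is a simplification.

```python
def normal_ci(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval under a normal approximation."""
    return acc - z * stderr, acc + z * stderr


CHANCE = 0.25  # assumption: four answer options per MMLU question

# (acc, acc_stderr) pairs copied from the results.
subsets = {
    "professional_medicine": (0.4485294117647059, 0.030211479609121593),
    "high_school_statistics": (0.4722222222222222, 0.0340470532865388),
    "human_aging": (0.11659192825112108, 0.02153963981624447),
}

verdicts = {}
for name, (acc, stderr) in subsets.items():
    low, high = normal_ci(acc, stderr)
    if low > CHANCE:
        verdicts[name] = "above chance"
    elif high < CHANCE:
        verdicts[name] = "below chance"
    else:
        verdicts[name] = "consistent with chance"
    print(f"{name}: [{low:.3f}, {high:.3f}] {verdicts[name]}")
```

The human_aging interval lies entirely below 25%, which suggests that deviations from chance on individual subsets can reflect answer-position or formatting biases as much as knowledge, so isolated high scores deserve cautious interpretation.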
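Recent research on the dataset
Latest research directions
Performance evaluation of large language models (LLMs) has become a central part of model iteration and optimization. This dataset records the results of BFauber/opt125m_10e5_lr2e-7 on the Open LLM Leaderboard across ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K, covering commonsense reasoning, knowledge understanding, factual consistency, and mathematical reasoning. Notably, the model's accuracy on GSM8K is extremely low (about 0.3%), which reflects the marked limits of small models in multi-step mathematical reasoning. Its Winogrande score (about 51.2%) is close to the 50% chance level of that two-option task, so it offers little evidence of pronoun-resolution ability. The dataset gives model developers fine-grained performance feedback and, as part of an open evaluation ecosystem, contributes to more standardized and transparent LLM evaluation; it is a useful reference for understanding the limits of model capability and for guiding follow-up research.
The content above was collected and summarized by 遇见数据集.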