
open-llm-leaderboard/details_TwT-6__open_llm_leaderboard_demo2

Hugging Face, updated 2024-04-16, indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_TwT-6__open_llm_leaderboard_demo2
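The repository behind this link is organized into one configuration per evaluated task and one split per evaluation run, as listed in the resource description further down. The following is a minimal sketch of the naming pattern that can be read off that listing. The helper names `config_name`, `split_name`, and `details_glob` are hypothetical, and the rules are inferred from the listed names only, not taken from the tooling that generated the dataset.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption (inferred from the listing): every character that is not a
    letter, digit, or underscore is replaced by an underscore.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp such as '2024-04-16T05:21:02.026941' to a split name.

    Assumption (inferred from the listing): '-' and ':' become '_', while the
    dot before the fractional seconds is kept.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


def details_glob(task_key: str, run_timestamp: str) -> str:
    """Build the parquet glob listed for a task and run.

    Assumption (inferred from the listing): the file name keeps the raw task
    key and uses the timestamp with ':' replaced by '-'.
    """
    file_timestamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_timestamp}.parquet"


if __name__ == "__main__":
    key = "harness|arc:challenge|25"
    run = "2024-04-16T05:21:02.026941"
    print(config_name(key))
    print(split_name(run))
    print(details_glob(key, run))
```

A name produced this way can be passed as the configuration argument of `load_dataset`, with either a timestamp split or the `latest` split that the listing defines for every configuration.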
Resource description:
--- pretty_name: Evaluation run of TwT-6/open_llm_leaderboard_demo2 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [TwT-6/open_llm_leaderboard_demo2](https://huggingface.co/TwT-6/open_llm_leaderboard_demo2)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_TwT-6__open_llm_leaderboard_demo2\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-04-16T05:21:02.026941](https://huggingface.co/datasets/open-llm-leaderboard/details_TwT-6__open_llm_leaderboard_demo2/blob/main/results_2024-04-16T05-21-02.026941.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6451101992879201,\n\ \ \"acc_stderr\": 0.031551562875150714,\n \"acc_norm\": 0.6577240641395706,\n\ \ \"acc_norm_stderr\": 0.03239157909913464,\n \"mc1\": 0.3659730722154223,\n\ \ \"mc1_stderr\": 0.016862941684088376,\n \"mc2\": 0.5245484506267384,\n\ \ \"mc2_stderr\": 0.01522218662636776\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.5708191126279863,\n \"acc_stderr\": 0.014464085894870653,\n\ \ \"acc_norm\": 0.6220136518771331,\n \"acc_norm_stderr\": 0.014169664520303101\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6408086038637721,\n\ \ \"acc_stderr\": 0.004787829168255653,\n \"acc_norm\": 0.8375821549492133,\n\ \ \"acc_norm_stderr\": 0.0036807989505319148\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.33,\n \"acc_stderr\": 0.04725815626252605,\n \ \ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.04725815626252605\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6370370370370371,\n\ \ \"acc_stderr\": 0.04153948404742398,\n \"acc_norm\": 0.6370370370370371,\n\ \ \"acc_norm_stderr\": 0.04153948404742398\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7236842105263158,\n \"acc_stderr\": 0.03639057569952928,\n\ \ \"acc_norm\": 0.7236842105263158,\n \"acc_norm_stderr\": 0.03639057569952928\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.69,\n\ \ \"acc_stderr\": 0.04648231987117316,\n \"acc_norm\": 0.69,\n \ \ \"acc_norm_stderr\": 0.04648231987117316\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7056603773584905,\n \"acc_stderr\": 0.02804918631569525,\n\ \ \"acc_norm\": 0.7056603773584905,\n \"acc_norm_stderr\": 0.02804918631569525\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7569444444444444,\n\ \ \"acc_stderr\": 0.03586879280080341,\n \"acc_norm\": 0.7569444444444444,\n\ \ \"acc_norm_stderr\": 0.03586879280080341\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.42,\n \"acc_stderr\": 0.049604496374885836,\n \ \ \"acc_norm\": 0.42,\n \"acc_norm_stderr\": 0.049604496374885836\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"\ acc\": 0.54,\n \"acc_stderr\": 0.05009082659620332,\n \"acc_norm\"\ : 0.54,\n \"acc_norm_stderr\": 0.05009082659620332\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.04760952285695235,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.04760952285695235\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.653179190751445,\n\ \ \"acc_stderr\": 0.036291466701596636,\n \"acc_norm\": 0.653179190751445,\n\ \ \"acc_norm_stderr\": 0.036291466701596636\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.39215686274509803,\n \"acc_stderr\": 0.048580835742663434,\n\ \ \"acc_norm\": 0.39215686274509803,\n \"acc_norm_stderr\": 0.048580835742663434\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.81,\n \"acc_stderr\": 0.039427724440366234,\n \"acc_norm\": 0.81,\n\ \ \"acc_norm_stderr\": 0.039427724440366234\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.574468085106383,\n \"acc_stderr\": 0.03232146916224468,\n\ \ \"acc_norm\": 0.574468085106383,\n \"acc_norm_stderr\": 0.03232146916224468\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.49122807017543857,\n\ \ \"acc_stderr\": 0.04702880432049615,\n \"acc_norm\": 0.49122807017543857,\n\ \ \"acc_norm_stderr\": 0.04702880432049615\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5862068965517241,\n \"acc_stderr\": 0.04104269211806232,\n\ \ \"acc_norm\": 0.5862068965517241,\n \"acc_norm_stderr\": 0.04104269211806232\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.46296296296296297,\n \"acc_stderr\": 0.02568056464005688,\n \"\ acc_norm\": 0.46296296296296297,\n 
\"acc_norm_stderr\": 0.02568056464005688\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4365079365079365,\n\ \ \"acc_stderr\": 0.04435932892851466,\n \"acc_norm\": 0.4365079365079365,\n\ \ \"acc_norm_stderr\": 0.04435932892851466\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.38,\n \"acc_stderr\": 0.04878317312145632,\n \ \ \"acc_norm\": 0.38,\n \"acc_norm_stderr\": 0.04878317312145632\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.8032258064516129,\n\ \ \"acc_stderr\": 0.022616409420742018,\n \"acc_norm\": 0.8032258064516129,\n\ \ \"acc_norm_stderr\": 0.022616409420742018\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.47783251231527096,\n \"acc_stderr\": 0.03514528562175008,\n\ \ \"acc_norm\": 0.47783251231527096,\n \"acc_norm_stderr\": 0.03514528562175008\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.66,\n \"acc_stderr\": 0.04760952285695237,\n \"acc_norm\"\ : 0.66,\n \"acc_norm_stderr\": 0.04760952285695237\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.8303030303030303,\n \"acc_stderr\": 0.029311188674983134,\n\ \ \"acc_norm\": 0.8303030303030303,\n \"acc_norm_stderr\": 0.029311188674983134\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.803030303030303,\n \"acc_stderr\": 0.028335609732463355,\n \"\ acc_norm\": 0.803030303030303,\n \"acc_norm_stderr\": 0.028335609732463355\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.917098445595855,\n \"acc_stderr\": 0.01989934131572178,\n\ \ \"acc_norm\": 0.917098445595855,\n \"acc_norm_stderr\": 0.01989934131572178\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6538461538461539,\n \"acc_stderr\": 0.02412112541694119,\n \ \ \"acc_norm\": 0.6538461538461539,\n \"acc_norm_stderr\": 0.02412112541694119\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.362962962962963,\n \"acc_stderr\": 0.02931820364520686,\n \ \ \"acc_norm\": 0.362962962962963,\n \"acc_norm_stderr\": 0.02931820364520686\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6596638655462185,\n \"acc_stderr\": 0.030778057422931673,\n\ \ \"acc_norm\": 0.6596638655462185,\n \"acc_norm_stderr\": 0.030778057422931673\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.3576158940397351,\n \"acc_stderr\": 0.03913453431177258,\n \"\ acc_norm\": 0.3576158940397351,\n \"acc_norm_stderr\": 0.03913453431177258\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8385321100917431,\n \"acc_stderr\": 0.015776239256163248,\n \"\ acc_norm\": 0.8385321100917431,\n \"acc_norm_stderr\": 0.015776239256163248\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5833333333333334,\n \"acc_stderr\": 0.033622774366080424,\n \"\ acc_norm\": 0.5833333333333334,\n \"acc_norm_stderr\": 0.033622774366080424\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8627450980392157,\n \"acc_stderr\": 0.024152225962801584,\n \"\ acc_norm\": 0.8627450980392157,\n \"acc_norm_stderr\": 0.024152225962801584\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.8523206751054853,\n \"acc_stderr\": 0.02309432958259569,\n \ \ \"acc_norm\": 0.8523206751054853,\n \"acc_norm_stderr\": 0.02309432958259569\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6860986547085202,\n\ \ \"acc_stderr\": 0.03114679648297246,\n \"acc_norm\": 0.6860986547085202,\n\ \ \"acc_norm_stderr\": 0.03114679648297246\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7633587786259542,\n \"acc_stderr\": 0.03727673575596915,\n\ \ \"acc_norm\": 0.7633587786259542,\n \"acc_norm_stderr\": 0.03727673575596915\n\ \ },\n \"harness|hendrycksTest-international_law|5\": 
{\n \"acc\":\ \ 0.8016528925619835,\n \"acc_stderr\": 0.03640118271990945,\n \"\ acc_norm\": 0.8016528925619835,\n \"acc_norm_stderr\": 0.03640118271990945\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7777777777777778,\n\ \ \"acc_stderr\": 0.0401910747255735,\n \"acc_norm\": 0.7777777777777778,\n\ \ \"acc_norm_stderr\": 0.0401910747255735\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7607361963190185,\n \"acc_stderr\": 0.033519538795212696,\n\ \ \"acc_norm\": 0.7607361963190185,\n \"acc_norm_stderr\": 0.033519538795212696\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.4375,\n\ \ \"acc_stderr\": 0.04708567521880525,\n \"acc_norm\": 0.4375,\n \ \ \"acc_norm_stderr\": 0.04708567521880525\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.8252427184466019,\n \"acc_stderr\": 0.03760178006026621,\n\ \ \"acc_norm\": 0.8252427184466019,\n \"acc_norm_stderr\": 0.03760178006026621\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8888888888888888,\n\ \ \"acc_stderr\": 0.020588491316092368,\n \"acc_norm\": 0.8888888888888888,\n\ \ \"acc_norm_stderr\": 0.020588491316092368\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.72,\n \"acc_stderr\": 0.045126085985421276,\n \ \ \"acc_norm\": 0.72,\n \"acc_norm_stderr\": 0.045126085985421276\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8326947637292464,\n\ \ \"acc_stderr\": 0.013347327202920332,\n \"acc_norm\": 0.8326947637292464,\n\ \ \"acc_norm_stderr\": 0.013347327202920332\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7427745664739884,\n \"acc_stderr\": 0.02353292543104428,\n\ \ \"acc_norm\": 0.7427745664739884,\n \"acc_norm_stderr\": 0.02353292543104428\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.3843575418994413,\n\ \ \"acc_stderr\": 0.016269088663959402,\n \"acc_norm\": 0.3843575418994413,\n\ \ \"acc_norm_stderr\": 0.016269088663959402\n 
},\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7352941176470589,\n \"acc_stderr\": 0.02526169121972949,\n\ \ \"acc_norm\": 0.7352941176470589,\n \"acc_norm_stderr\": 0.02526169121972949\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7170418006430869,\n\ \ \"acc_stderr\": 0.025583062489984813,\n \"acc_norm\": 0.7170418006430869,\n\ \ \"acc_norm_stderr\": 0.025583062489984813\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7438271604938271,\n \"acc_stderr\": 0.024288533637726095,\n\ \ \"acc_norm\": 0.7438271604938271,\n \"acc_norm_stderr\": 0.024288533637726095\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.5,\n \"acc_stderr\": 0.029827499313594685,\n \"acc_norm\"\ : 0.5,\n \"acc_norm_stderr\": 0.029827499313594685\n },\n \"harness|hendrycksTest-professional_law|5\"\ : {\n \"acc\": 0.48435462842242505,\n \"acc_stderr\": 0.012763982838120948,\n\ \ \"acc_norm\": 0.48435462842242505,\n \"acc_norm_stderr\": 0.012763982838120948\n\ \ },\n \"harness|hendrycksTest-professional_medicine|5\": {\n \"acc\"\ : 0.7352941176470589,\n \"acc_stderr\": 0.02679956202488765,\n \"\ acc_norm\": 0.7352941176470589,\n \"acc_norm_stderr\": 0.02679956202488765\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6879084967320261,\n \"acc_stderr\": 0.01874501120127766,\n \ \ \"acc_norm\": 0.6879084967320261,\n \"acc_norm_stderr\": 0.01874501120127766\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6818181818181818,\n\ \ \"acc_stderr\": 0.04461272175910509,\n \"acc_norm\": 0.6818181818181818,\n\ \ \"acc_norm_stderr\": 0.04461272175910509\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7387755102040816,\n \"acc_stderr\": 0.028123429335142797,\n\ \ \"acc_norm\": 0.7387755102040816,\n \"acc_norm_stderr\": 0.028123429335142797\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8507462686567164,\n\ \ \"acc_stderr\": 
0.025196929874827072,\n \"acc_norm\": 0.8507462686567164,\n\ \ \"acc_norm_stderr\": 0.025196929874827072\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.89,\n \"acc_stderr\": 0.03144660377352202,\n \ \ \"acc_norm\": 0.89,\n \"acc_norm_stderr\": 0.03144660377352202\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5662650602409639,\n\ \ \"acc_stderr\": 0.03858158940685515,\n \"acc_norm\": 0.5662650602409639,\n\ \ \"acc_norm_stderr\": 0.03858158940685515\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8011695906432749,\n \"acc_stderr\": 0.030611116557432528,\n\ \ \"acc_norm\": 0.8011695906432749,\n \"acc_norm_stderr\": 0.030611116557432528\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3659730722154223,\n\ \ \"mc1_stderr\": 0.016862941684088376,\n \"mc2\": 0.5245484506267384,\n\ \ \"mc2_stderr\": 0.01522218662636776\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7932123125493291,\n \"acc_stderr\": 0.011382566829235814\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.000758150113722517,\n \ \ \"acc_stderr\": 0.000758150113722539\n }\n}\n```" repo_url: https://huggingface.co/TwT-6/open_llm_leaderboard_demo2 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|arc:challenge|25_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|arc:challenge|25_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-04-16T05-21-02.026941.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|gsm8k|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|gsm8k|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - 
'**/details_harness|gsm8k|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hellaswag|10_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hellaswag|10_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-15T16-29-36.546610.parquet' - 
'**/details_harness|hendrycksTest-formal_logic|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-15T16-29-36.546610.parquet' - 
'**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-15T16-29-36.546610.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - 
'**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-16T05-21-02.026941.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-16T05-21-02.026941.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-16T05-21-02.026941.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-16T05-21-02.026941.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-16T05-21-02.026941.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-16T05-21-02.026941.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-16T05-21-02.026941.parquet' 
- config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - 
'**/details_harness|hendrycksTest-college_computer_science|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - 
'**/details_harness|hendrycksTest-computer_security|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_04_15T16_29_36.546610 
path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - 
'**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-15T16-29-36.546610.parquet' 
- split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - 
'**/details_harness|hendrycksTest-human_aging|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-16T05-21-02.026941.parquet' - config_name: 
harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-management|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-management|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - 
'**/details_harness|hendrycksTest-professional_psychology|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-16T05-21-02.026941.parquet' - config_name: 
harness_hendrycksTest_virology_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-virology|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-virology|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-16T05-21-02.026941.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|truthfulqa:mc|0_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|truthfulqa:mc|0_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-04-16T05-21-02.026941.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_04_15T16_29_36.546610 path: - '**/details_harness|winogrande|5_2024-04-15T16-29-36.546610.parquet' - split: 2024_04_16T05_21_02.026941 path: - '**/details_harness|winogrande|5_2024-04-16T05-21-02.026941.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-04-16T05-21-02.026941.parquet' - config_name: results data_files: - split: 2024_04_15T16_29_36.546610 path: - results_2024-04-15T16-29-36.546610.parquet - split: 2024_04_16T05_21_02.026941 path: - results_2024-04-16T05-21-02.026941.parquet - split: latest path: - results_2024-04-16T05-21-02.026941.parquet
---

# Dataset Card for Evaluation run of TwT-6/open_llm_leaderboard_demo2

<!-- Provide a quick summary of the dataset.
-->

Dataset automatically created during the evaluation run of model [TwT-6/open_llm_leaderboard_demo2](https://huggingface.co/TwT-6/open_llm_leaderboard_demo2) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# Load the per-sample details for one task; "latest" is the split
# that the configuration maps to the most recent run.
data = load_dataset(
    "open-llm-leaderboard/details_TwT-6__open_llm_leaderboard_demo2",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-04-16T05:21:02.026941](https://huggingface.co/datasets/open-llm-leaderboard/details_TwT-6__open_llm_leaderboard_demo2/blob/main/results_2024-04-16T05-21-02.026941.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You can find each in the results and the "latest" split for each eval):

```python
{
    "all": {
        "acc": 0.6451101992879201,
        "acc_stderr": 0.031551562875150714,
        "acc_norm": 0.6577240641395706,
        "acc_norm_stderr": 0.03239157909913464,
        "mc1": 0.3659730722154223,
        "mc1_stderr": 0.016862941684088376,
        "mc2": 0.5245484506267384,
        "mc2_stderr": 0.01522218662636776
    },
    "harness|arc:challenge|25": {
        "acc": 0.5708191126279863,
        "acc_stderr": 0.014464085894870653,
        "acc_norm": 0.6220136518771331,
        "acc_norm_stderr": 0.014169664520303101
    },
    "harness|hellaswag|10": {
        "acc": 0.6408086038637721,
        "acc_stderr": 0.004787829168255653,
        "acc_norm": 0.8375821549492133,
        "acc_norm_stderr": 0.0036807989505319148
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.33,
        "acc_stderr": 0.04725815626252605,
        "acc_norm": 0.33,
        "acc_norm_stderr": 0.04725815626252605
    },
    "harness|hendrycksTest-anatomy|5": {
        "acc": 0.6370370370370371,
        "acc_stderr": 0.04153948404742398,
        "acc_norm": 0.6370370370370371,
        "acc_norm_stderr": 0.04153948404742398
    },
    "harness|hendrycksTest-astronomy|5": {
        "acc": 0.7236842105263158,
        "acc_stderr": 0.03639057569952928,
        "acc_norm": 0.7236842105263158,
        "acc_norm_stderr": 0.03639057569952928
    },
    "harness|hendrycksTest-business_ethics|5": {
        "acc": 0.69,
        "acc_stderr": 0.04648231987117316,
        "acc_norm": 0.69,
        "acc_norm_stderr": 0.04648231987117316
    },
    "harness|hendrycksTest-clinical_knowledge|5": {
        "acc": 0.7056603773584905,
        "acc_stderr": 0.02804918631569525,
        "acc_norm": 0.7056603773584905,
        "acc_norm_stderr": 0.02804918631569525
    },
    "harness|hendrycksTest-college_biology|5": {
        "acc": 0.7569444444444444,
        "acc_stderr": 0.03586879280080341,
        "acc_norm": 0.7569444444444444,
        "acc_norm_stderr": 0.03586879280080341
    },
    "harness|hendrycksTest-college_chemistry|5": {
        "acc": 0.42,
        "acc_stderr": 0.049604496374885836,
        "acc_norm": 0.42,
        "acc_norm_stderr": 0.049604496374885836
    },
    "harness|hendrycksTest-college_computer_science|5": {
        "acc": 0.54,
        "acc_stderr": 0.05009082659620332,
        "acc_norm": 0.54,
        "acc_norm_stderr":
            0.05009082659620332
    },
    "harness|hendrycksTest-college_mathematics|5": {
        "acc": 0.34,
        "acc_stderr": 0.04760952285695235,
        "acc_norm": 0.34,
        "acc_norm_stderr": 0.04760952285695235
    },
    "harness|hendrycksTest-college_medicine|5": {
        "acc": 0.653179190751445,
        "acc_stderr": 0.036291466701596636,
        "acc_norm": 0.653179190751445,
        "acc_norm_stderr": 0.036291466701596636
    },
    "harness|hendrycksTest-college_physics|5": {
        "acc": 0.39215686274509803,
        "acc_stderr": 0.048580835742663434,
        "acc_norm": 0.39215686274509803,
        "acc_norm_stderr": 0.048580835742663434
    },
    "harness|hendrycksTest-computer_security|5": {
        "acc": 0.81,
        "acc_stderr": 0.039427724440366234,
        "acc_norm": 0.81,
        "acc_norm_stderr": 0.039427724440366234
    },
    "harness|hendrycksTest-conceptual_physics|5": {
        "acc": 0.574468085106383,
        "acc_stderr": 0.03232146916224468,
        "acc_norm": 0.574468085106383,
        "acc_norm_stderr": 0.03232146916224468
    },
    "harness|hendrycksTest-econometrics|5": {
        "acc": 0.49122807017543857,
        "acc_stderr": 0.04702880432049615,
        "acc_norm": 0.49122807017543857,
        "acc_norm_stderr": 0.04702880432049615
    },
    "harness|hendrycksTest-electrical_engineering|5": {
        "acc": 0.5862068965517241,
        "acc_stderr": 0.04104269211806232,
        "acc_norm": 0.5862068965517241,
        "acc_norm_stderr": 0.04104269211806232
    },
    "harness|hendrycksTest-elementary_mathematics|5": {
        "acc": 0.46296296296296297,
        "acc_stderr": 0.02568056464005688,
        "acc_norm": 0.46296296296296297,
        "acc_norm_stderr": 0.02568056464005688
    },
    "harness|hendrycksTest-formal_logic|5": {
        "acc": 0.4365079365079365,
        "acc_stderr": 0.04435932892851466,
        "acc_norm": 0.4365079365079365,
        "acc_norm_stderr": 0.04435932892851466
    },
    "harness|hendrycksTest-global_facts|5": {
        "acc": 0.38,
        "acc_stderr": 0.04878317312145632,
        "acc_norm": 0.38,
        "acc_norm_stderr": 0.04878317312145632
    },
    "harness|hendrycksTest-high_school_biology|5": {
        "acc": 0.8032258064516129,
        "acc_stderr": 0.022616409420742018,
        "acc_norm": 0.8032258064516129,
        "acc_norm_stderr": 0.022616409420742018
    },
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.47783251231527096, "acc_stderr": 0.03514528562175008, "acc_norm": 0.47783251231527096, "acc_norm_stderr": 0.03514528562175008 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.66, "acc_stderr": 0.04760952285695237, "acc_norm": 0.66, "acc_norm_stderr": 0.04760952285695237 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.8303030303030303, "acc_stderr": 0.029311188674983134, "acc_norm": 0.8303030303030303, "acc_norm_stderr": 0.029311188674983134 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.803030303030303, "acc_stderr": 0.028335609732463355, "acc_norm": 0.803030303030303, "acc_norm_stderr": 0.028335609732463355 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.917098445595855, "acc_stderr": 0.01989934131572178, "acc_norm": 0.917098445595855, "acc_norm_stderr": 0.01989934131572178 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6538461538461539, "acc_stderr": 0.02412112541694119, "acc_norm": 0.6538461538461539, "acc_norm_stderr": 0.02412112541694119 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.362962962962963, "acc_stderr": 0.02931820364520686, "acc_norm": 0.362962962962963, "acc_norm_stderr": 0.02931820364520686 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6596638655462185, "acc_stderr": 0.030778057422931673, "acc_norm": 0.6596638655462185, "acc_norm_stderr": 0.030778057422931673 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3576158940397351, "acc_stderr": 0.03913453431177258, "acc_norm": 0.3576158940397351, "acc_norm_stderr": 0.03913453431177258 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8385321100917431, "acc_stderr": 0.015776239256163248, "acc_norm": 0.8385321100917431, "acc_norm_stderr": 0.015776239256163248 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5833333333333334, "acc_stderr": 0.033622774366080424, 
"acc_norm": 0.5833333333333334, "acc_norm_stderr": 0.033622774366080424 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8627450980392157, "acc_stderr": 0.024152225962801584, "acc_norm": 0.8627450980392157, "acc_norm_stderr": 0.024152225962801584 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.8523206751054853, "acc_stderr": 0.02309432958259569, "acc_norm": 0.8523206751054853, "acc_norm_stderr": 0.02309432958259569 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6860986547085202, "acc_stderr": 0.03114679648297246, "acc_norm": 0.6860986547085202, "acc_norm_stderr": 0.03114679648297246 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7633587786259542, "acc_stderr": 0.03727673575596915, "acc_norm": 0.7633587786259542, "acc_norm_stderr": 0.03727673575596915 }, "harness|hendrycksTest-international_law|5": { "acc": 0.8016528925619835, "acc_stderr": 0.03640118271990945, "acc_norm": 0.8016528925619835, "acc_norm_stderr": 0.03640118271990945 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7777777777777778, "acc_stderr": 0.0401910747255735, "acc_norm": 0.7777777777777778, "acc_norm_stderr": 0.0401910747255735 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7607361963190185, "acc_stderr": 0.033519538795212696, "acc_norm": 0.7607361963190185, "acc_norm_stderr": 0.033519538795212696 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.4375, "acc_stderr": 0.04708567521880525, "acc_norm": 0.4375, "acc_norm_stderr": 0.04708567521880525 }, "harness|hendrycksTest-management|5": { "acc": 0.8252427184466019, "acc_stderr": 0.03760178006026621, "acc_norm": 0.8252427184466019, "acc_norm_stderr": 0.03760178006026621 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8888888888888888, "acc_stderr": 0.020588491316092368, "acc_norm": 0.8888888888888888, "acc_norm_stderr": 0.020588491316092368 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.72, "acc_stderr": 0.045126085985421276, "acc_norm": 0.72, 
"acc_norm_stderr": 0.045126085985421276 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8326947637292464, "acc_stderr": 0.013347327202920332, "acc_norm": 0.8326947637292464, "acc_norm_stderr": 0.013347327202920332 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7427745664739884, "acc_stderr": 0.02353292543104428, "acc_norm": 0.7427745664739884, "acc_norm_stderr": 0.02353292543104428 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.3843575418994413, "acc_stderr": 0.016269088663959402, "acc_norm": 0.3843575418994413, "acc_norm_stderr": 0.016269088663959402 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7352941176470589, "acc_stderr": 0.02526169121972949, "acc_norm": 0.7352941176470589, "acc_norm_stderr": 0.02526169121972949 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7170418006430869, "acc_stderr": 0.025583062489984813, "acc_norm": 0.7170418006430869, "acc_norm_stderr": 0.025583062489984813 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7438271604938271, "acc_stderr": 0.024288533637726095, "acc_norm": 0.7438271604938271, "acc_norm_stderr": 0.024288533637726095 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.5, "acc_stderr": 0.029827499313594685, "acc_norm": 0.5, "acc_norm_stderr": 0.029827499313594685 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.48435462842242505, "acc_stderr": 0.012763982838120948, "acc_norm": 0.48435462842242505, "acc_norm_stderr": 0.012763982838120948 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.7352941176470589, "acc_stderr": 0.02679956202488765, "acc_norm": 0.7352941176470589, "acc_norm_stderr": 0.02679956202488765 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6879084967320261, "acc_stderr": 0.01874501120127766, "acc_norm": 0.6879084967320261, "acc_norm_stderr": 0.01874501120127766 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6818181818181818, "acc_stderr": 0.04461272175910509, "acc_norm": 0.6818181818181818, "acc_norm_stderr": 
0.04461272175910509 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7387755102040816, "acc_stderr": 0.028123429335142797, "acc_norm": 0.7387755102040816, "acc_norm_stderr": 0.028123429335142797 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8507462686567164, "acc_stderr": 0.025196929874827072, "acc_norm": 0.8507462686567164, "acc_norm_stderr": 0.025196929874827072 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.89, "acc_stderr": 0.03144660377352202, "acc_norm": 0.89, "acc_norm_stderr": 0.03144660377352202 }, "harness|hendrycksTest-virology|5": { "acc": 0.5662650602409639, "acc_stderr": 0.03858158940685515, "acc_norm": 0.5662650602409639, "acc_norm_stderr": 0.03858158940685515 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8011695906432749, "acc_stderr": 0.030611116557432528, "acc_norm": 0.8011695906432749, "acc_norm_stderr": 0.030611116557432528 }, "harness|truthfulqa:mc|0": { "mc1": 0.3659730722154223, "mc1_stderr": 0.016862941684088376, "mc2": 0.5245484506267384, "mc2_stderr": 0.01522218662636776 }, "harness|winogrande|5": { "acc": 0.7932123125493291, "acc_stderr": 0.011382566829235814 }, "harness|gsm8k|5": { "acc": 0.000758150113722517, "acc_stderr": 0.000758150113722539 } }
```

## Dataset Details

### Dataset Description

<!-- Provide a longer summary of what this dataset is. -->

- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]

### Dataset Sources [optional]

<!-- Provide the basic links for the dataset. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the dataset is intended to be used. -->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard
Summary of original information

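Dataset overview

Dataset name

  • pretty_name: Evaluation run of TwT-6/open_llm_leaderboard_demo2

Dataset description

Dataset composition

  • Number of configurations: 63
  • Each configuration corresponds to one evaluated task
  • Data source: created from 2 runs
  • Splits: each run is stored as its own split, named after the run's timestamp
  • "train" split: always points to the latest results
  • "results" configuration: stores the aggregated results of the runs and is used to compute and display the aggregated metrics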

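Dataset loading example

```python
from datasets import load_dataset

# Load the per-sample details for the Winogrande task;
# the "train" split points to the latest run.
data = load_dataset(
    "open-llm-leaderboard/details_TwT-6__open_llm_leaderboard_demo2",
    "harness_winogrande_5",
    split="train",
)
```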

Latest results

  • Source: the run of 2024-04-16T05:21:02.026941
  • Content: evaluation results for multiple tasks, including accuracy (acc) and its standard error (acc_stderr)
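Since each task reports both `acc` and `acc_stderr`, a reader can turn the pair into a rough uncertainty range. The following is a minimal sketch, assuming a normal approximation with z = 1.96 (this interval is not part of the dataset card); the helper name `normal_interval` is hypothetical, and the two numbers are copied from the `harness|winogrande|5` entry of the results JSON.

```python
# Values copied from the "harness|winogrande|5" entry of the results JSON.
acc = 0.7932123125493291
acc_stderr = 0.011382566829235814


def normal_interval(estimate, stderr, z=1.96):
    """Hypothetical helper: (low, high) bounds of a normal-approximation interval."""
    half_width = z * stderr
    return estimate - half_width, estimate + half_width


low, high = normal_interval(acc, acc_stderr)
print(f"winogrande acc = {acc:.4f}, approx. 95% interval = [{low:.4f}, {high:.4f}]")
```

The normal approximation is only a convenience; for accuracies very close to 0 or 1 (such as the GSM8K entry) it can produce bounds outside the valid range, so a different interval would be more appropriate there.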

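Dataset configuration details

Configuration list

  • harness_arc_challenge_25
  • harness_gsm8k_5
  • harness_hellaswag_10
  • harness_hendrycksTest_5

Each configuration contains several data files, one per run timestamp plus one for the latest split.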
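The task keys in the results JSON (for example `harness|arc:challenge|25`) and the configuration names listed above (for example `harness_arc_challenge_25`) appear to differ only in their separators. The sketch below assumes that `|`, `:` and `-` are all replaced by `_`; this rule is inferred from the names visible on this page, not taken from any documented API, and the helper `task_key_to_config` is hypothetical. In particular, the mapping for the `hendrycksTest-…` subtasks is an unverified assumption.

```python
import re


def task_key_to_config(task_key: str) -> str:
    """Hypothetical helper: map a results-JSON task key to a configuration name.

    Assumes the separators '|', ':' and '-' all become '_'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


for key in ("harness|arc:challenge|25", "harness|winogrande|5", "harness|gsm8k|5"):
    print(key, "->", task_key_to_config(key))
```

If a derived name does not match an actual configuration, the list of configurations published with the dataset is the authority.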

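Collected summary
Dataset introduction
Construction
In large language model evaluation, the Open LLM Leaderboard is an important platform for standardized performance benchmarking. This dataset was generated automatically when the model TwT-6/open_llm_leaderboard_demo2 was evaluated on the Open LLM Leaderboard. It was built from two separate evaluation runs. Each run's results are stored in every configuration as a split named after the run's timestamp, while the "train" split always points to the most recent results. The dataset contains 63 configurations, each corresponding to one evaluated task, plus a separate configuration named "results" that stores the aggregated metrics of the runs. These aggregated values are used to compute and display the model's overall performance on the Leaderboard.
Features
A notable feature of this dataset is its structured multi-task coverage. It spans commonsense reasoning (ARC, HellaSwag), specialized subject knowledge (the dozens of subject subsets of MMLU), and mathematical reasoning (GSM8K). Each configuration is divided into splits by run timestamp, which lets researchers trace how the model's results changed between runs. The data is stored at the level of individual instances, and the compressed Parquet format makes large volumes of evaluation results efficient to store and retrieve. Through the "latest" split, users can access the most recent evaluation details for each task directly, without filtering historical versions by hand.
Usage
Researchers can load the dataset with the Hugging Face datasets library. For example, `load_dataset("open-llm-leaderboard/details_TwT-6__open_llm_leaderboard_demo2", "harness_winogrande_5", split="train")` returns the latest evaluation details for the Winogrande task. Loading requires a configuration name (such as "harness_arc_challenge_25") and a split (such as "train" or a specific timestamp split). To analyze all tasks in bulk, iterate over the 63 configurations, or load the "results" configuration directly to obtain the aggregated metrics. The data is stored as Parquet, which supports efficient columnar queries and statistical analysis for model comparison and for reproducing evaluation results.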
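For bulk analysis, one simple step is to group the per-task entries of the results JSON by benchmark and average them. The sketch below works offline on a small excerpt copied from the results shown earlier on this page. The helper names (`benchmark_of`, `mean_acc_by_benchmark`) are hypothetical, and the unweighted mean is an assumption made for illustration; the Leaderboard's own aggregation may differ.

```python
from collections import defaultdict
from statistics import mean

# Excerpt of the results JSON; values copied from the latest run.
results_excerpt = {
    "harness|hendrycksTest-security_studies|5": {"acc": 0.7387755102040816},
    "harness|hendrycksTest-sociology|5": {"acc": 0.8507462686567164},
    "harness|hendrycksTest-us_foreign_policy|5": {"acc": 0.89},
    "harness|winogrande|5": {"acc": 0.7932123125493291},
    "harness|gsm8k|5": {"acc": 0.000758150113722517},
}


def benchmark_of(task_key):
    """Hypothetical helper: 'harness|hendrycksTest-sociology|5' -> 'hendrycksTest'."""
    task = task_key.split("|")[1]
    return task.split("-")[0].split(":")[0]


def mean_acc_by_benchmark(results):
    """Unweighted mean of 'acc' per benchmark, skipping entries without 'acc'."""
    grouped = defaultdict(list)
    for key, metrics in results.items():
        if "acc" in metrics:
            grouped[benchmark_of(key)].append(metrics["acc"])
    return {name: mean(values) for name, values in grouped.items()}


means = mean_acc_by_benchmark(results_excerpt)
for name, value in means.items():
    print(f"{name}: {value:.4f}")
```

Entries that report only `mc1`/`mc2` (the TruthfulQA task) are skipped by this sketch because they have no `acc` field.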
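Background and challenges
Background overview
This dataset originates from the Open LLM Leaderboard evaluation framework led by the Hugging Face community. It was created in April 2024 by Clementine et al. to systematically track and compare the performance of open-source large language models across diverse benchmark tasks. Its core research question is how to provide fair and transparent performance measurement for a growing number of open-source models through a standardized, reproducible evaluation process. The dataset records the results of the model TwT-6/open_llm_leaderboard_demo2 on 63 configurations, covering established tasks such as the ARC challenge set, HellaSwag commonsense reasoning, MMLU multi-subject knowledge, GSM8K mathematical reasoning, and TruthfulQA factuality, and so provides an empirical basis for understanding the limits of the model's capabilities. As part of the Open LLM Leaderboard, it helps move model evaluation from fragmented to systematic practice and serves as a reference for comparing and iterating on open-source models.
Current challenges
The domain problem this dataset addresses is that LLM evaluation suffers from inconsistent benchmarks and results that are hard to reproduce, which calls for a standardized evaluation suite covering multiple capabilities. Building it involved three challenges. First, the formats and few-shot settings of 63 heterogeneous tasks (for example, 25-shot for ARC and 5-shot for GSM8K) must be coordinated to keep the evaluation consistent. Second, the pipeline must track the results of multiple runs (for example, the runs of 2024-04-15 and 2024-04-16) and automatically point the "train" split at the latest data, which places high demands on data version management. Third, evaluating the 57 fine-grained subjects in HendrycksTest (from abstract algebra to virology) requires a compatible configuration structure that balances evaluation breadth against compute cost, which directly affects the dataset's usability and extensibility.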
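Common scenarios
Typical use cases
The dataset is mainly used to evaluate and track large language model performance on the Open LLM Leaderboard, covering 63 benchmark configurations that include ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. By loading a specific configuration and split, researchers can obtain fine-grained performance metrics for each task and systematically compare models on reasoning, commonsense understanding, mathematical problem solving, and knowledge.
Academic problems addressed
The dataset addresses the lack of a unified, reproducible basis for comparison in standardized LLM evaluation. By storing aggregated results from multiple runs with versioned tracking, it lets researchers measure a model's generalization and robustness across diverse tasks objectively, supports more transparent evaluation practice, and provides reliable data for subsequent model optimization.
Derived work
The dataset has given rise to related work. Model rankings built on its evaluation results have contributed to the activity of the Open LLM Leaderboard community and encouraged iterative improvement of model families such as LLaMA and Falcon. Researchers have also used its fine-grained evaluation data to analyze model weaknesses, which has prompted academic work on targeted training strategies and multi-task learning methods.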
The content above was collected and summarized by 遇见数据集.