
open-llm-leaderboard-old/details_facebook__opt-125m

Hugging Face | Updated 2024-01-23 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_facebook__opt-125m
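The dataset card on this page names its configurations and splits after the evaluation task keys and run timestamps (for example, the key `harness|arc:challenge|25` appears as config `harness_arc_challenge_25`, and the run `2024-01-23T14:31:42.504661` appears as split `2024_01_23T14_31_42.504661`). The following is a minimal sketch of that mapping, inferred only from the pairs visible in the card. The helper names `config_name`, `split_name`, and `load_task_details` are hypothetical, and the default `repo_id` is taken from this listing's title, not confirmed against the Hub.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption: every character that is not a letter or digit becomes an
    underscore. This reproduces the key/config pairs visible in the card.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(timestamp: str) -> str:
    """Map a run timestamp to the split name used in the card.

    Dashes and colons become underscores; the fractional-second dot is kept.
    """
    return timestamp.replace("-", "_").replace(":", "_")


def load_task_details(
    task_key: str,
    repo_id: str = "open-llm-leaderboard-old/details_facebook__opt-125m",
    split: str = "latest",
):
    """Load the per-sample details for one task (requires network access).

    Assumption: `repo_id` is the repository named in this listing's title.
    The `datasets` package is imported lazily so the helpers above work
    without it installed.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, config_name(task_key), split=split)


if __name__ == "__main__":
    print(config_name("harness|hendrycksTest-abstract_algebra|5"))
    print(split_name("2024-01-23T14:31:42.504661"))
```

The default split is `latest` because that is the split name the config list in the card defines for the most recent run; pass `split_name(...)` with a run timestamp to select a specific run instead.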
Resource description:
--- pretty_name: Evaluation run of Facebook/OPT-125M dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Facebook/OPT-125M](https://huggingface.co/Facebook/OPT-125M) on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 3 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Facebook__OPT-125M\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-23T14:31:42.504661](https://huggingface.co/datasets/open-llm-leaderboard/details_Facebook__OPT-125M/blob/main/results_2024-01-23T14-31-42.504661.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.25971933524807705,\n\ \ \"acc_stderr\": 0.030727814194809005,\n \"acc_norm\": 0.26053348115143415,\n\ \ \"acc_norm_stderr\": 0.03151920852026647,\n \"mc1\": 0.23990208078335373,\n\ \ \"mc1_stderr\": 0.014948812679062133,\n \"mc2\": 0.42868550699768687,\n\ \ \"mc2_stderr\": 0.01505826026535896\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.20392491467576793,\n \"acc_stderr\": 0.011774262478702256,\n\ \ \"acc_norm\": 0.22866894197952217,\n \"acc_norm_stderr\": 0.012272853582540792\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.2920732921728739,\n\ \ \"acc_stderr\": 0.004537865171414025,\n \"acc_norm\": 0.3143796056562438,\n\ \ \"acc_norm_stderr\": 0.00463319482579384\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.27,\n \"acc_stderr\": 0.044619604333847415,\n \ \ \"acc_norm\": 0.27,\n \"acc_norm_stderr\": 0.044619604333847415\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.23703703703703705,\n\ \ \"acc_stderr\": 0.03673731683969506,\n \"acc_norm\": 0.23703703703703705,\n\ \ \"acc_norm_stderr\": 0.03673731683969506\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.2631578947368421,\n \"acc_stderr\": 0.03583496176361062,\n\ \ \"acc_norm\": 0.2631578947368421,\n \"acc_norm_stderr\": 0.03583496176361062\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.21,\n\ \ \"acc_stderr\": 0.040936018074033256,\n \"acc_norm\": 0.21,\n \ \ \"acc_norm_stderr\": 0.040936018074033256\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.21509433962264152,\n \"acc_stderr\": 0.025288394502891363,\n\ \ \"acc_norm\": 0.21509433962264152,\n \"acc_norm_stderr\": 0.025288394502891363\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.2222222222222222,\n\ \ \"acc_stderr\": 0.03476590104304134,\n \"acc_norm\": 0.2222222222222222,\n\ \ \"acc_norm_stderr\": 
0.03476590104304134\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.28,\n \"acc_stderr\": 0.04512608598542127,\n \ \ \"acc_norm\": 0.28,\n \"acc_norm_stderr\": 0.04512608598542127\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.33,\n \"acc_stderr\": 0.04725815626252604,\n \"acc_norm\": 0.33,\n\ \ \"acc_norm_stderr\": 0.04725815626252604\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.26,\n \"acc_stderr\": 0.04408440022768077,\n \ \ \"acc_norm\": 0.26,\n \"acc_norm_stderr\": 0.04408440022768077\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.2023121387283237,\n\ \ \"acc_stderr\": 0.03063114553919882,\n \"acc_norm\": 0.2023121387283237,\n\ \ \"acc_norm_stderr\": 0.03063114553919882\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.37254901960784315,\n \"acc_stderr\": 0.04810840148082633,\n\ \ \"acc_norm\": 0.37254901960784315,\n \"acc_norm_stderr\": 0.04810840148082633\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.18,\n \"acc_stderr\": 0.038612291966536955,\n \"acc_norm\": 0.18,\n\ \ \"acc_norm_stderr\": 0.038612291966536955\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.3148936170212766,\n \"acc_stderr\": 0.03036358219723816,\n\ \ \"acc_norm\": 0.3148936170212766,\n \"acc_norm_stderr\": 0.03036358219723816\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.23684210526315788,\n\ \ \"acc_stderr\": 0.039994238792813344,\n \"acc_norm\": 0.23684210526315788,\n\ \ \"acc_norm_stderr\": 0.039994238792813344\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.2482758620689655,\n \"acc_stderr\": 0.0360010569272777,\n\ \ \"acc_norm\": 0.2482758620689655,\n \"acc_norm_stderr\": 0.0360010569272777\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.2566137566137566,\n \"acc_stderr\": 0.022494510767503154,\n \"\ acc_norm\": 
0.2566137566137566,\n \"acc_norm_stderr\": 0.022494510767503154\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.14285714285714285,\n\ \ \"acc_stderr\": 0.03129843185743809,\n \"acc_norm\": 0.14285714285714285,\n\ \ \"acc_norm_stderr\": 0.03129843185743809\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.18,\n \"acc_stderr\": 0.038612291966536934,\n \ \ \"acc_norm\": 0.18,\n \"acc_norm_stderr\": 0.038612291966536934\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\ : 0.3161290322580645,\n \"acc_stderr\": 0.02645087448904277,\n \"\ acc_norm\": 0.3161290322580645,\n \"acc_norm_stderr\": 0.02645087448904277\n\ \ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\ : 0.2955665024630542,\n \"acc_stderr\": 0.032104944337514575,\n \"\ acc_norm\": 0.2955665024630542,\n \"acc_norm_stderr\": 0.032104944337514575\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.19,\n \"acc_stderr\": 0.039427724440366234,\n \"acc_norm\"\ : 0.19,\n \"acc_norm_stderr\": 0.039427724440366234\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.21212121212121213,\n \"acc_stderr\": 0.03192271569548299,\n\ \ \"acc_norm\": 0.21212121212121213,\n \"acc_norm_stderr\": 0.03192271569548299\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.2727272727272727,\n \"acc_stderr\": 0.03173071239071724,\n \"\ acc_norm\": 0.2727272727272727,\n \"acc_norm_stderr\": 0.03173071239071724\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.36787564766839376,\n \"acc_stderr\": 0.03480175668466036,\n\ \ \"acc_norm\": 0.36787564766839376,\n \"acc_norm_stderr\": 0.03480175668466036\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.34102564102564104,\n \"acc_stderr\": 0.02403548967633506,\n\ \ \"acc_norm\": 0.34102564102564104,\n \"acc_norm_stderr\": 0.02403548967633506\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.26296296296296295,\n \"acc_stderr\": 0.026842057873833706,\n \ \ \"acc_norm\": 0.26296296296296295,\n \"acc_norm_stderr\": 0.026842057873833706\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.35294117647058826,\n \"acc_stderr\": 0.031041941304059288,\n\ \ \"acc_norm\": 0.35294117647058826,\n \"acc_norm_stderr\": 0.031041941304059288\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.31788079470198677,\n \"acc_stderr\": 0.038020397601079024,\n \"\ acc_norm\": 0.31788079470198677,\n \"acc_norm_stderr\": 0.038020397601079024\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.23119266055045873,\n \"acc_stderr\": 0.01807575024163315,\n \"\ acc_norm\": 0.23119266055045873,\n \"acc_norm_stderr\": 0.01807575024163315\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.4722222222222222,\n \"acc_stderr\": 0.0340470532865388,\n \"acc_norm\"\ : 0.4722222222222222,\n \"acc_norm_stderr\": 0.0340470532865388\n },\n\ \ \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\": 0.25980392156862747,\n\ \ \"acc_stderr\": 0.03077855467869326,\n \"acc_norm\": 0.25980392156862747,\n\ \ \"acc_norm_stderr\": 0.03077855467869326\n },\n \"harness|hendrycksTest-high_school_world_history|5\"\ : {\n \"acc\": 0.25738396624472576,\n \"acc_stderr\": 0.02845882099146031,\n\ \ \"acc_norm\": 0.25738396624472576,\n \"acc_norm_stderr\": 0.02845882099146031\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.20179372197309417,\n\ \ \"acc_stderr\": 0.026936111912802273,\n \"acc_norm\": 0.20179372197309417,\n\ \ \"acc_norm_stderr\": 0.026936111912802273\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.2366412213740458,\n \"acc_stderr\": 0.03727673575596918,\n\ \ \"acc_norm\": 0.2366412213740458,\n \"acc_norm_stderr\": 0.03727673575596918\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.38016528925619836,\n \"acc_stderr\": 0.04431324501968432,\n \"\ acc_norm\": 0.38016528925619836,\n \"acc_norm_stderr\": 0.04431324501968432\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.21296296296296297,\n\ \ \"acc_stderr\": 0.0395783547198098,\n \"acc_norm\": 0.21296296296296297,\n\ \ \"acc_norm_stderr\": 0.0395783547198098\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.22085889570552147,\n \"acc_stderr\": 0.032591773927421776,\n\ \ \"acc_norm\": 0.22085889570552147,\n \"acc_norm_stderr\": 0.032591773927421776\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.15178571428571427,\n\ \ \"acc_stderr\": 0.034057028381856924,\n \"acc_norm\": 0.15178571428571427,\n\ \ \"acc_norm_stderr\": 0.034057028381856924\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.18446601941747573,\n \"acc_stderr\": 0.03840423627288276,\n\ \ \"acc_norm\": 0.18446601941747573,\n \"acc_norm_stderr\": 0.03840423627288276\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.19658119658119658,\n\ \ \"acc_stderr\": 0.02603538609895129,\n \"acc_norm\": 0.19658119658119658,\n\ \ \"acc_norm_stderr\": 0.02603538609895129\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.047609522856952344,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.047609522856952344\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.2515964240102171,\n\ \ \"acc_stderr\": 0.01551732236552963,\n \"acc_norm\": 0.2515964240102171,\n\ \ \"acc_norm_stderr\": 0.01551732236552963\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.23121387283236994,\n \"acc_stderr\": 0.02269865716785571,\n\ \ \"acc_norm\": 0.23121387283236994,\n \"acc_norm_stderr\": 0.02269865716785571\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.2424581005586592,\n\ \ \"acc_stderr\": 0.014333522059217889,\n 
\"acc_norm\": 0.2424581005586592,\n\ \ \"acc_norm_stderr\": 0.014333522059217889\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.27124183006535946,\n \"acc_stderr\": 0.02545775669666788,\n\ \ \"acc_norm\": 0.27124183006535946,\n \"acc_norm_stderr\": 0.02545775669666788\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.2379421221864952,\n\ \ \"acc_stderr\": 0.024185150647818707,\n \"acc_norm\": 0.2379421221864952,\n\ \ \"acc_norm_stderr\": 0.024185150647818707\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.2932098765432099,\n \"acc_stderr\": 0.025329888171900926,\n\ \ \"acc_norm\": 0.2932098765432099,\n \"acc_norm_stderr\": 0.025329888171900926\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.2624113475177305,\n \"acc_stderr\": 0.026244920349843007,\n \ \ \"acc_norm\": 0.2624113475177305,\n \"acc_norm_stderr\": 0.026244920349843007\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.25358539765319427,\n\ \ \"acc_stderr\": 0.011111715336101132,\n \"acc_norm\": 0.25358539765319427,\n\ \ \"acc_norm_stderr\": 0.011111715336101132\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.4485294117647059,\n \"acc_stderr\": 0.030211479609121593,\n\ \ \"acc_norm\": 0.4485294117647059,\n \"acc_norm_stderr\": 0.030211479609121593\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.21895424836601307,\n \"acc_stderr\": 0.016729937565537537,\n \ \ \"acc_norm\": 0.21895424836601307,\n \"acc_norm_stderr\": 0.016729937565537537\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.22727272727272727,\n\ \ \"acc_stderr\": 0.04013964554072774,\n \"acc_norm\": 0.22727272727272727,\n\ \ \"acc_norm_stderr\": 0.04013964554072774\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.24897959183673468,\n \"acc_stderr\": 0.027682979522960234,\n\ \ \"acc_norm\": 0.24897959183673468,\n \"acc_norm_stderr\": 
0.027682979522960234\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.23383084577114427,\n\ \ \"acc_stderr\": 0.029929415408348398,\n \"acc_norm\": 0.23383084577114427,\n\ \ \"acc_norm_stderr\": 0.029929415408348398\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.33,\n \"acc_stderr\": 0.04725815626252605,\n \ \ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.04725815626252605\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.20481927710843373,\n\ \ \"acc_stderr\": 0.03141784291663926,\n \"acc_norm\": 0.20481927710843373,\n\ \ \"acc_norm_stderr\": 0.03141784291663926\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.17543859649122806,\n \"acc_stderr\": 0.029170885500727654,\n\ \ \"acc_norm\": 0.17543859649122806,\n \"acc_norm_stderr\": 0.029170885500727654\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.23990208078335373,\n\ \ \"mc1_stderr\": 0.014948812679062133,\n \"mc2\": 0.42868550699768687,\n\ \ \"mc2_stderr\": 0.01505826026535896\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.516179952644041,\n \"acc_stderr\": 0.014045126130978601\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.002274450341167551,\n \ \ \"acc_stderr\": 0.0013121578148674316\n }\n}\n```" repo_url: https://huggingface.co/Facebook/OPT-125M leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|arc:challenge|25_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|arc:challenge|25_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-23T14-31-42.504661.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_19T00_45_29.121149 path: - '**/details_harness|drop|3_2023-10-19T00-45-29.121149.parquet' - split: latest path: - 
'**/details_harness|drop|3_2023-10-19T00-45-29.121149.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_19T00_45_29.121149 path: - '**/details_harness|gsm8k|5_2023-10-19T00-45-29.121149.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|gsm8k|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hellaswag|10_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hellaswag|10_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T14:00:10.742260.parquet' - 
'**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T14:00:10.742260.parquet' - 
'**/details_harness|hendrycksTest-human_aging|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-management|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-07-19T14:00:10.742260.parquet' - 
'**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-07-19T14:00:10.742260.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-23T14-31-42.504661.parquet' - 
'**/details_harness|hendrycksTest-high_school_biology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-23T14-31-42.504661.parquet' - 
'**/details_harness|hendrycksTest-management|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-23T14-31-42.504661.parquet' - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-23T14-31-42.504661.parquet' - 
'**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-23T14-31-42.504661.parquet' - 
'**/details_harness|hendrycksTest-philosophy|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-23T14-31-42.504661.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_astronomy_5 
data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-college_physics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-college_physics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 
2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - 
'**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-23T14-31-42.504661.parquet' - split: latest 
path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-international_law|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - 
'**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-management|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-management|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - 
'**/details_harness|hendrycksTest-philosophy|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - 
'**/details_harness|hendrycksTest-professional_medicine|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-sociology|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-23T14-31-42.504661.parquet' 
- config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-virology|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-23T14-31-42.504661.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_07_19T14_00_10.742260 path: - '**/details_harness|truthfulqa:mc|0_2023-07-19T14:00:10.742260.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|truthfulqa:mc|0_2024-01-23T14-31-42.504661.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-23T14-31-42.504661.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_19T00_45_29.121149 path: - '**/details_harness|winogrande|5_2023-10-19T00-45-29.121149.parquet' - split: 2024_01_23T14_31_42.504661 path: - '**/details_harness|winogrande|5_2024-01-23T14-31-42.504661.parquet' - split: latest path: - 
'**/details_harness|winogrande|5_2024-01-23T14-31-42.504661.parquet' - config_name: results data_files: - split: 2023_07_19T14_00_10.742260 path: - results_2023-07-19T14:00:10.742260.parquet - split: 2023_10_19T00_45_29.121149 path: - results_2023-10-19T00-45-29.121149.parquet - split: 2024_01_23T14_31_42.504661 path: - results_2024-01-23T14-31-42.504661.parquet - split: latest path: - results_2024-01-23T14-31-42.504661.parquet
---

# Dataset Card for Evaluation run of Facebook/OPT-125M

This dataset was automatically created during the evaluation run of the model [Facebook/OPT-125M](https://huggingface.co/Facebook/OPT-125M) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 3 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can, for instance, do the following:

```python
from datasets import load_dataset

# Split names come from the configuration list above: one per run timestamp, plus "latest".
data = load_dataset(
    "open-llm-leaderboard/details_Facebook__OPT-125M",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-01-23T14:31:42.504661](https://huggingface.co/datasets/open-llm-leaderboard/details_Facebook__OPT-125M/blob/main/results_2024-01-23T14-31-42.504661.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You can find each of them in the "results" configuration and in the "latest" split for each eval):

```python
{ "all": { "acc": 0.25971933524807705, "acc_stderr": 0.030727814194809005, "acc_norm": 0.26053348115143415, "acc_norm_stderr": 0.03151920852026647, "mc1": 0.23990208078335373, "mc1_stderr": 0.014948812679062133, "mc2": 0.42868550699768687, "mc2_stderr": 0.01505826026535896 }, "harness|arc:challenge|25": { "acc": 0.20392491467576793, "acc_stderr": 0.011774262478702256, "acc_norm": 0.22866894197952217, "acc_norm_stderr": 0.012272853582540792 }, "harness|hellaswag|10": { "acc": 0.2920732921728739, "acc_stderr": 0.004537865171414025, "acc_norm": 0.3143796056562438, "acc_norm_stderr": 0.00463319482579384 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.27, "acc_stderr": 0.044619604333847415, "acc_norm": 0.27, "acc_norm_stderr": 0.044619604333847415 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.23703703703703705, "acc_stderr": 0.03673731683969506, "acc_norm": 0.23703703703703705, "acc_norm_stderr": 0.03673731683969506 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.2631578947368421, "acc_stderr": 0.03583496176361062, "acc_norm": 0.2631578947368421, "acc_norm_stderr": 0.03583496176361062 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.21, "acc_stderr": 0.040936018074033256, "acc_norm": 0.21, "acc_norm_stderr": 0.040936018074033256 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.21509433962264152, "acc_stderr": 0.025288394502891363, "acc_norm": 0.21509433962264152, "acc_norm_stderr": 0.025288394502891363 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.2222222222222222, "acc_stderr": 0.03476590104304134, "acc_norm": 0.2222222222222222, "acc_norm_stderr": 0.03476590104304134 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.28, "acc_stderr": 0.04512608598542127, "acc_norm": 0.28, "acc_norm_stderr": 0.04512608598542127 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.33, "acc_stderr": 0.04725815626252604, "acc_norm": 0.33,
"acc_norm_stderr": 0.04725815626252604 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.26, "acc_stderr": 0.04408440022768077, "acc_norm": 0.26, "acc_norm_stderr": 0.04408440022768077 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.2023121387283237, "acc_stderr": 0.03063114553919882, "acc_norm": 0.2023121387283237, "acc_norm_stderr": 0.03063114553919882 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.37254901960784315, "acc_stderr": 0.04810840148082633, "acc_norm": 0.37254901960784315, "acc_norm_stderr": 0.04810840148082633 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.18, "acc_stderr": 0.038612291966536955, "acc_norm": 0.18, "acc_norm_stderr": 0.038612291966536955 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.3148936170212766, "acc_stderr": 0.03036358219723816, "acc_norm": 0.3148936170212766, "acc_norm_stderr": 0.03036358219723816 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.23684210526315788, "acc_stderr": 0.039994238792813344, "acc_norm": 0.23684210526315788, "acc_norm_stderr": 0.039994238792813344 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.2482758620689655, "acc_stderr": 0.0360010569272777, "acc_norm": 0.2482758620689655, "acc_norm_stderr": 0.0360010569272777 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.2566137566137566, "acc_stderr": 0.022494510767503154, "acc_norm": 0.2566137566137566, "acc_norm_stderr": 0.022494510767503154 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.14285714285714285, "acc_stderr": 0.03129843185743809, "acc_norm": 0.14285714285714285, "acc_norm_stderr": 0.03129843185743809 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.18, "acc_stderr": 0.038612291966536934, "acc_norm": 0.18, "acc_norm_stderr": 0.038612291966536934 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.3161290322580645, "acc_stderr": 0.02645087448904277, "acc_norm": 0.3161290322580645, "acc_norm_stderr": 0.02645087448904277 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.2955665024630542, "acc_stderr": 0.032104944337514575, "acc_norm": 0.2955665024630542, "acc_norm_stderr": 0.032104944337514575 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.19, "acc_stderr": 0.039427724440366234, "acc_norm": 0.19, "acc_norm_stderr": 0.039427724440366234 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.21212121212121213, "acc_stderr": 0.03192271569548299, "acc_norm": 0.21212121212121213, "acc_norm_stderr": 0.03192271569548299 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.2727272727272727, "acc_stderr": 0.03173071239071724, "acc_norm": 0.2727272727272727, "acc_norm_stderr": 0.03173071239071724 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.36787564766839376, "acc_stderr": 0.03480175668466036, "acc_norm": 0.36787564766839376, "acc_norm_stderr": 0.03480175668466036 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.34102564102564104, "acc_stderr": 0.02403548967633506, "acc_norm": 0.34102564102564104, "acc_norm_stderr": 0.02403548967633506 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.26296296296296295, "acc_stderr": 0.026842057873833706, "acc_norm": 0.26296296296296295, "acc_norm_stderr": 0.026842057873833706 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.35294117647058826, "acc_stderr": 0.031041941304059288, "acc_norm": 0.35294117647058826, "acc_norm_stderr": 0.031041941304059288 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.31788079470198677, "acc_stderr": 0.038020397601079024, "acc_norm": 0.31788079470198677, "acc_norm_stderr": 0.038020397601079024 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.23119266055045873, "acc_stderr": 0.01807575024163315, "acc_norm": 0.23119266055045873, "acc_norm_stderr": 0.01807575024163315 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4722222222222222, "acc_stderr": 
0.0340470532865388, "acc_norm": 0.4722222222222222, "acc_norm_stderr": 0.0340470532865388 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.25980392156862747, "acc_stderr": 0.03077855467869326, "acc_norm": 0.25980392156862747, "acc_norm_stderr": 0.03077855467869326 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.25738396624472576, "acc_stderr": 0.02845882099146031, "acc_norm": 0.25738396624472576, "acc_norm_stderr": 0.02845882099146031 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.20179372197309417, "acc_stderr": 0.026936111912802273, "acc_norm": 0.20179372197309417, "acc_norm_stderr": 0.026936111912802273 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.2366412213740458, "acc_stderr": 0.03727673575596918, "acc_norm": 0.2366412213740458, "acc_norm_stderr": 0.03727673575596918 }, "harness|hendrycksTest-international_law|5": { "acc": 0.38016528925619836, "acc_stderr": 0.04431324501968432, "acc_norm": 0.38016528925619836, "acc_norm_stderr": 0.04431324501968432 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.21296296296296297, "acc_stderr": 0.0395783547198098, "acc_norm": 0.21296296296296297, "acc_norm_stderr": 0.0395783547198098 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.22085889570552147, "acc_stderr": 0.032591773927421776, "acc_norm": 0.22085889570552147, "acc_norm_stderr": 0.032591773927421776 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.15178571428571427, "acc_stderr": 0.034057028381856924, "acc_norm": 0.15178571428571427, "acc_norm_stderr": 0.034057028381856924 }, "harness|hendrycksTest-management|5": { "acc": 0.18446601941747573, "acc_stderr": 0.03840423627288276, "acc_norm": 0.18446601941747573, "acc_norm_stderr": 0.03840423627288276 }, "harness|hendrycksTest-marketing|5": { "acc": 0.19658119658119658, "acc_stderr": 0.02603538609895129, "acc_norm": 0.19658119658119658, "acc_norm_stderr": 0.02603538609895129 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.34, 
"acc_stderr": 0.047609522856952344, "acc_norm": 0.34, "acc_norm_stderr": 0.047609522856952344 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.2515964240102171, "acc_stderr": 0.01551732236552963, "acc_norm": 0.2515964240102171, "acc_norm_stderr": 0.01551732236552963 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.23121387283236994, "acc_stderr": 0.02269865716785571, "acc_norm": 0.23121387283236994, "acc_norm_stderr": 0.02269865716785571 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.2424581005586592, "acc_stderr": 0.014333522059217889, "acc_norm": 0.2424581005586592, "acc_norm_stderr": 0.014333522059217889 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.27124183006535946, "acc_stderr": 0.02545775669666788, "acc_norm": 0.27124183006535946, "acc_norm_stderr": 0.02545775669666788 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.2379421221864952, "acc_stderr": 0.024185150647818707, "acc_norm": 0.2379421221864952, "acc_norm_stderr": 0.024185150647818707 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.2932098765432099, "acc_stderr": 0.025329888171900926, "acc_norm": 0.2932098765432099, "acc_norm_stderr": 0.025329888171900926 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.2624113475177305, "acc_stderr": 0.026244920349843007, "acc_norm": 0.2624113475177305, "acc_norm_stderr": 0.026244920349843007 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.25358539765319427, "acc_stderr": 0.011111715336101132, "acc_norm": 0.25358539765319427, "acc_norm_stderr": 0.011111715336101132 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.4485294117647059, "acc_stderr": 0.030211479609121593, "acc_norm": 0.4485294117647059, "acc_norm_stderr": 0.030211479609121593 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.21895424836601307, "acc_stderr": 0.016729937565537537, "acc_norm": 0.21895424836601307, "acc_norm_stderr": 0.016729937565537537 }, "harness|hendrycksTest-public_relations|5": { "acc": 
0.22727272727272727, "acc_stderr": 0.04013964554072774, "acc_norm": 0.22727272727272727, "acc_norm_stderr": 0.04013964554072774 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.24897959183673468, "acc_stderr": 0.027682979522960234, "acc_norm": 0.24897959183673468, "acc_norm_stderr": 0.027682979522960234 }, "harness|hendrycksTest-sociology|5": { "acc": 0.23383084577114427, "acc_stderr": 0.029929415408348398, "acc_norm": 0.23383084577114427, "acc_norm_stderr": 0.029929415408348398 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.33, "acc_stderr": 0.04725815626252605, "acc_norm": 0.33, "acc_norm_stderr": 0.04725815626252605 }, "harness|hendrycksTest-virology|5": { "acc": 0.20481927710843373, "acc_stderr": 0.03141784291663926, "acc_norm": 0.20481927710843373, "acc_norm_stderr": 0.03141784291663926 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.17543859649122806, "acc_stderr": 0.029170885500727654, "acc_norm": 0.17543859649122806, "acc_norm_stderr": 0.029170885500727654 }, "harness|truthfulqa:mc|0": { "mc1": 0.23990208078335373, "mc1_stderr": 0.014948812679062133, "mc2": 0.42868550699768687, "mc2_stderr": 0.01505826026535896 }, "harness|winogrande|5": { "acc": 0.516179952644041, "acc_stderr": 0.014045126130978601 }, "harness|gsm8k|5": { "acc": 0.002274450341167551, "acc_stderr": 0.0013121578148674316 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. 
-->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the dataset is intended to be used. -->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
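Provided by:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model Facebook/OPT-125M on the Open LLM Leaderboard. It contains 64 configurations, each corresponding to one evaluated task.

Dataset structure

  • Number of configurations: 64
  • Number of runs: the dataset was created from 3 runs. Each run appears as a separate split in every configuration, and the split is named after the run's timestamp.
  • Latest results: the "train" split always points to the latest results.
  • Aggregated results: an additional configuration, "results", stores the aggregated results of the runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.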
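The relationship between task keys in the results JSON and configuration names can be sketched as follows. This is only an illustration: the helper `task_key_to_config_name` is hypothetical, and the rule it applies (replace every run of non-alphanumeric characters with an underscore) is an assumption inferred from the single pair visible in the card, `harness|winogrande|5` and `harness_winogrande_5`. Other configurations may be named differently.

```python
import re


def task_key_to_config_name(task_key: str) -> str:
    """Guess a configuration name from a results key.

    Assumption (not confirmed by the card): each run of characters that
    are not ASCII letters or digits becomes a single underscore.
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key).strip("_")


# The only pair that the dataset card itself shows.
print(task_key_to_config_name("harness|winogrande|5"))
```

If the guessed name does not exist in the repository, the list of configurations on the dataset page is the authoritative source.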

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Facebook__OPT-125M",
    "harness_winogrande_5",
    split="train",
)
```

Latest results

The latest results come from the run of 2024-01-23T14:31:42.504661. The per-task breakdown is given in full in the dataset card above; the aggregate ("all") entry is:

  • acc: 0.25971933524807705 (acc_stderr: 0.030727814194809005)
  • acc_norm: 0.26053348115143415 (acc_norm_stderr: 0.03151920852026647)
  • mc1: 0.23990208078335373 (mc1_stderr: 0.014948812679062133)
  • mc2: 0.42868550699768687 (mc2_stderr: 0.01505826026535896)
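Each accuracy in the results is reported with a standard error, and one way to read the pair is as an approximate confidence interval. The sketch below uses the `harness|winogrande|5` values from the card. It assumes a normal approximation (z = 1.96 for roughly 95% coverage) and a chance level of 0.5 for this two-option task; the helper name `confidence_interval` is hypothetical, and neither assumption is stated in the card.

```python
def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Return (low, high) for acc +/- z * stderr (normal approximation, assumed)."""
    margin = z * stderr
    return acc - margin, acc + margin


# Values copied from the "harness|winogrande|5" entry of the results JSON.
winogrande = {"acc": 0.516179952644041, "acc_stderr": 0.014045126130978601}

low, high = confidence_interval(winogrande["acc"], winogrande["acc_stderr"])
chance_level = 0.5  # assumption: two answer options per item

# True when the interval contains the chance level, i.e. the score is not
# clearly distinguishable from random guessing under these assumptions.
includes_chance = low < chance_level < high
print(f"[{low:.4f}, {high:.4f}] includes chance: {includes_chance}")
```

Under these assumptions the interval spans the chance level, so the winogrande score alone does not show that the model beats random guessing.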

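Collected summary
Dataset introduction
Construction
In the field of large language model evaluation, this dataset is an automated by-product of the Open LLM Leaderboard evaluation pipeline. When a model such as Facebook/OPT-125M is benchmarked on multiple tasks on the Leaderboard, the system automatically captures and structures the detailed results of each evaluation run. The dataset was generated from three separate evaluation runs, each stored as a split named after its timestamp, and consolidated into a set of configurations covering 64 evaluation tasks. This construction keeps the evaluation process transparent and the results traceable, and it provides a precise basis for comparing a model's performance over time.
Features
A notable feature of this dataset is its fine-grained structure: the multi-dimensional evaluation results are divided by task configuration. Each configuration corresponds to one evaluation task, such as the ARC Challenge or HellaSwag commonsense reasoning, and contains a split for each historical evaluation run. The dataset also includes an aggregate configuration named "results", which stores the model's overall performance metrics across all tasks. This design supports in-depth analysis of performance on a single task and also lets researchers assess the model's overall capability profile.
Usage
To analyze model performance with this dataset, researchers can load it with the Hugging Face `datasets` library. Calling `load_dataset` with the dataset name, the target configuration (for example `harness_winogrande_5`), and the split (usually "latest" to obtain the most recent results) gives access to the detailed evaluation records for that task. The loaded data is structured and includes key measures such as accuracy and its standard error, which supports further processing and visualization. This makes it possible to compare quantitatively how models such as OPT-125M perform across knowledge domains and reasoning tasks.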
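As an illustration of the further processing mentioned above, the following sketch averages accuracy over a group of tasks selected by key prefix. It works on a small hand-copied excerpt of the results JSON rather than on data fetched from the Hub, and the helper `mean_accuracy` is a hypothetical name introduced here.

```python
from statistics import mean

# Excerpt copied from the results JSON in the dataset card (not the full set).
results_excerpt = {
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.27},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.23703703703703705},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.2631578947368421},
    "harness|winogrande|5": {"acc": 0.516179952644041},
}


def mean_accuracy(results: dict, prefix: str) -> float:
    """Average the "acc" field over all tasks whose key starts with prefix."""
    scores = [
        metrics["acc"]
        for task, metrics in results.items()
        if task.startswith(prefix) and "acc" in metrics
    ]
    if not scores:
        raise ValueError(f"no tasks with an 'acc' field match prefix {prefix!r}")
    return mean(scores)


# Average over the three hendrycksTest tasks in the excerpt only.
mmlu_subset_mean = mean_accuracy(results_excerpt, "harness|hendrycksTest-")
print(round(mmlu_subset_mean, 4))
```

The same function could be applied to the complete results dictionary to average over every `hendrycksTest` task; the excerpt keeps the example self-contained.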
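Background and challenges
Background
As large language models (LLMs) develop rapidly, systematic and standardized evaluation of model performance has become key to progress in the field. The Open LLM Leaderboard, created by the HuggingFace team in 2023, aims to provide a transparent and reproducible evaluation platform for quantifying LLM capabilities across diverse tasks. The platform integrates several established benchmarks, such as ARC, HellaSwag, and MMLU, to assess models along multiple dimensions and give the research community a reliable basis for comparing performance. The dataset 'open-llm-leaderboard-old/details_facebook__opt-125m' is the detailed record of results generated automatically when this platform evaluated the Facebook OPT-125M model. It covers 64 task configurations and reflects the model's performance in commonsense reasoning, specialized knowledge, and mathematical problem solving, providing data to support model optimization and academic research.
Current challenges
The core challenge this dataset addresses is how to evaluate the multi-domain capabilities of large language models comprehensively and fairly. On scientific reasoning in the ARC Challenge, commonsense inference in HellaSwag, and massive multitask understanding in MMLU, a model must cope with semantic ambiguity and generalize its knowledge. In construction, the challenges lie mainly in integrating the evaluation frameworks and keeping the data consistent. First, heterogeneous benchmarks must be brought into one standardized pipeline so that scoring metrics are comparable. Second, the time-series data produced by multiple evaluation runs (for example, the runs between July 2023 and January 2024) must be managed through splits to avoid confusing versions and to keep results traceable. In addition, because the dataset is updated dynamically, its structure must accommodate new tasks, which places ongoing demands on metadata design and storage efficiency.
Common scenarios
Typical use cases
In large language model evaluation, this dataset, as the output of Open LLM Leaderboard evaluation runs, is typically used to provide standardized benchmark results for a model. By loading the dataset's configurations, researchers can analyze in fine detail how Facebook/OPT-125M performs across the 64 task configurations, which span commonsense reasoning, subject knowledge, and mathematical ability. This structured evaluation makes side-by-side comparison between models possible and gives the academic community a reproducible evaluation framework.
Practical applications
In practice, the dataset gives industry a basis for choosing a suitable model. Depending on the needs of a specific scenario, a company can consult the model's results on particular tasks (such as professional medicine, law, or programming) and select the language model that best fits its requirements. The results recorded from multiple evaluation runs can also be used to track performance changes across model iterations, providing monitoring metrics for continuous integration and deployment and reducing the risk of model selection and rollout.
Derived work
Work derived from this dataset mainly concerns new and extended evaluation paradigms. For example, later studies built on its multi-task evaluation framework to develop dedicated benchmark suites for specific areas such as code generation and multilingual understanding. The dataset also prompted reflection on and improvement of the evaluation metrics themselves, leading to newer approaches such as dynamic evaluation and adversarial testing. Together, this work has moved LLM evaluation from single metrics toward ecosystem-level, scenario-based evaluation.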
The content above was collected and summarized by 遇见数据集.