
open-llm-leaderboard-old/details_Felladrin__Llama-68M-Chat-v1

Hugging Face · Updated 2024-01-14 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_Felladrin__Llama-68M-Chat-v1
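The configuration and split names in the card metadata below appear to be derived mechanically from the task keys (for example `harness|arc:challenge|25`) and the run timestamp. The following is a minimal sketch of that mapping, inferred only from the names visible in this card; `config_name` and `split_name` are hypothetical helpers, not part of any library, and the replacement rule is an assumption rather than the leaderboard's documented behavior.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a results key such as 'harness|arc:challenge|25' into a config name.

    Assumption: every character that is not a letter or digit becomes '_',
    which reproduces the config names listed in this card.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(timestamp: str) -> str:
    """Turn a run timestamp into the timestamped split name used in the card.

    Assumption: '-' and ':' become '_' while the fractional-second dot is kept.
    """
    return timestamp.replace("-", "_").replace(":", "_")


if __name__ == "__main__":
    print(config_name("harness|hendrycksTest-abstract_algebra|5"))
    print(split_name("2024-01-14T17:25:12.605913"))
```

The resulting strings are the kind of values that would be passed as the configuration name and `split` argument when loading the dataset, as in the card's own loading example.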
Resource description:
--- pretty_name: Evaluation run of Felladrin/Llama-68M-Chat-v1 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Felladrin/Llama-68M-Chat-v1](https://huggingface.co/Felladrin/Llama-68M-Chat-v1)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Felladrin__Llama-68M-Chat-v1\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-14T17:25:12.605913](https://huggingface.co/datasets/open-llm-leaderboard/details_Felladrin__Llama-68M-Chat-v1/blob/main/results_2024-01-14T17-25-12.605913.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.2518558528274769,\n\ \ \"acc_stderr\": 0.030387282193610175,\n \"acc_norm\": 0.25203959947439164,\n\ \ \"acc_norm_stderr\": 0.031196164528136557,\n \"mc1\": 0.2741738066095471,\n\ \ \"mc1_stderr\": 0.015616518497219376,\n \"mc2\": 0.4726841055154348,\n\ \ \"mc2_stderr\": 0.015727848850119193\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.1885665529010239,\n \"acc_stderr\": 0.011430897647675815,\n\ \ \"acc_norm\": 0.23293515358361774,\n \"acc_norm_stderr\": 0.012352507042617405\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.27693686516630156,\n\ \ \"acc_stderr\": 0.004465704810893541,\n \"acc_norm\": 0.28271260705038836,\n\ \ \"acc_norm_stderr\": 0.004493975527386726\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.27,\n \"acc_stderr\": 0.04461960433384741,\n \ \ \"acc_norm\": 0.27,\n \"acc_norm_stderr\": 0.04461960433384741\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.22962962962962963,\n\ \ \"acc_stderr\": 0.03633384414073461,\n \"acc_norm\": 0.22962962962962963,\n\ \ \"acc_norm_stderr\": 0.03633384414073461\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.17763157894736842,\n \"acc_stderr\": 0.031103182383123398,\n\ \ \"acc_norm\": 0.17763157894736842,\n \"acc_norm_stderr\": 0.031103182383123398\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.17,\n\ \ \"acc_stderr\": 0.0377525168068637,\n \"acc_norm\": 0.17,\n \ \ \"acc_norm_stderr\": 0.0377525168068637\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.21132075471698114,\n \"acc_stderr\": 0.025125766484827845,\n\ \ \"acc_norm\": 0.21132075471698114,\n \"acc_norm_stderr\": 0.025125766484827845\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.2152777777777778,\n\ \ \"acc_stderr\": 0.03437079344106135,\n \"acc_norm\": 0.2152777777777778,\n\ \ \"acc_norm_stderr\": 
0.03437079344106135\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.26,\n \"acc_stderr\": 0.0440844002276808,\n \ \ \"acc_norm\": 0.26,\n \"acc_norm_stderr\": 0.0440844002276808\n },\n\ \ \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\": 0.23,\n\ \ \"acc_stderr\": 0.042295258468165065,\n \"acc_norm\": 0.23,\n \ \ \"acc_norm_stderr\": 0.042295258468165065\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.23,\n \"acc_stderr\": 0.042295258468165065,\n \ \ \"acc_norm\": 0.23,\n \"acc_norm_stderr\": 0.042295258468165065\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.30057803468208094,\n\ \ \"acc_stderr\": 0.03496101481191181,\n \"acc_norm\": 0.30057803468208094,\n\ \ \"acc_norm_stderr\": 0.03496101481191181\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.21568627450980393,\n \"acc_stderr\": 0.04092563958237654,\n\ \ \"acc_norm\": 0.21568627450980393,\n \"acc_norm_stderr\": 0.04092563958237654\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.19,\n \"acc_stderr\": 0.039427724440366234,\n \"acc_norm\": 0.19,\n\ \ \"acc_norm_stderr\": 0.039427724440366234\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.1829787234042553,\n \"acc_stderr\": 0.025276041000449966,\n\ \ \"acc_norm\": 0.1829787234042553,\n \"acc_norm_stderr\": 0.025276041000449966\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.21929824561403508,\n\ \ \"acc_stderr\": 0.03892431106518754,\n \"acc_norm\": 0.21929824561403508,\n\ \ \"acc_norm_stderr\": 0.03892431106518754\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.20689655172413793,\n \"acc_stderr\": 0.03375672449560554,\n\ \ \"acc_norm\": 0.20689655172413793,\n \"acc_norm_stderr\": 0.03375672449560554\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.25925925925925924,\n \"acc_stderr\": 0.022569897074918417,\n \"\ acc_norm\": 
0.25925925925925924,\n \"acc_norm_stderr\": 0.022569897074918417\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.15079365079365079,\n\ \ \"acc_stderr\": 0.03200686497287392,\n \"acc_norm\": 0.15079365079365079,\n\ \ \"acc_norm_stderr\": 0.03200686497287392\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.3161290322580645,\n\ \ \"acc_stderr\": 0.02645087448904277,\n \"acc_norm\": 0.3161290322580645,\n\ \ \"acc_norm_stderr\": 0.02645087448904277\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.26108374384236455,\n \"acc_stderr\": 0.030903796952114468,\n\ \ \"acc_norm\": 0.26108374384236455,\n \"acc_norm_stderr\": 0.030903796952114468\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.17,\n \"acc_stderr\": 0.0377525168068637,\n \"acc_norm\"\ : 0.17,\n \"acc_norm_stderr\": 0.0377525168068637\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.2545454545454545,\n \"acc_stderr\": 0.03401506715249039,\n\ \ \"acc_norm\": 0.2545454545454545,\n \"acc_norm_stderr\": 0.03401506715249039\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.3383838383838384,\n \"acc_stderr\": 0.03371124142626302,\n \"\ acc_norm\": 0.3383838383838384,\n \"acc_norm_stderr\": 0.03371124142626302\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.33678756476683935,\n \"acc_stderr\": 0.03410780251836184,\n\ \ \"acc_norm\": 0.33678756476683935,\n \"acc_norm_stderr\": 0.03410780251836184\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.34102564102564104,\n \"acc_stderr\": 0.02403548967633507,\n\ \ \"acc_norm\": 0.34102564102564104,\n \"acc_norm_stderr\": 0.02403548967633507\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.26296296296296295,\n \"acc_stderr\": 0.026842057873833706,\n \ \ \"acc_norm\": 0.26296296296296295,\n \"acc_norm_stderr\": 0.026842057873833706\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.3445378151260504,\n \"acc_stderr\": 0.030868682604121633,\n\ \ \"acc_norm\": 0.3445378151260504,\n \"acc_norm_stderr\": 0.030868682604121633\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.33774834437086093,\n \"acc_stderr\": 0.03861557546255169,\n \"\ acc_norm\": 0.33774834437086093,\n \"acc_norm_stderr\": 0.03861557546255169\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.23853211009174313,\n \"acc_stderr\": 0.01827257581023187,\n \"\ acc_norm\": 0.23853211009174313,\n \"acc_norm_stderr\": 0.01827257581023187\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.4722222222222222,\n \"acc_stderr\": 0.0340470532865388,\n \"acc_norm\"\ : 0.4722222222222222,\n \"acc_norm_stderr\": 0.0340470532865388\n },\n\ \ \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\": 0.24509803921568626,\n\ \ \"acc_stderr\": 0.030190282453501947,\n \"acc_norm\": 0.24509803921568626,\n\ \ \"acc_norm_stderr\": 0.030190282453501947\n },\n \"harness|hendrycksTest-high_school_world_history|5\"\ : {\n \"acc\": 0.2489451476793249,\n \"acc_stderr\": 0.028146970599422644,\n\ \ \"acc_norm\": 0.2489451476793249,\n \"acc_norm_stderr\": 0.028146970599422644\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.21076233183856502,\n\ \ \"acc_stderr\": 0.027373095500540193,\n \"acc_norm\": 0.21076233183856502,\n\ \ \"acc_norm_stderr\": 0.027373095500540193\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.25190839694656486,\n \"acc_stderr\": 0.03807387116306086,\n\ \ \"acc_norm\": 0.25190839694656486,\n \"acc_norm_stderr\": 0.03807387116306086\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.24793388429752067,\n \"acc_stderr\": 0.03941897526516303,\n \"\ acc_norm\": 0.24793388429752067,\n \"acc_norm_stderr\": 0.03941897526516303\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.18518518518518517,\n\ \ \"acc_stderr\": 0.03755265865037181,\n \"acc_norm\": 0.18518518518518517,\n\ \ \"acc_norm_stderr\": 0.03755265865037181\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.2331288343558282,\n \"acc_stderr\": 0.033220157957767414,\n\ \ \"acc_norm\": 0.2331288343558282,\n \"acc_norm_stderr\": 0.033220157957767414\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.1875,\n\ \ \"acc_stderr\": 0.0370468111477387,\n \"acc_norm\": 0.1875,\n \ \ \"acc_norm_stderr\": 0.0370468111477387\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.22330097087378642,\n \"acc_stderr\": 0.04123553189891431,\n\ \ \"acc_norm\": 0.22330097087378642,\n \"acc_norm_stderr\": 0.04123553189891431\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.19658119658119658,\n\ \ \"acc_stderr\": 0.02603538609895129,\n \"acc_norm\": 0.19658119658119658,\n\ \ \"acc_norm_stderr\": 0.02603538609895129\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.2835249042145594,\n\ \ \"acc_stderr\": 0.01611731816683228,\n \"acc_norm\": 0.2835249042145594,\n\ \ \"acc_norm_stderr\": 0.01611731816683228\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.24855491329479767,\n \"acc_stderr\": 0.023267528432100174,\n\ \ \"acc_norm\": 0.24855491329479767,\n \"acc_norm_stderr\": 0.023267528432100174\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.24134078212290502,\n\ \ \"acc_stderr\": 0.014310999547961438,\n \"acc_norm\": 
0.24134078212290502,\n\ \ \"acc_norm_stderr\": 0.014310999547961438\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.22549019607843138,\n \"acc_stderr\": 0.023929155517351298,\n\ \ \"acc_norm\": 0.22549019607843138,\n \"acc_norm_stderr\": 0.023929155517351298\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.2990353697749196,\n\ \ \"acc_stderr\": 0.026003301117885135,\n \"acc_norm\": 0.2990353697749196,\n\ \ \"acc_norm_stderr\": 0.026003301117885135\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.24691358024691357,\n \"acc_stderr\": 0.02399350170904211,\n\ \ \"acc_norm\": 0.24691358024691357,\n \"acc_norm_stderr\": 0.02399350170904211\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.2553191489361702,\n \"acc_stderr\": 0.02601199293090201,\n \ \ \"acc_norm\": 0.2553191489361702,\n \"acc_norm_stderr\": 0.02601199293090201\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.24511082138200782,\n\ \ \"acc_stderr\": 0.010986307870045517,\n \"acc_norm\": 0.24511082138200782,\n\ \ \"acc_norm_stderr\": 0.010986307870045517\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.4117647058823529,\n \"acc_stderr\": 0.029896163033125478,\n\ \ \"acc_norm\": 0.4117647058823529,\n \"acc_norm_stderr\": 0.029896163033125478\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.2549019607843137,\n \"acc_stderr\": 0.017630827375148383,\n \ \ \"acc_norm\": 0.2549019607843137,\n \"acc_norm_stderr\": 0.017630827375148383\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.2,\n\ \ \"acc_stderr\": 0.03831305140884603,\n \"acc_norm\": 0.2,\n \ \ \"acc_norm_stderr\": 0.03831305140884603\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.23673469387755103,\n \"acc_stderr\": 0.02721283588407316,\n\ \ \"acc_norm\": 0.23673469387755103,\n \"acc_norm_stderr\": 0.02721283588407316\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.23880597014925373,\n\ \ \"acc_stderr\": 0.030147775935409224,\n \"acc_norm\": 0.23880597014925373,\n\ \ \"acc_norm_stderr\": 0.030147775935409224\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.26,\n \"acc_stderr\": 0.04408440022768079,\n \ \ \"acc_norm\": 0.26,\n \"acc_norm_stderr\": 0.04408440022768079\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.25301204819277107,\n\ \ \"acc_stderr\": 0.033844291552331346,\n \"acc_norm\": 0.25301204819277107,\n\ \ \"acc_norm_stderr\": 0.033844291552331346\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.21052631578947367,\n \"acc_stderr\": 0.0312678171466318,\n\ \ \"acc_norm\": 0.21052631578947367,\n \"acc_norm_stderr\": 0.0312678171466318\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.2741738066095471,\n\ \ \"mc1_stderr\": 0.015616518497219376,\n \"mc2\": 0.4726841055154348,\n\ \ \"mc2_stderr\": 0.015727848850119193\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.5430149960536701,\n \"acc_stderr\": 0.01400038676159829\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.0,\n \"acc_stderr\"\ : 0.0\n }\n}\n```" repo_url: https://huggingface.co/Felladrin/Llama-68M-Chat-v1 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|arc:challenge|25_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-14T17-25-12.605913.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|gsm8k|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_14T17_25_12.605913 path: - 
'**/details_harness|hellaswag|10_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T17-25-12.605913.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T17-25-12.605913.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T17-25-12.605913.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T17-25-12.605913.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T17-25-12.605913.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-14T17-25-12.605913.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T17-25-12.605913.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-management|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T17-25-12.605913.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|truthfulqa:mc|0_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-14T17-25-12.605913.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_14T17_25_12.605913 path: - '**/details_harness|winogrande|5_2024-01-14T17-25-12.605913.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-14T17-25-12.605913.parquet' - config_name: results data_files: - split: 
2024_01_14T17_25_12.605913 path: - results_2024-01-14T17-25-12.605913.parquet - split: latest path: - results_2024-01-14T17-25-12.605913.parquet ---

# Dataset Card for Evaluation run of Felladrin/Llama-68M-Chat-v1

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [Felladrin/Llama-68M-Chat-v1](https://huggingface.co/Felladrin/Llama-68M-Chat-v1) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Felladrin__Llama-68M-Chat-v1",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-01-14T17:25:12.605913](https://huggingface.co/datasets/open-llm-leaderboard/details_Felladrin__Llama-68M-Chat-v1/blob/main/results_2024-01-14T17-25-12.605913.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.2518558528274769, "acc_stderr": 0.030387282193610175, "acc_norm": 0.25203959947439164, "acc_norm_stderr": 0.031196164528136557, "mc1": 0.2741738066095471, "mc1_stderr": 0.015616518497219376, "mc2": 0.4726841055154348, "mc2_stderr": 0.015727848850119193 }, "harness|arc:challenge|25": { "acc": 0.1885665529010239, "acc_stderr": 0.011430897647675815, "acc_norm": 0.23293515358361774, "acc_norm_stderr": 0.012352507042617405 }, "harness|hellaswag|10": { "acc": 0.27693686516630156, "acc_stderr": 0.004465704810893541, "acc_norm": 0.28271260705038836, "acc_norm_stderr": 0.004493975527386726 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.27, "acc_stderr": 0.04461960433384741, "acc_norm": 0.27, "acc_norm_stderr": 0.04461960433384741 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.22962962962962963, "acc_stderr": 0.03633384414073461, "acc_norm": 0.22962962962962963, "acc_norm_stderr": 0.03633384414073461 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.17763157894736842, "acc_stderr": 0.031103182383123398, "acc_norm": 0.17763157894736842, "acc_norm_stderr": 0.031103182383123398 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.17, "acc_stderr": 0.0377525168068637, "acc_norm": 0.17, "acc_norm_stderr": 0.0377525168068637 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.21132075471698114, "acc_stderr": 0.025125766484827845, "acc_norm": 0.21132075471698114, "acc_norm_stderr": 0.025125766484827845 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.2152777777777778, "acc_stderr": 0.03437079344106135, "acc_norm": 0.2152777777777778, "acc_norm_stderr": 0.03437079344106135 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.26, "acc_stderr": 0.0440844002276808, "acc_norm": 0.26, "acc_norm_stderr": 0.0440844002276808 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.23, "acc_stderr": 0.042295258468165065, "acc_norm": 0.23, 
"acc_norm_stderr": 0.042295258468165065 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.23, "acc_stderr": 0.042295258468165065, "acc_norm": 0.23, "acc_norm_stderr": 0.042295258468165065 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.30057803468208094, "acc_stderr": 0.03496101481191181, "acc_norm": 0.30057803468208094, "acc_norm_stderr": 0.03496101481191181 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.21568627450980393, "acc_stderr": 0.04092563958237654, "acc_norm": 0.21568627450980393, "acc_norm_stderr": 0.04092563958237654 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.19, "acc_stderr": 0.039427724440366234, "acc_norm": 0.19, "acc_norm_stderr": 0.039427724440366234 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.1829787234042553, "acc_stderr": 0.025276041000449966, "acc_norm": 0.1829787234042553, "acc_norm_stderr": 0.025276041000449966 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.21929824561403508, "acc_stderr": 0.03892431106518754, "acc_norm": 0.21929824561403508, "acc_norm_stderr": 0.03892431106518754 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.20689655172413793, "acc_stderr": 0.03375672449560554, "acc_norm": 0.20689655172413793, "acc_norm_stderr": 0.03375672449560554 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.25925925925925924, "acc_stderr": 0.022569897074918417, "acc_norm": 0.25925925925925924, "acc_norm_stderr": 0.022569897074918417 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.15079365079365079, "acc_stderr": 0.03200686497287392, "acc_norm": 0.15079365079365079, "acc_norm_stderr": 0.03200686497287392 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.3161290322580645, "acc_stderr": 0.02645087448904277, "acc_norm": 0.3161290322580645, "acc_norm_stderr": 0.02645087448904277 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.26108374384236455, "acc_stderr": 0.030903796952114468, "acc_norm": 0.26108374384236455, "acc_norm_stderr": 0.030903796952114468 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.17, "acc_stderr": 0.0377525168068637, "acc_norm": 0.17, "acc_norm_stderr": 0.0377525168068637 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.2545454545454545, "acc_stderr": 0.03401506715249039, "acc_norm": 0.2545454545454545, "acc_norm_stderr": 0.03401506715249039 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.3383838383838384, "acc_stderr": 0.03371124142626302, "acc_norm": 0.3383838383838384, "acc_norm_stderr": 0.03371124142626302 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.33678756476683935, "acc_stderr": 0.03410780251836184, "acc_norm": 0.33678756476683935, "acc_norm_stderr": 0.03410780251836184 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.34102564102564104, "acc_stderr": 0.02403548967633507, "acc_norm": 0.34102564102564104, "acc_norm_stderr": 0.02403548967633507 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.26296296296296295, "acc_stderr": 0.026842057873833706, "acc_norm": 0.26296296296296295, "acc_norm_stderr": 0.026842057873833706 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.3445378151260504, "acc_stderr": 0.030868682604121633, "acc_norm": 0.3445378151260504, "acc_norm_stderr": 0.030868682604121633 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.33774834437086093, "acc_stderr": 0.03861557546255169, "acc_norm": 0.33774834437086093, "acc_norm_stderr": 0.03861557546255169 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.23853211009174313, "acc_stderr": 0.01827257581023187, "acc_norm": 0.23853211009174313, "acc_norm_stderr": 0.01827257581023187 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4722222222222222, "acc_stderr": 
0.0340470532865388, "acc_norm": 0.4722222222222222, "acc_norm_stderr": 0.0340470532865388 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.24509803921568626, "acc_stderr": 0.030190282453501947, "acc_norm": 0.24509803921568626, "acc_norm_stderr": 0.030190282453501947 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.2489451476793249, "acc_stderr": 0.028146970599422644, "acc_norm": 0.2489451476793249, "acc_norm_stderr": 0.028146970599422644 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.21076233183856502, "acc_stderr": 0.027373095500540193, "acc_norm": 0.21076233183856502, "acc_norm_stderr": 0.027373095500540193 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.25190839694656486, "acc_stderr": 0.03807387116306086, "acc_norm": 0.25190839694656486, "acc_norm_stderr": 0.03807387116306086 }, "harness|hendrycksTest-international_law|5": { "acc": 0.24793388429752067, "acc_stderr": 0.03941897526516303, "acc_norm": 0.24793388429752067, "acc_norm_stderr": 0.03941897526516303 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.18518518518518517, "acc_stderr": 0.03755265865037181, "acc_norm": 0.18518518518518517, "acc_norm_stderr": 0.03755265865037181 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.2331288343558282, "acc_stderr": 0.033220157957767414, "acc_norm": 0.2331288343558282, "acc_norm_stderr": 0.033220157957767414 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.1875, "acc_stderr": 0.0370468111477387, "acc_norm": 0.1875, "acc_norm_stderr": 0.0370468111477387 }, "harness|hendrycksTest-management|5": { "acc": 0.22330097087378642, "acc_stderr": 0.04123553189891431, "acc_norm": 0.22330097087378642, "acc_norm_stderr": 0.04123553189891431 }, "harness|hendrycksTest-marketing|5": { "acc": 0.19658119658119658, "acc_stderr": 0.02603538609895129, "acc_norm": 0.19658119658119658, "acc_norm_stderr": 0.02603538609895129 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, 
"acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.2835249042145594, "acc_stderr": 0.01611731816683228, "acc_norm": 0.2835249042145594, "acc_norm_stderr": 0.01611731816683228 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.24855491329479767, "acc_stderr": 0.023267528432100174, "acc_norm": 0.24855491329479767, "acc_norm_stderr": 0.023267528432100174 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.24134078212290502, "acc_stderr": 0.014310999547961438, "acc_norm": 0.24134078212290502, "acc_norm_stderr": 0.014310999547961438 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.22549019607843138, "acc_stderr": 0.023929155517351298, "acc_norm": 0.22549019607843138, "acc_norm_stderr": 0.023929155517351298 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.2990353697749196, "acc_stderr": 0.026003301117885135, "acc_norm": 0.2990353697749196, "acc_norm_stderr": 0.026003301117885135 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.24691358024691357, "acc_stderr": 0.02399350170904211, "acc_norm": 0.24691358024691357, "acc_norm_stderr": 0.02399350170904211 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.2553191489361702, "acc_stderr": 0.02601199293090201, "acc_norm": 0.2553191489361702, "acc_norm_stderr": 0.02601199293090201 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.24511082138200782, "acc_stderr": 0.010986307870045517, "acc_norm": 0.24511082138200782, "acc_norm_stderr": 0.010986307870045517 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.4117647058823529, "acc_stderr": 0.029896163033125478, "acc_norm": 0.4117647058823529, "acc_norm_stderr": 0.029896163033125478 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.2549019607843137, "acc_stderr": 0.017630827375148383, "acc_norm": 0.2549019607843137, "acc_norm_stderr": 0.017630827375148383 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.2, "acc_stderr": 0.03831305140884603, 
"acc_norm": 0.2, "acc_norm_stderr": 0.03831305140884603 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.23673469387755103, "acc_stderr": 0.02721283588407316, "acc_norm": 0.23673469387755103, "acc_norm_stderr": 0.02721283588407316 }, "harness|hendrycksTest-sociology|5": { "acc": 0.23880597014925373, "acc_stderr": 0.030147775935409224, "acc_norm": 0.23880597014925373, "acc_norm_stderr": 0.030147775935409224 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.26, "acc_stderr": 0.04408440022768079, "acc_norm": 0.26, "acc_norm_stderr": 0.04408440022768079 }, "harness|hendrycksTest-virology|5": { "acc": 0.25301204819277107, "acc_stderr": 0.033844291552331346, "acc_norm": 0.25301204819277107, "acc_norm_stderr": 0.033844291552331346 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.21052631578947367, "acc_stderr": 0.0312678171466318, "acc_norm": 0.21052631578947367, "acc_norm_stderr": 0.0312678171466318 }, "harness|truthfulqa:mc|0": { "mc1": 0.2741738066095471, "mc1_stderr": 0.015616518497219376, "mc2": 0.4726841055154348, "mc2_stderr": 0.015727848850119193 }, "harness|winogrande|5": { "acc": 0.5430149960536701, "acc_stderr": 0.01400038676159829 }, "harness|gsm8k|5": { "acc": 0.0, "acc_stderr": 0.0 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provider:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model Felladrin/Llama-68M-Chat-v1 on the Open LLM Leaderboard.

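Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a separate split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional configuration, "results", stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.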
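The naming scheme described above can be sketched in a few lines. This is a minimal illustration, not the leaderboard's own code: the helper names `config_name` and `split_name` are hypothetical, and the replacement rules are inferred from the configuration and split names listed in the dataset card (for example `harness_truthfulqa_mc_0` and `2024_01_14T17_25_12.605913`).

```python
import re


def config_name(task_key: str) -> str:
    """Turn a harness task key into a configuration name.

    Assumption: each run of non-alphanumeric characters ("|", ":", "-", "_")
    is replaced by a single underscore.
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into a split name.

    Assumption: "-" and ":" become "_", while the "." before the
    fractional seconds is kept.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


print(config_name("harness|hendrycksTest-college_physics|5"))
print(split_name("2024-01-14T17:25:12.605913"))
```

The two printed values can be compared against the `config_name` and `split` entries in the card's YAML header to check that the inferred rules hold for a given task.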

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Felladrin__Llama-68M-Chat-v1",
    "harness_winogrande_5",
    split="latest",
)
```

Latest results

The latest results come from run 2024-01-14T17:25:12.605913; the per-task metrics appear in the "Latest results" section of the dataset card above.

Collected Summary
Dataset Introduction
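Construction
Amid the rapid growth of large language model evaluation, the Open LLM Leaderboard serves as a standard platform for quantifying model performance. This dataset was generated automatically during one complete evaluation run of the Felladrin/Llama-68M-Chat-v1 model on that leaderboard. It is organized around 63 configurations, each corresponding to one evaluated task and recording the model's results on that benchmark. The data comes from a single run. Each run is stored as a separate split named after the run's timestamp, while the "train" split always points to the latest results. An additional configuration named "results" stores all aggregated metrics of the run and is used to compute and display the model's aggregated scores on the leaderboard.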
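Features
The dataset's most notable features are its fine-grained organization and its update mechanism. Its 63 configurations separate the evaluation details by task, covering ARC-Challenge, HellaSwag, the multi-subject MMLU suite, TruthfulQA, Winogrande, and GSM8K. Within each configuration, splits are keyed by run timestamp, so researchers can trace results back to a specific run, while the "train" split always points to the latest results. This design gives a full snapshot of the model's capabilities and, because earlier splits are retained, makes it possible to analyze how measured performance changes across runs.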
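Usage
Researchers can load the data with the Hugging Face datasets library. For example, specifying a configuration name (such as "harness_winogrande_5") and a split (such as "train") returns the latest evaluation details for that task. To review a specific earlier run, set the split argument to the corresponding timestamp string. The loaded data is stored in Parquet format, which is efficient to process. In addition, the aggregated metrics in the "results" configuration can be used directly to reproduce or verify the scores shown on the Open LLM Leaderboard, providing a solid data basis for model comparison and academic research.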
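The loading steps described above can be sketched as follows. This is a minimal sketch, not the dataset's official tooling: the helper names `config_name`, `run_split_name`, and `load_details` are hypothetical, and the two naming rules (replacing `|`, `:`, and `-` with `_`) are assumptions inferred from names such as "harness_winogrande_5", so they should be checked against the repository's actual configuration and split names. The repository id defaults to the one in this page's title; the dataset card's own snippet uses the `open-llm-leaderboard/` prefix instead.

```python
import re

# Repository id taken from this page's title; the dataset card's own snippet
# uses "open-llm-leaderboard/details_Felladrin__Llama-68M-Chat-v1" instead.
REPO = "open-llm-leaderboard-old/details_Felladrin__Llama-68M-Chat-v1"


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption: config names replace '|', ':' and '-' with '_'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def run_split_name(timestamp: str) -> str:
    """Map a run timestamp to a split name.

    Assumption: split names replace '-' and ':' with '_'.
    """
    return re.sub(r"[-:]", "_", timestamp)


def load_details(config: str, split: str = "train", repo: str = REPO):
    """Load one configuration; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(repo, config, split=split)


if __name__ == "__main__":
    print(config_name("harness|winogrande|5"))
    print(run_split_name("2024-01-14T17:25:12.605913"))
    # latest = load_details("harness_winogrande_5")  # "train" = latest run
    # totals = load_details("results")               # aggregated metrics
```

Keeping the name helpers separate from the network call makes it possible to check the naming assumptions without downloading anything.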
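Background and Challenges
Background Overview
With large language models developing rapidly, evaluating model performance systematically and fairly has become a shared concern of academia and industry. The Open LLM Leaderboard was created by the HuggingFace team in 2023 to provide a standardized evaluation platform for open-source language models. Its core research question is how to build an evaluation framework that covers multiple dimensions of ability, measuring models on reasoning, commonsense, mathematics, and knowledge understanding. This dataset is a by-product of the Leaderboard: it records the detailed evaluation results of the Llama-68M-Chat-v1 model, submitted by Felladrin et al., on 63 tasks, covering classic benchmarks such as ARC-Challenge, HellaSwag, GSM8K, and MMLU. Although the model has only 68M parameters, its results across these tasks reveal both the potential and the limits of small models in specific scenarios, offering a useful reference for later research on lightweight models. The dataset's influence lies in its transparent evaluation process and reproducibility, which have encouraged the open-source community to explore the boundaries of model capability.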
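Current Challenges
The core challenge this dataset reflects is the performance bottleneck of small-parameter models on complex reasoning tasks. Llama-68M-Chat-v1 scores zero accuracy on the GSM8K mathematical reasoning task and only 23.3% on ARC-Challenge science reasoning, revealing significant weaknesses in scenarios that require multi-step logical deduction and the integration of domain knowledge. On the construction side, the dataset has to consolidate heterogeneous results from different benchmarks, which raises technical difficulties around inconsistent task formats and differing scoring criteria. For example, MMLU contains 57 subject subtasks, and the multiple-choice accuracy and normalized accuracy of each must be parsed separately. In addition, automating the evaluation pipeline requires temporal consistency in data collection: results of the same model at different timestamps may fluctuate with the environment configuration or random seed, which complicates attributing a result to its cause. Together, these challenges point to the need for more robust evaluation protocols that accurately reflect a model's real abilities and support efficient optimization of small models.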
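To put the 23.3% ARC-Challenge figure in context, it can be compared with a random-guess baseline using the reported standard error. The sketch below is illustrative only: the helper `z_vs_chance` is hypothetical, and the 25% baseline assumes four answer options per question, which may not hold for every ARC item. The accuracy and standard error are the `acc_norm` and `acc_norm_stderr` values reported for `harness|arc:challenge|25`.

```python
def z_vs_chance(acc: float, stderr: float, chance: float) -> float:
    """Return how many standard errors `acc` lies from the chance baseline."""
    if stderr <= 0:
        raise ValueError("stderr must be positive")
    return (acc - chance) / stderr


# Values reported for "harness|arc:challenge|25" in the results above.
ARC_ACC_NORM = 0.23293515358361774
ARC_ACC_NORM_STDERR = 0.012352507042617405

# Assumption: four answer options per question, so guessing gives 25%.
CHANCE_FOUR_WAY = 0.25

z = z_vs_chance(ARC_ACC_NORM, ARC_ACC_NORM_STDERR, CHANCE_FOUR_WAY)

# Conventional two-sided 95% threshold for a normal approximation.
indistinguishable_from_chance = abs(z) < 1.96

if __name__ == "__main__":
    print(f"z = {z:.2f}, within noise of chance: {indistinguishable_from_chance}")
```

Under the four-option assumption, a value of |z| below about 1.96 means the score cannot be separated from guessing at the conventional 95% level.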
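Common Scenarios
Classic Use Cases
In large language model evaluation, the open-llm-leaderboard-old/details_Felladrin__Llama-68M-Chat-v1 dataset serves as a repository of Open LLM Leaderboard evaluation results. Its classic use is to systematically record and reproduce a model's performance on a diverse set of benchmark tasks. The dataset contains 63 configurations corresponding to tasks such as ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K, and each configuration stores the model's detailed scores and standard errors on that task. By loading the data for a specific task, for example the 'harness_winogrande_5' configuration, researchers can obtain fine-grained results on subtasks such as commonsense reasoning or mathematical reasoning, and use them for cross-model comparison or for tracking how a model's performance evolves over time.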
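As one example of such a comparison, per-subject MMLU entries can be averaged into a single score. The sketch below is an illustration under stated assumptions: `mmlu_mean` is a hypothetical helper, the unweighted mean is a simple choice rather than a confirmed description of how the leaderboard aggregates, and the sample dictionary holds only three of the subject entries from the results listed earlier on this page.

```python
from statistics import mean

# A small excerpt of the results listed on this page; the full results
# contain many more "harness|hendrycksTest-*" entries.
SAMPLE_RESULTS = {
    "harness|hellaswag|10": {"acc": 0.27693686516630156, "acc_norm": 0.28271260705038836},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.27, "acc_norm": 0.27},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.22962962962962963, "acc_norm": 0.22962962962962963},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.17763157894736842, "acc_norm": 0.17763157894736842},
}

# Key prefix used for MMLU subjects in the results shown on this page.
MMLU_PREFIX = "harness|hendrycksTest-"


def mmlu_mean(results: dict, metric: str = "acc") -> float:
    """Unweighted mean of `metric` over all MMLU subject entries in `results`."""
    scores = [
        entry[metric]
        for key, entry in results.items()
        if key.startswith(MMLU_PREFIX)
    ]
    if not scores:
        raise ValueError("no MMLU entries found")
    return mean(scores)


if __name__ == "__main__":
    print(f"mean MMLU acc over the excerpt: {mmlu_mean(SAMPLE_RESULTS):.4f}")
```

Filtering by key prefix keeps non-MMLU tasks such as HellaSwag out of the average; a mean weighted by the number of questions per subject would be a reasonable alternative if those counts are available.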
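Practical Applications
In practice, the dataset provides a quantitative basis for model selection and deployment decisions. Engineers or product managers can use the aggregated results to quickly compare models on reasoning, commonsense understanding, and mathematical problem solving, and choose the model best suited to a given business scenario. For example, a dialogue system that needs accurate commonsense reasoning can refer to the Winogrande and HellaSwag scores, while for knowledge question answering that prioritizes factual accuracy, the TruthfulQA mc1 and mc2 metrics become the key references. The timestamped split design also supports continuous monitoring across model iterations, which makes regression testing easier when a model is updated.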
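Derived and Related Work
The dataset has given rise to a body of work on model evaluation methodology and performance analysis. Based on its structured evaluation records, researchers have built comparison maps of model performance that reveal the capability boundaries of models of different sizes on specific tasks. For example, the zero score of Felladrin/Llama-68M-Chat-v1 on GSM8K has prompted closer analysis of the mathematical reasoning bottleneck in small models. The dataset's per-subject MMLU results have also been used to train knowledge graph completion models that predict a model's performance on unseen subjects. In addition, the leaderboard ranking mechanism built around this dataset has sparked discussion about the fairness and robustness of evaluation metrics, promoting the use of more robust metrics such as normalized accuracy.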
The content above was collected and summarized by 遇见数据集.