open-llm-leaderboard-old/details_vicgalle__RoleBeagle-11B

Source: Hugging Face. Updated 2024-03-01. Indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_vicgalle__RoleBeagle-11B
Resource description:
--- pretty_name: Evaluation run of vicgalle/RoleBeagle-11B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [vicgalle/RoleBeagle-11B](https://huggingface.co/vicgalle/RoleBeagle-11B) on the\ \ [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_vicgalle__RoleBeagle-11B\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-03-01T02:00:58.489203](https://huggingface.co/datasets/open-llm-leaderboard/details_vicgalle__RoleBeagle-11B/blob/main/results_2024-03-01T02-00-58.489203.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6679514377558416,\n\ \ \"acc_stderr\": 0.03170912207730401,\n \"acc_norm\": 0.6685063830324032,\n\ \ \"acc_norm_stderr\": 0.03235896712705177,\n \"mc1\": 0.6181150550795593,\n\ \ \"mc1_stderr\": 0.017008101939163495,\n \"mc2\": 0.7792411418196358,\n\ \ \"mc2_stderr\": 0.013837551363048158\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.7030716723549488,\n \"acc_stderr\": 0.013352025976725225,\n\ \ \"acc_norm\": 0.7235494880546075,\n \"acc_norm_stderr\": 0.013069662474252425\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.7219677355108544,\n\ \ \"acc_stderr\": 0.004471137333619627,\n \"acc_norm\": 0.8977295359490142,\n\ \ \"acc_norm_stderr\": 0.0030238440318883764\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.37,\n \"acc_stderr\": 0.048523658709391,\n \ \ \"acc_norm\": 0.37,\n \"acc_norm_stderr\": 0.048523658709391\n },\n\ \ \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.562962962962963,\n\ \ \"acc_stderr\": 0.04284958639753401,\n \"acc_norm\": 0.562962962962963,\n\ \ \"acc_norm_stderr\": 0.04284958639753401\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.75,\n \"acc_stderr\": 0.03523807393012047,\n \ \ \"acc_norm\": 0.75,\n \"acc_norm_stderr\": 0.03523807393012047\n \ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.69,\n\ \ \"acc_stderr\": 0.04648231987117316,\n \"acc_norm\": 0.69,\n \ \ \"acc_norm_stderr\": 0.04648231987117316\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.6830188679245283,\n \"acc_stderr\": 0.028637235639800893,\n\ \ \"acc_norm\": 0.6830188679245283,\n \"acc_norm_stderr\": 0.028637235639800893\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7916666666666666,\n\ \ \"acc_stderr\": 0.03396116205845335,\n \"acc_norm\": 0.7916666666666666,\n\ \ \"acc_norm_stderr\": 0.03396116205845335\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.44,\n \"acc_stderr\": 0.04988876515698589,\n \ \ \"acc_norm\": 0.44,\n \"acc_norm_stderr\": 0.04988876515698589\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.53,\n \"acc_stderr\": 0.050161355804659205,\n \"acc_norm\": 0.53,\n\ \ \"acc_norm_stderr\": 0.050161355804659205\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.33,\n \"acc_stderr\": 0.047258156262526045,\n \ \ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.047258156262526045\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.653179190751445,\n\ \ \"acc_stderr\": 0.036291466701596636,\n \"acc_norm\": 0.653179190751445,\n\ \ \"acc_norm_stderr\": 0.036291466701596636\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.38235294117647056,\n \"acc_stderr\": 0.04835503696107223,\n\ \ \"acc_norm\": 0.38235294117647056,\n \"acc_norm_stderr\": 0.04835503696107223\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.75,\n \"acc_stderr\": 0.04351941398892446,\n \"acc_norm\": 0.75,\n\ \ \"acc_norm_stderr\": 0.04351941398892446\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.6340425531914894,\n \"acc_stderr\": 0.0314895582974553,\n\ \ \"acc_norm\": 0.6340425531914894,\n \"acc_norm_stderr\": 0.0314895582974553\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.543859649122807,\n\ \ \"acc_stderr\": 0.046854730419077895,\n \"acc_norm\": 0.543859649122807,\n\ \ \"acc_norm_stderr\": 0.046854730419077895\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5586206896551724,\n \"acc_stderr\": 0.04137931034482757,\n\ \ \"acc_norm\": 0.5586206896551724,\n \"acc_norm_stderr\": 0.04137931034482757\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.4947089947089947,\n \"acc_stderr\": 0.02574986828855657,\n \"\ acc_norm\": 0.4947089947089947,\n \"acc_norm_stderr\": 
0.02574986828855657\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.3968253968253968,\n\ \ \"acc_stderr\": 0.04375888492727061,\n \"acc_norm\": 0.3968253968253968,\n\ \ \"acc_norm_stderr\": 0.04375888492727061\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.38,\n \"acc_stderr\": 0.04878317312145632,\n \ \ \"acc_norm\": 0.38,\n \"acc_norm_stderr\": 0.04878317312145632\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.8290322580645161,\n\ \ \"acc_stderr\": 0.02141724293632158,\n \"acc_norm\": 0.8290322580645161,\n\ \ \"acc_norm_stderr\": 0.02141724293632158\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.4876847290640394,\n \"acc_stderr\": 0.035169204442208966,\n\ \ \"acc_norm\": 0.4876847290640394,\n \"acc_norm_stderr\": 0.035169204442208966\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.69,\n \"acc_stderr\": 0.04648231987117316,\n \"acc_norm\"\ : 0.69,\n \"acc_norm_stderr\": 0.04648231987117316\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7818181818181819,\n \"acc_stderr\": 0.03225078108306289,\n\ \ \"acc_norm\": 0.7818181818181819,\n \"acc_norm_stderr\": 0.03225078108306289\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.8282828282828283,\n \"acc_stderr\": 0.02686971618742991,\n \"\ acc_norm\": 0.8282828282828283,\n \"acc_norm_stderr\": 0.02686971618742991\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.917098445595855,\n \"acc_stderr\": 0.01989934131572178,\n\ \ \"acc_norm\": 0.917098445595855,\n \"acc_norm_stderr\": 0.01989934131572178\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6692307692307692,\n \"acc_stderr\": 0.023854795680971135,\n\ \ \"acc_norm\": 0.6692307692307692,\n \"acc_norm_stderr\": 0.023854795680971135\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n 
\"\ acc\": 0.34814814814814815,\n \"acc_stderr\": 0.029045600290616258,\n \ \ \"acc_norm\": 0.34814814814814815,\n \"acc_norm_stderr\": 0.029045600290616258\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.726890756302521,\n \"acc_stderr\": 0.028942004040998167,\n \ \ \"acc_norm\": 0.726890756302521,\n \"acc_norm_stderr\": 0.028942004040998167\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.4105960264900662,\n \"acc_stderr\": 0.04016689594849928,\n \"\ acc_norm\": 0.4105960264900662,\n \"acc_norm_stderr\": 0.04016689594849928\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8587155963302753,\n \"acc_stderr\": 0.014933868987028084,\n \"\ acc_norm\": 0.8587155963302753,\n \"acc_norm_stderr\": 0.014933868987028084\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5740740740740741,\n \"acc_stderr\": 0.033723432716530624,\n \"\ acc_norm\": 0.5740740740740741,\n \"acc_norm_stderr\": 0.033723432716530624\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8333333333333334,\n \"acc_stderr\": 0.02615686752393104,\n \"\ acc_norm\": 0.8333333333333334,\n \"acc_norm_stderr\": 0.02615686752393104\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.8481012658227848,\n \"acc_stderr\": 0.02336387809663245,\n \ \ \"acc_norm\": 0.8481012658227848,\n \"acc_norm_stderr\": 0.02336387809663245\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.726457399103139,\n\ \ \"acc_stderr\": 0.029918586707798827,\n \"acc_norm\": 0.726457399103139,\n\ \ \"acc_norm_stderr\": 0.029918586707798827\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7480916030534351,\n \"acc_stderr\": 0.03807387116306086,\n\ \ \"acc_norm\": 0.7480916030534351,\n \"acc_norm_stderr\": 0.03807387116306086\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.8099173553719008,\n \"acc_stderr\": 
0.03581796951709282,\n \"\ acc_norm\": 0.8099173553719008,\n \"acc_norm_stderr\": 0.03581796951709282\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.8148148148148148,\n\ \ \"acc_stderr\": 0.03755265865037182,\n \"acc_norm\": 0.8148148148148148,\n\ \ \"acc_norm_stderr\": 0.03755265865037182\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.754601226993865,\n \"acc_stderr\": 0.03380939813943354,\n\ \ \"acc_norm\": 0.754601226993865,\n \"acc_norm_stderr\": 0.03380939813943354\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5,\n\ \ \"acc_stderr\": 0.04745789978762494,\n \"acc_norm\": 0.5,\n \ \ \"acc_norm_stderr\": 0.04745789978762494\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.8155339805825242,\n \"acc_stderr\": 0.03840423627288276,\n\ \ \"acc_norm\": 0.8155339805825242,\n \"acc_norm_stderr\": 0.03840423627288276\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8803418803418803,\n\ \ \"acc_stderr\": 0.021262719400406957,\n \"acc_norm\": 0.8803418803418803,\n\ \ \"acc_norm_stderr\": 0.021262719400406957\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.74,\n \"acc_stderr\": 0.04408440022768078,\n \ \ \"acc_norm\": 0.74,\n \"acc_norm_stderr\": 0.04408440022768078\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8212005108556832,\n\ \ \"acc_stderr\": 0.013702643715368976,\n \"acc_norm\": 0.8212005108556832,\n\ \ \"acc_norm_stderr\": 0.013702643715368976\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7254335260115607,\n \"acc_stderr\": 0.02402774515526501,\n\ \ \"acc_norm\": 0.7254335260115607,\n \"acc_norm_stderr\": 0.02402774515526501\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.4569832402234637,\n\ \ \"acc_stderr\": 0.01666049858050917,\n \"acc_norm\": 0.4569832402234637,\n\ \ \"acc_norm_stderr\": 0.01666049858050917\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 
0.738562091503268,\n \"acc_stderr\": 0.025160998214292456,\n\ \ \"acc_norm\": 0.738562091503268,\n \"acc_norm_stderr\": 0.025160998214292456\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7556270096463023,\n\ \ \"acc_stderr\": 0.02440616209466889,\n \"acc_norm\": 0.7556270096463023,\n\ \ \"acc_norm_stderr\": 0.02440616209466889\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7685185185185185,\n \"acc_stderr\": 0.02346842983245114,\n\ \ \"acc_norm\": 0.7685185185185185,\n \"acc_norm_stderr\": 0.02346842983245114\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.5319148936170213,\n \"acc_stderr\": 0.029766675075873866,\n \ \ \"acc_norm\": 0.5319148936170213,\n \"acc_norm_stderr\": 0.029766675075873866\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.5195567144719687,\n\ \ \"acc_stderr\": 0.012760464028289295,\n \"acc_norm\": 0.5195567144719687,\n\ \ \"acc_norm_stderr\": 0.012760464028289295\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.7463235294117647,\n \"acc_stderr\": 0.026431329870789496,\n\ \ \"acc_norm\": 0.7463235294117647,\n \"acc_norm_stderr\": 0.026431329870789496\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.7173202614379085,\n \"acc_stderr\": 0.01821726955205344,\n \ \ \"acc_norm\": 0.7173202614379085,\n \"acc_norm_stderr\": 0.01821726955205344\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.7,\n\ \ \"acc_stderr\": 0.04389311454644287,\n \"acc_norm\": 0.7,\n \ \ \"acc_norm_stderr\": 0.04389311454644287\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7510204081632653,\n \"acc_stderr\": 0.027682979522960238,\n\ \ \"acc_norm\": 0.7510204081632653,\n \"acc_norm_stderr\": 0.027682979522960238\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8407960199004975,\n\ \ \"acc_stderr\": 0.02587064676616913,\n \"acc_norm\": 0.8407960199004975,\n\ \ 
\"acc_norm_stderr\": 0.02587064676616913\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.87,\n \"acc_stderr\": 0.033799766898963086,\n \ \ \"acc_norm\": 0.87,\n \"acc_norm_stderr\": 0.033799766898963086\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5421686746987951,\n\ \ \"acc_stderr\": 0.0387862677100236,\n \"acc_norm\": 0.5421686746987951,\n\ \ \"acc_norm_stderr\": 0.0387862677100236\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8011695906432749,\n \"acc_stderr\": 0.030611116557432528,\n\ \ \"acc_norm\": 0.8011695906432749,\n \"acc_norm_stderr\": 0.030611116557432528\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.6181150550795593,\n\ \ \"mc1_stderr\": 0.017008101939163495,\n \"mc2\": 0.7792411418196358,\n\ \ \"mc2_stderr\": 0.013837551363048158\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.840568271507498,\n \"acc_stderr\": 0.010288617479454764\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6588324488248674,\n \ \ \"acc_stderr\": 0.013059111935831503\n }\n}\n```" repo_url: https://huggingface.co/vicgalle/RoleBeagle-11B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|arc:challenge|25_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-03-01T02-00-58.489203.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|gsm8k|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hellaswag|10_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-03-01T02-00-58.489203.parquet' - 
config_name: harness_hendrycksTest_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-01T02-00-58.489203.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-01T02-00-58.489203.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-01T02-00-58.489203.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-01T02-00-58.489203.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-01T02-00-58.489203.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-01T02-00-58.489203.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-01T02-00-58.489203.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-management|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-marketing|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-virology|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-01T02-00-58.489203.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|truthfulqa:mc|0_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-03-01T02-00-58.489203.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_03_01T02_00_58.489203 path: - '**/details_harness|winogrande|5_2024-03-01T02-00-58.489203.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-03-01T02-00-58.489203.parquet' - config_name: results data_files: - split: 
2024_03_01T02_00_58.489203 path: - results_2024-03-01T02-00-58.489203.parquet - split: latest path: - results_2024-03-01T02-00-58.489203.parquet ---

# Dataset Card for Evaluation run of vicgalle/RoleBeagle-11B

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [vicgalle/RoleBeagle-11B](https://huggingface.co/vicgalle/RoleBeagle-11B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset("open-llm-leaderboard/details_vicgalle__RoleBeagle-11B",
	"harness_winogrande_5",
	split="train")
```

## Latest results

These are the [latest results from run 2024-03-01T02:00:58.489203](https://huggingface.co/datasets/open-llm-leaderboard/details_vicgalle__RoleBeagle-11B/blob/main/results_2024-03-01T02-00-58.489203.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6679514377558416, "acc_stderr": 0.03170912207730401, "acc_norm": 0.6685063830324032, "acc_norm_stderr": 0.03235896712705177, "mc1": 0.6181150550795593, "mc1_stderr": 0.017008101939163495, "mc2": 0.7792411418196358, "mc2_stderr": 0.013837551363048158 }, "harness|arc:challenge|25": { "acc": 0.7030716723549488, "acc_stderr": 0.013352025976725225, "acc_norm": 0.7235494880546075, "acc_norm_stderr": 0.013069662474252425 }, "harness|hellaswag|10": { "acc": 0.7219677355108544, "acc_stderr": 0.004471137333619627, "acc_norm": 0.8977295359490142, "acc_norm_stderr": 0.0030238440318883764 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.37, "acc_stderr": 0.048523658709391, "acc_norm": 0.37, "acc_norm_stderr": 0.048523658709391 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.562962962962963, "acc_stderr": 0.04284958639753401, "acc_norm": 0.562962962962963, "acc_norm_stderr": 0.04284958639753401 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.75, "acc_stderr": 0.03523807393012047, "acc_norm": 0.75, "acc_norm_stderr": 0.03523807393012047 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.69, "acc_stderr": 0.04648231987117316, "acc_norm": 0.69, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6830188679245283, "acc_stderr": 0.028637235639800893, "acc_norm": 0.6830188679245283, "acc_norm_stderr": 0.028637235639800893 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7916666666666666, "acc_stderr": 0.03396116205845335, "acc_norm": 0.7916666666666666, "acc_norm_stderr": 0.03396116205845335 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.44, "acc_stderr": 0.04988876515698589, "acc_norm": 0.44, "acc_norm_stderr": 0.04988876515698589 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.53, "acc_stderr": 0.050161355804659205, "acc_norm": 0.53, "acc_norm_stderr": 0.050161355804659205 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.33, "acc_stderr": 0.047258156262526045, "acc_norm": 0.33, "acc_norm_stderr": 0.047258156262526045 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.653179190751445, "acc_stderr": 0.036291466701596636, "acc_norm": 0.653179190751445, "acc_norm_stderr": 0.036291466701596636 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.38235294117647056, "acc_stderr": 0.04835503696107223, "acc_norm": 0.38235294117647056, "acc_norm_stderr": 0.04835503696107223 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.75, "acc_stderr": 0.04351941398892446, "acc_norm": 0.75, "acc_norm_stderr": 0.04351941398892446 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.6340425531914894, "acc_stderr": 0.0314895582974553, "acc_norm": 0.6340425531914894, "acc_norm_stderr": 0.0314895582974553 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.543859649122807, "acc_stderr": 0.046854730419077895, "acc_norm": 0.543859649122807, "acc_norm_stderr": 0.046854730419077895 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5586206896551724, "acc_stderr": 0.04137931034482757, "acc_norm": 0.5586206896551724, "acc_norm_stderr": 0.04137931034482757 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.4947089947089947, "acc_stderr": 0.02574986828855657, "acc_norm": 0.4947089947089947, "acc_norm_stderr": 0.02574986828855657 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.3968253968253968, "acc_stderr": 0.04375888492727061, "acc_norm": 0.3968253968253968, "acc_norm_stderr": 0.04375888492727061 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.38, "acc_stderr": 0.04878317312145632, "acc_norm": 0.38, "acc_norm_stderr": 0.04878317312145632 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.8290322580645161, "acc_stderr": 0.02141724293632158, "acc_norm": 0.8290322580645161, "acc_norm_stderr": 0.02141724293632158 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.4876847290640394, "acc_stderr": 0.035169204442208966, "acc_norm": 0.4876847290640394, "acc_norm_stderr": 0.035169204442208966 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.69, "acc_stderr": 0.04648231987117316, "acc_norm": 0.69, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7818181818181819, "acc_stderr": 0.03225078108306289, "acc_norm": 0.7818181818181819, "acc_norm_stderr": 0.03225078108306289 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.8282828282828283, "acc_stderr": 0.02686971618742991, "acc_norm": 0.8282828282828283, "acc_norm_stderr": 0.02686971618742991 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.917098445595855, "acc_stderr": 0.01989934131572178, "acc_norm": 0.917098445595855, "acc_norm_stderr": 0.01989934131572178 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6692307692307692, "acc_stderr": 0.023854795680971135, "acc_norm": 0.6692307692307692, "acc_norm_stderr": 0.023854795680971135 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.34814814814814815, "acc_stderr": 0.029045600290616258, "acc_norm": 0.34814814814814815, "acc_norm_stderr": 0.029045600290616258 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.726890756302521, "acc_stderr": 0.028942004040998167, "acc_norm": 0.726890756302521, "acc_norm_stderr": 0.028942004040998167 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.4105960264900662, "acc_stderr": 0.04016689594849928, "acc_norm": 0.4105960264900662, "acc_norm_stderr": 0.04016689594849928 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8587155963302753, "acc_stderr": 0.014933868987028084, "acc_norm": 0.8587155963302753, "acc_norm_stderr": 0.014933868987028084 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5740740740740741, "acc_stderr": 0.033723432716530624, "acc_norm": 0.5740740740740741, "acc_norm_stderr": 
0.033723432716530624 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8333333333333334, "acc_stderr": 0.02615686752393104, "acc_norm": 0.8333333333333334, "acc_norm_stderr": 0.02615686752393104 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.8481012658227848, "acc_stderr": 0.02336387809663245, "acc_norm": 0.8481012658227848, "acc_norm_stderr": 0.02336387809663245 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.726457399103139, "acc_stderr": 0.029918586707798827, "acc_norm": 0.726457399103139, "acc_norm_stderr": 0.029918586707798827 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7480916030534351, "acc_stderr": 0.03807387116306086, "acc_norm": 0.7480916030534351, "acc_norm_stderr": 0.03807387116306086 }, "harness|hendrycksTest-international_law|5": { "acc": 0.8099173553719008, "acc_stderr": 0.03581796951709282, "acc_norm": 0.8099173553719008, "acc_norm_stderr": 0.03581796951709282 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.8148148148148148, "acc_stderr": 0.03755265865037182, "acc_norm": 0.8148148148148148, "acc_norm_stderr": 0.03755265865037182 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.754601226993865, "acc_stderr": 0.03380939813943354, "acc_norm": 0.754601226993865, "acc_norm_stderr": 0.03380939813943354 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.5, "acc_stderr": 0.04745789978762494, "acc_norm": 0.5, "acc_norm_stderr": 0.04745789978762494 }, "harness|hendrycksTest-management|5": { "acc": 0.8155339805825242, "acc_stderr": 0.03840423627288276, "acc_norm": 0.8155339805825242, "acc_norm_stderr": 0.03840423627288276 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8803418803418803, "acc_stderr": 0.021262719400406957, "acc_norm": 0.8803418803418803, "acc_norm_stderr": 0.021262719400406957 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.74, "acc_stderr": 0.04408440022768078, "acc_norm": 0.74, "acc_norm_stderr": 0.04408440022768078 }, 
"harness|hendrycksTest-miscellaneous|5": { "acc": 0.8212005108556832, "acc_stderr": 0.013702643715368976, "acc_norm": 0.8212005108556832, "acc_norm_stderr": 0.013702643715368976 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7254335260115607, "acc_stderr": 0.02402774515526501, "acc_norm": 0.7254335260115607, "acc_norm_stderr": 0.02402774515526501 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.4569832402234637, "acc_stderr": 0.01666049858050917, "acc_norm": 0.4569832402234637, "acc_norm_stderr": 0.01666049858050917 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.738562091503268, "acc_stderr": 0.025160998214292456, "acc_norm": 0.738562091503268, "acc_norm_stderr": 0.025160998214292456 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7556270096463023, "acc_stderr": 0.02440616209466889, "acc_norm": 0.7556270096463023, "acc_norm_stderr": 0.02440616209466889 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7685185185185185, "acc_stderr": 0.02346842983245114, "acc_norm": 0.7685185185185185, "acc_norm_stderr": 0.02346842983245114 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.5319148936170213, "acc_stderr": 0.029766675075873866, "acc_norm": 0.5319148936170213, "acc_norm_stderr": 0.029766675075873866 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.5195567144719687, "acc_stderr": 0.012760464028289295, "acc_norm": 0.5195567144719687, "acc_norm_stderr": 0.012760464028289295 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.7463235294117647, "acc_stderr": 0.026431329870789496, "acc_norm": 0.7463235294117647, "acc_norm_stderr": 0.026431329870789496 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.7173202614379085, "acc_stderr": 0.01821726955205344, "acc_norm": 0.7173202614379085, "acc_norm_stderr": 0.01821726955205344 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.7, "acc_stderr": 0.04389311454644287, "acc_norm": 0.7, "acc_norm_stderr": 0.04389311454644287 }, 
"harness|hendrycksTest-security_studies|5": { "acc": 0.7510204081632653, "acc_stderr": 0.027682979522960238, "acc_norm": 0.7510204081632653, "acc_norm_stderr": 0.027682979522960238 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8407960199004975, "acc_stderr": 0.02587064676616913, "acc_norm": 0.8407960199004975, "acc_norm_stderr": 0.02587064676616913 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.87, "acc_stderr": 0.033799766898963086, "acc_norm": 0.87, "acc_norm_stderr": 0.033799766898963086 }, "harness|hendrycksTest-virology|5": { "acc": 0.5421686746987951, "acc_stderr": 0.0387862677100236, "acc_norm": 0.5421686746987951, "acc_norm_stderr": 0.0387862677100236 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8011695906432749, "acc_stderr": 0.030611116557432528, "acc_norm": 0.8011695906432749, "acc_norm_stderr": 0.030611116557432528 }, "harness|truthfulqa:mc|0": { "mc1": 0.6181150550795593, "mc1_stderr": 0.017008101939163495, "mc2": 0.7792411418196358, "mc2_stderr": 0.013837551363048158 }, "harness|winogrande|5": { "acc": 0.840568271507498, "acc_stderr": 0.010288617479454764 }, "harness|gsm8k|5": { "acc": 0.6588324488248674, "acc_stderr": 0.013059111935831503 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
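Provided by:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset summary

This dataset was created automatically during the evaluation of the model vicgalle/RoleBeagle-11B for the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a separate split in every configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional "results" configuration stores the aggregated results of the run (used to compute and display the aggregated metrics on the Open LLM Leaderboard).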
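The naming scheme described in the list above can be sketched in a few lines. The helper names `config_name` and `split_name` are hypothetical, and the substitution rules are an assumption inferred from the configuration list in this card (for example, the results key `harness|truthfulqa:mc|0` appears there as `harness_truthfulqa_mc_0`, and the run timestamp `2024-03-01T02:00:58.489203` appears as the split `2024_03_01T02_00_58.489203`). They are not taken from the leaderboard's own code.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key to a configuration name.

    Assumption: '|', ':' and '-' each become '_', which matches the
    configuration names listed in this card.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to a split name.

    Assumption: '-' and ':' become '_' while the fractional-second
    dot is kept, as in the split names listed in this card.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


if __name__ == "__main__":
    print(config_name("harness|hendrycksTest-college_physics|5"))
    print(split_name("2024-03-01T02:00:58.489203"))
```

A mapping like this may be handy when iterating over the keys of the aggregated results and loading the matching per-task details.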


Curated Summary
Dataset Overview
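Construction
This dataset is a by-product of the Open LLM Leaderboard, and its construction is automated and standardized. It was generated automatically during the evaluation run of the model vicgalle/RoleBeagle-11B and comprises 63 configurations, each corresponding to one evaluated task. The dataset was created from a single evaluation run. Each run's results are stored as a split named after the run's timestamp, and the "train" split always points to the latest results. An additional configuration named "results" stores the aggregated results of the run and is used to compute and display the model's aggregated metrics on the leaderboard.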
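The relationship between timestamp-named runs and the "latest" pointer can be sketched in a few lines. This is a minimal illustration, assuming run identifiers follow the `2024-03-01T02-00-58.489203` pattern seen in the results file name of the dataset card. The helper names `parse_run_timestamp` and `latest_run` and the earlier timestamp in the list are hypothetical and exist only for the example.

```python
from datetime import datetime

# Timestamp pattern as it appears in the results file name of the card,
# e.g. results_2024-03-01T02-00-58.489203.json (assumed stable across runs).
RUN_FORMAT = "%Y-%m-%dT%H-%M-%S.%f"


def parse_run_timestamp(run_id: str) -> datetime:
    """Turn a run identifier such as '2024-03-01T02-00-58.489203' into a datetime."""
    return datetime.strptime(run_id, RUN_FORMAT)


def latest_run(run_ids: list[str]) -> str:
    """Return the most recent run identifier, i.e. the run 'train' would mirror."""
    return max(run_ids, key=parse_run_timestamp)


runs = [
    "2024-02-20T10-15-00.000000",  # hypothetical earlier run, not part of this dataset
    "2024-03-01T02-00-58.489203",  # the single run recorded in this dataset
]
print(latest_run(runs))
```

Because this dataset contains only one run, the timestamped split and "train" would hold the same data; the selection step matters only for repositories with several runs.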
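Features
The dataset's main strength is its detail and structure as an evaluation record. It contains the model's results on standard benchmarks such as the ARC Challenge, HellaSwag, TruthfulQA, Winogrande, and GSM8K, and also covers the 57 subject areas of the MMLU benchmark, from abstract algebra, anatomy, and astronomy to world religions, giving a fine-grained view of performance across knowledge domains. Each task configuration records accuracy and its standard error (the aggregated results additionally report the TruthfulQA metrics `mc1` and `mc2`), and the data is stored in Parquet format for efficient access and processing. This multi-task, multi-metric layout provides a solid basis for analyzing the model's capability limits and knowledge gaps.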
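The standard errors reported next to each accuracy can be reproduced approximately from the accuracy and the number of questions. The sketch below assumes that the error is the sample standard deviation of per-question 0/1 scores divided by the square root of n, and that the MMLU abstract algebra test set has 100 questions. Both are assumptions, not statements from the dataset card, and the function names are illustrative.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) scores whose mean is `acc`.

    Uses the sample variance (n - 1 in the denominator); whether the
    evaluation harness computes it exactly this way is an assumption.
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% normal interval around an accuracy."""
    return acc - z * stderr, acc + z * stderr


# abstract_algebra: acc = 0.37 in the reported results; n = 100 is assumed.
se = accuracy_stderr(0.37, 100)
print(round(se, 6))

# Interval for ARC Challenge acc_norm, using the reported acc_norm_stderr.
low, high = confidence_interval(0.7235494880546075, 0.013069662474252425)
print(round(low, 4), round(high, 4))
```

Under these assumptions the computed value for abstract algebra comes out at about 0.0485, in line with the reported `acc_stderr`. The interval also shows why small subjects with around 100 questions should be compared with caution.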
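Usage
Using the dataset is straightforward. With the Hugging Face `datasets` library, the details for a given evaluation task can be loaded by calling `load_dataset` with the dataset name, a task configuration (for example `harness_winogrande_5`), and a split (usually `train`, which returns the latest results). The structure lets users query individual runs or specific subject areas, which supports cross-model comparison and tracking over time for model optimization, capability assessment, and related research.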
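A minimal loading sketch follows. The `load_dataset` call mirrors the snippet in the dataset card. The helper `harness_config_name` is hypothetical: it assumes that configuration names are obtained from result keys by replacing `|`, `:`, and `-` with underscores, a rule inferred from the pair `harness|winogrande|5` and `harness_winogrande_5`. The repository id is the one used in the card's own snippet; since this page lists the dataset under `open-llm-leaderboard-old`, the id may need adjusting.

```python
def harness_config_name(task_key: str) -> str:
    """Map a results key like 'harness|winogrande|5' to a configuration name
    like 'harness_winogrande_5' (naming rule inferred, not documented here)."""
    for sep in ("|", ":", "-"):
        task_key = task_key.replace(sep, "_")
    return task_key


def load_task_details(repo_id: str, task_key: str, split: str = "train"):
    """Load the per-sample details for one task; requires network access."""
    # Imported lazily so the naming helper above can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, harness_config_name(task_key), split=split)


print(harness_config_name("harness|winogrande|5"))

# Example call (downloads data, so it is left commented out):
# details = load_task_details(
#     "open-llm-leaderboard/details_vicgalle__RoleBeagle-11B",
#     "harness|winogrande|5",
# )
```

Keeping the name mapping separate from the download makes it easy to check a configuration name before any data is fetched.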
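Background and Challenges
Background
As large language models (LLMs) developed rapidly, the Hugging Face team launched the Open LLM Leaderboard in 2023 to give the community a transparent, standardized platform for evaluating model performance. The platform combines several established benchmarks, including ARC, HellaSwag, MMLU, and TruthfulQA, to measure knowledge reasoning, commonsense understanding, and performance on specialized domain tasks. The dataset "open-llm-leaderboard-old/details_vicgalle__RoleBeagle-11B" is a product of this evaluation system: it records the detailed results of the RoleBeagle-11B evaluation run on the leaderboard dated 1 March 2024. It reflects the open-source community's interest in reproducibility and fair comparison, and it gives later research material for performance analysis and for improving LLM evaluation methods.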
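Current Challenges
The central challenge this dataset addresses is how to evaluate the many capabilities of large language models comprehensively and fairly. Traditional evaluation is often limited to a single task or domain and does not reflect how a model behaves in complex, open-ended settings. The Open LLM Leaderboard combines the MMLU test, which spans 57 subjects across science, the humanities, ethics, and other areas, with tasks such as commonsense reasoning (HellaSwag) and math problem solving (GSM8K) to form a cross-domain evaluation suite of varying difficulty. Building the dataset also poses technical challenges: evaluation protocols must be consistent across benchmarks, the large volume of results produced by distributed computation has to be processed, and an automated pipeline is needed to merge multiple evaluation runs while keeping data versions traceable and results reproducible.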
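Common Scenarios
Typical Use Cases
This dataset, produced by an Open LLM Leaderboard evaluation run, is typically used to give researchers detailed performance data for vicgalle/RoleBeagle-11B across a range of benchmarks. With 63 task configurations covering the ARC Challenge, HellaSwag, MMLU, TruthfulQA, and others, it measures the model's commonsense reasoning, knowledge question answering, math problem solving, and truthfulness, and it provides the data needed for cross-model comparison and performance analysis.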
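As a sketch of such an analysis, the example below averages the MMLU subtask accuracies and lists the weakest subjects. It uses a small subset of the values reported earlier on this page. The function names are hypothetical, and the unweighted mean over subjects is an assumption about how an MMLU summary score might be formed, not a statement of the leaderboard's method.

```python
from statistics import mean

# A few entries copied from the reported results (a small subset, for illustration).
results = {
    "harness|arc:challenge|25": {"acc_norm": 0.7235494880546075},
    "harness|hellaswag|10": {"acc_norm": 0.8977295359490142},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.37},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.562962962962963},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.75},
}


def mmlu_average(results: dict) -> float:
    """Unweighted mean accuracy over the MMLU ('hendrycksTest') subtasks present."""
    scores = [m["acc"] for task, m in results.items() if "hendrycksTest" in task]
    return mean(scores)


def weakest_subjects(results: dict, k: int = 2) -> list[str]:
    """Keys of the k MMLU subjects with the lowest accuracy, lowest first."""
    mmlu = {t: m["acc"] for t, m in results.items() if "hendrycksTest" in t}
    return sorted(mmlu, key=mmlu.get)[:k]


print(round(mmlu_average(results), 4))
print(weakest_subjects(results))
```

Running the same two functions on the results of several models gives a simple side-by-side comparison of overall MMLU level and of the subjects where each model is weakest.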
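Derived Work
Work derived from this dataset mainly concerns improving evaluation frameworks and attributing model capabilities. For example, later studies building on the Open LLM Leaderboard evaluation system have proposed supplementary benchmarks for robustness, fairness, and efficiency. The dataset is also often used in research on model merging, knowledge distillation, and targeted fine-tuning, where differences in per-task performance guide the choice of improvement strategy. Together, this work has helped make the measurement of model capabilities more complete and more detailed.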
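Recent Research
Current Research Directions
The Open LLM Leaderboard has become a key benchmark for measuring LLM performance. This dataset records the detailed evaluation results of vicgalle/RoleBeagle-11B on multiple tasks, covering commonsense reasoning, specialized subject knowledge, and mathematical ability. Current research focuses on using such fine-grained evaluation data to analyze where a model's abilities end in specific domains (for example formal logic and professional medicine) and to study the relationship between model scale and knowledge generalization. With the rise of multimodal and reasoning-enhancement techniques, this evaluation data provides empirical evidence for improving model architectures and designing targeted training strategies, with the aim of making models more reliable and adaptable in complex scenarios.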
The content above was collected and summarized by 遇见数据集.