open-llm-leaderboard-old/details_hywu__Camelidae-8x13B

Source: Hugging Face. Updated 2024-01-10. Indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_hywu__Camelidae-8x13B
Resource description:
--- pretty_name: Evaluation run of hywu/Camelidae-8x13B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [hywu/Camelidae-8x13B](https://huggingface.co/hywu/Camelidae-8x13B) on the [Open\ \ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can, for instance, do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_hywu__Camelidae-8x13B\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-10T19:10:01.237565](https://huggingface.co/datasets/open-llm-leaderboard/details_hywu__Camelidae-8x13B/blob/main/results_2024-01-10T19-10-01.237565.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.5726420089337894,\n\ \ \"acc_stderr\": 0.03341034561202174,\n \"acc_norm\": 0.5771409715156051,\n\ \ \"acc_norm_stderr\": 0.03409998451960007,\n \"mc1\": 0.3084455324357405,\n\ \ \"mc1_stderr\": 0.01616803938315687,\n \"mc2\": 0.433720225618646,\n\ \ \"mc2_stderr\": 0.014788704504997708\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.5733788395904437,\n \"acc_stderr\": 0.014453185592920293,\n\ \ \"acc_norm\": 0.6117747440273038,\n \"acc_norm_stderr\": 0.014241614207414042\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6263692491535551,\n\ \ \"acc_stderr\": 0.004827786289074844,\n \"acc_norm\": 0.8273252340171281,\n\ \ \"acc_norm_stderr\": 0.003771934042799158\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.046882617226215034,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.046882617226215034\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.4740740740740741,\n\ \ \"acc_stderr\": 0.04313531696750574,\n \"acc_norm\": 0.4740740740740741,\n\ \ \"acc_norm_stderr\": 0.04313531696750574\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.5723684210526315,\n \"acc_stderr\": 0.04026097083296564,\n\ \ \"acc_norm\": 0.5723684210526315,\n \"acc_norm_stderr\": 0.04026097083296564\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.57,\n\ \ \"acc_stderr\": 0.049756985195624284,\n \"acc_norm\": 0.57,\n \ \ \"acc_norm_stderr\": 0.049756985195624284\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.5811320754716981,\n \"acc_stderr\": 0.030365050829115208,\n\ \ \"acc_norm\": 0.5811320754716981,\n \"acc_norm_stderr\": 0.030365050829115208\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.6041666666666666,\n\ \ \"acc_stderr\": 0.04089465449325582,\n \"acc_norm\": 0.6041666666666666,\n\ \ \"acc_norm_stderr\": 0.04089465449325582\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.39,\n \"acc_stderr\": 0.04902071300001975,\n \ \ \"acc_norm\": 0.39,\n \"acc_norm_stderr\": 0.04902071300001975\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.44,\n \"acc_stderr\": 0.04988876515698589,\n \"acc_norm\": 0.44,\n\ \ \"acc_norm_stderr\": 0.04988876515698589\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.39,\n \"acc_stderr\": 0.04902071300001974,\n \ \ \"acc_norm\": 0.39,\n \"acc_norm_stderr\": 0.04902071300001974\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.5549132947976878,\n\ \ \"acc_stderr\": 0.03789401760283647,\n \"acc_norm\": 0.5549132947976878,\n\ \ \"acc_norm_stderr\": 0.03789401760283647\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.2549019607843137,\n \"acc_stderr\": 0.04336432707993179,\n\ \ \"acc_norm\": 0.2549019607843137,\n \"acc_norm_stderr\": 0.04336432707993179\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \"acc_norm\": 0.7,\n\ \ \"acc_norm_stderr\": 0.046056618647183814\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.4851063829787234,\n \"acc_stderr\": 0.032671518489247764,\n\ \ \"acc_norm\": 0.4851063829787234,\n \"acc_norm_stderr\": 0.032671518489247764\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.32456140350877194,\n\ \ \"acc_stderr\": 0.04404556157374767,\n \"acc_norm\": 0.32456140350877194,\n\ \ \"acc_norm_stderr\": 0.04404556157374767\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5172413793103449,\n \"acc_stderr\": 0.04164188720169375,\n\ \ \"acc_norm\": 0.5172413793103449,\n \"acc_norm_stderr\": 0.04164188720169375\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.3253968253968254,\n \"acc_stderr\": 0.024130158299762602,\n \"\ acc_norm\": 0.3253968253968254,\n 
\"acc_norm_stderr\": 0.024130158299762602\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.38095238095238093,\n\ \ \"acc_stderr\": 0.04343525428949097,\n \"acc_norm\": 0.38095238095238093,\n\ \ \"acc_norm_stderr\": 0.04343525428949097\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.4,\n \"acc_stderr\": 0.04923659639173309,\n \ \ \"acc_norm\": 0.4,\n \"acc_norm_stderr\": 0.04923659639173309\n },\n\ \ \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.6612903225806451,\n\ \ \"acc_stderr\": 0.026923446059302844,\n \"acc_norm\": 0.6612903225806451,\n\ \ \"acc_norm_stderr\": 0.026923446059302844\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.4236453201970443,\n \"acc_stderr\": 0.034767257476490364,\n\ \ \"acc_norm\": 0.4236453201970443,\n \"acc_norm_stderr\": 0.034767257476490364\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.54,\n \"acc_stderr\": 0.05009082659620332,\n \"acc_norm\"\ : 0.54,\n \"acc_norm_stderr\": 0.05009082659620332\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.703030303030303,\n \"acc_stderr\": 0.0356796977226805,\n\ \ \"acc_norm\": 0.703030303030303,\n \"acc_norm_stderr\": 0.0356796977226805\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7272727272727273,\n \"acc_stderr\": 0.03173071239071724,\n \"\ acc_norm\": 0.7272727272727273,\n \"acc_norm_stderr\": 0.03173071239071724\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8134715025906736,\n \"acc_stderr\": 0.02811209121011748,\n\ \ \"acc_norm\": 0.8134715025906736,\n \"acc_norm_stderr\": 0.02811209121011748\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.541025641025641,\n \"acc_stderr\": 0.025265525491284295,\n \ \ \"acc_norm\": 0.541025641025641,\n \"acc_norm_stderr\": 0.025265525491284295\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.3148148148148148,\n \"acc_stderr\": 0.02831753349606647,\n \ \ \"acc_norm\": 0.3148148148148148,\n \"acc_norm_stderr\": 0.02831753349606647\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.5798319327731093,\n \"acc_stderr\": 0.03206183783236152,\n \ \ \"acc_norm\": 0.5798319327731093,\n \"acc_norm_stderr\": 0.03206183783236152\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.2913907284768212,\n \"acc_stderr\": 0.03710185726119995,\n \"\ acc_norm\": 0.2913907284768212,\n \"acc_norm_stderr\": 0.03710185726119995\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.7577981651376147,\n \"acc_stderr\": 0.018368176306598618,\n \"\ acc_norm\": 0.7577981651376147,\n \"acc_norm_stderr\": 0.018368176306598618\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.4074074074074074,\n \"acc_stderr\": 0.03350991604696042,\n \"\ acc_norm\": 0.4074074074074074,\n \"acc_norm_stderr\": 0.03350991604696042\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.7745098039215687,\n \"acc_stderr\": 0.029331162294251735,\n \"\ acc_norm\": 0.7745098039215687,\n \"acc_norm_stderr\": 0.029331162294251735\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7848101265822784,\n \"acc_stderr\": 0.026750826994676173,\n \ \ \"acc_norm\": 0.7848101265822784,\n \"acc_norm_stderr\": 0.026750826994676173\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6771300448430493,\n\ \ \"acc_stderr\": 0.03138147637575499,\n \"acc_norm\": 0.6771300448430493,\n\ \ \"acc_norm_stderr\": 0.03138147637575499\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.648854961832061,\n \"acc_stderr\": 0.04186445163013751,\n\ \ \"acc_norm\": 0.648854961832061,\n \"acc_norm_stderr\": 0.04186445163013751\n\ \ },\n \"harness|hendrycksTest-international_law|5\": 
{\n \"acc\":\ \ 0.7355371900826446,\n \"acc_stderr\": 0.04026187527591207,\n \"\ acc_norm\": 0.7355371900826446,\n \"acc_norm_stderr\": 0.04026187527591207\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7314814814814815,\n\ \ \"acc_stderr\": 0.042844679680521934,\n \"acc_norm\": 0.7314814814814815,\n\ \ \"acc_norm_stderr\": 0.042844679680521934\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.6993865030674846,\n \"acc_stderr\": 0.03602511318806771,\n\ \ \"acc_norm\": 0.6993865030674846,\n \"acc_norm_stderr\": 0.03602511318806771\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.38392857142857145,\n\ \ \"acc_stderr\": 0.04616143075028547,\n \"acc_norm\": 0.38392857142857145,\n\ \ \"acc_norm_stderr\": 0.04616143075028547\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7669902912621359,\n \"acc_stderr\": 0.041858325989283136,\n\ \ \"acc_norm\": 0.7669902912621359,\n \"acc_norm_stderr\": 0.041858325989283136\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8461538461538461,\n\ \ \"acc_stderr\": 0.023636873317489288,\n \"acc_norm\": 0.8461538461538461,\n\ \ \"acc_norm_stderr\": 0.023636873317489288\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.65,\n \"acc_stderr\": 0.0479372485441102,\n \ \ \"acc_norm\": 0.65,\n \"acc_norm_stderr\": 0.0479372485441102\n },\n\ \ \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.7726692209450831,\n\ \ \"acc_stderr\": 0.014987270640946009,\n \"acc_norm\": 0.7726692209450831,\n\ \ \"acc_norm_stderr\": 0.014987270640946009\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.630057803468208,\n \"acc_stderr\": 0.02599247202930639,\n\ \ \"acc_norm\": 0.630057803468208,\n \"acc_norm_stderr\": 0.02599247202930639\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.43910614525139663,\n\ \ \"acc_stderr\": 0.016598022120580428,\n \"acc_norm\": 0.43910614525139663,\n\ \ \"acc_norm_stderr\": 
0.016598022120580428\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.6339869281045751,\n \"acc_stderr\": 0.02758281141515961,\n\ \ \"acc_norm\": 0.6339869281045751,\n \"acc_norm_stderr\": 0.02758281141515961\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.6366559485530546,\n\ \ \"acc_stderr\": 0.027316847674192714,\n \"acc_norm\": 0.6366559485530546,\n\ \ \"acc_norm_stderr\": 0.027316847674192714\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.6388888888888888,\n \"acc_stderr\": 0.026725868809100793,\n\ \ \"acc_norm\": 0.6388888888888888,\n \"acc_norm_stderr\": 0.026725868809100793\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.40425531914893614,\n \"acc_stderr\": 0.029275532159704725,\n \ \ \"acc_norm\": 0.40425531914893614,\n \"acc_norm_stderr\": 0.029275532159704725\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.409387222946545,\n\ \ \"acc_stderr\": 0.012558780895570752,\n \"acc_norm\": 0.409387222946545,\n\ \ \"acc_norm_stderr\": 0.012558780895570752\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.5147058823529411,\n \"acc_stderr\": 0.03035969707904612,\n\ \ \"acc_norm\": 0.5147058823529411,\n \"acc_norm_stderr\": 0.03035969707904612\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.5686274509803921,\n \"acc_stderr\": 0.020036393768352638,\n \ \ \"acc_norm\": 0.5686274509803921,\n \"acc_norm_stderr\": 0.020036393768352638\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6636363636363637,\n\ \ \"acc_stderr\": 0.04525393596302506,\n \"acc_norm\": 0.6636363636363637,\n\ \ \"acc_norm_stderr\": 0.04525393596302506\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.6326530612244898,\n \"acc_stderr\": 0.030862144921087558,\n\ \ \"acc_norm\": 0.6326530612244898,\n \"acc_norm_stderr\": 0.030862144921087558\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 
0.7661691542288557,\n\ \ \"acc_stderr\": 0.029929415408348384,\n \"acc_norm\": 0.7661691542288557,\n\ \ \"acc_norm_stderr\": 0.029929415408348384\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.85,\n \"acc_stderr\": 0.0358870281282637,\n \ \ \"acc_norm\": 0.85,\n \"acc_norm_stderr\": 0.0358870281282637\n },\n\ \ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.4939759036144578,\n\ \ \"acc_stderr\": 0.03892212195333045,\n \"acc_norm\": 0.4939759036144578,\n\ \ \"acc_norm_stderr\": 0.03892212195333045\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8128654970760234,\n \"acc_stderr\": 0.02991312723236804,\n\ \ \"acc_norm\": 0.8128654970760234,\n \"acc_norm_stderr\": 0.02991312723236804\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3084455324357405,\n\ \ \"mc1_stderr\": 0.01616803938315687,\n \"mc2\": 0.433720225618646,\n\ \ \"mc2_stderr\": 0.014788704504997708\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7734806629834254,\n \"acc_stderr\": 0.011764149054698332\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.3457164518574678,\n \ \ \"acc_stderr\": 0.013100422990441583\n }\n}\n```" repo_url: https://huggingface.co/hywu/Camelidae-8x13B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|arc:challenge|25_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-10T19-10-01.237565.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|gsm8k|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hellaswag|10_2024-01-10T19-10-01.237565.parquet' 
- split: latest path: - '**/details_harness|hellaswag|10_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-10T19-10-01.237565.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-10T19-10-01.237565.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-10T19-10-01.237565.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-10T19-10-01.237565.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-10T19-10-01.237565.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-10T19-10-01.237565.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-10T19-10-01.237565.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-management|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-10T19-10-01.237565.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|truthfulqa:mc|0_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-10T19-10-01.237565.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_10T19_10_01.237565 path: - '**/details_harness|winogrande|5_2024-01-10T19-10-01.237565.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-10T19-10-01.237565.parquet' - config_name: results data_files: - split: 
2024_01_10T19_10_01.237565 path: - results_2024-01-10T19-10-01.237565.parquet - split: latest path: - results_2024-01-10T19-10-01.237565.parquet ---

# Dataset Card for Evaluation run of hywu/Camelidae-8x13B

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [hywu/Camelidae-8x13B](https://huggingface.co/hywu/Camelidae-8x13B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset was created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# The configs above define two splits: the run timestamp and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_hywu__Camelidae-8x13B",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-01-10T19:10:01.237565](https://huggingface.co/datasets/open-llm-leaderboard/details_hywu__Camelidae-8x13B/blob/main/results_2024-01-10T19-10-01.237565.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.5726420089337894, "acc_stderr": 0.03341034561202174, "acc_norm": 0.5771409715156051, "acc_norm_stderr": 0.03409998451960007, "mc1": 0.3084455324357405, "mc1_stderr": 0.01616803938315687, "mc2": 0.433720225618646, "mc2_stderr": 0.014788704504997708 }, "harness|arc:challenge|25": { "acc": 0.5733788395904437, "acc_stderr": 0.014453185592920293, "acc_norm": 0.6117747440273038, "acc_norm_stderr": 0.014241614207414042 }, "harness|hellaswag|10": { "acc": 0.6263692491535551, "acc_stderr": 0.004827786289074844, "acc_norm": 0.8273252340171281, "acc_norm_stderr": 0.003771934042799158 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.32, "acc_stderr": 0.046882617226215034, "acc_norm": 0.32, "acc_norm_stderr": 0.046882617226215034 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.4740740740740741, "acc_stderr": 0.04313531696750574, "acc_norm": 0.4740740740740741, "acc_norm_stderr": 0.04313531696750574 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.5723684210526315, "acc_stderr": 0.04026097083296564, "acc_norm": 0.5723684210526315, "acc_norm_stderr": 0.04026097083296564 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.57, "acc_stderr": 0.049756985195624284, "acc_norm": 0.57, "acc_norm_stderr": 0.049756985195624284 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.5811320754716981, "acc_stderr": 0.030365050829115208, "acc_norm": 0.5811320754716981, "acc_norm_stderr": 0.030365050829115208 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.6041666666666666, "acc_stderr": 0.04089465449325582, "acc_norm": 0.6041666666666666, "acc_norm_stderr": 0.04089465449325582 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.39, "acc_stderr": 0.04902071300001975, "acc_norm": 0.39, "acc_norm_stderr": 0.04902071300001975 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.44, "acc_stderr": 0.04988876515698589, "acc_norm": 0.44, 
"acc_norm_stderr": 0.04988876515698589 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.39, "acc_stderr": 0.04902071300001974, "acc_norm": 0.39, "acc_norm_stderr": 0.04902071300001974 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.5549132947976878, "acc_stderr": 0.03789401760283647, "acc_norm": 0.5549132947976878, "acc_norm_stderr": 0.03789401760283647 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.2549019607843137, "acc_stderr": 0.04336432707993179, "acc_norm": 0.2549019607843137, "acc_norm_stderr": 0.04336432707993179 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.4851063829787234, "acc_stderr": 0.032671518489247764, "acc_norm": 0.4851063829787234, "acc_norm_stderr": 0.032671518489247764 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.32456140350877194, "acc_stderr": 0.04404556157374767, "acc_norm": 0.32456140350877194, "acc_norm_stderr": 0.04404556157374767 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5172413793103449, "acc_stderr": 0.04164188720169375, "acc_norm": 0.5172413793103449, "acc_norm_stderr": 0.04164188720169375 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.3253968253968254, "acc_stderr": 0.024130158299762602, "acc_norm": 0.3253968253968254, "acc_norm_stderr": 0.024130158299762602 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.38095238095238093, "acc_stderr": 0.04343525428949097, "acc_norm": 0.38095238095238093, "acc_norm_stderr": 0.04343525428949097 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.4, "acc_stderr": 0.04923659639173309, "acc_norm": 0.4, "acc_norm_stderr": 0.04923659639173309 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.6612903225806451, "acc_stderr": 0.026923446059302844, "acc_norm": 0.6612903225806451, "acc_norm_stderr": 0.026923446059302844 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.4236453201970443, "acc_stderr": 0.034767257476490364, "acc_norm": 0.4236453201970443, "acc_norm_stderr": 0.034767257476490364 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.54, "acc_stderr": 0.05009082659620332, "acc_norm": 0.54, "acc_norm_stderr": 0.05009082659620332 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.703030303030303, "acc_stderr": 0.0356796977226805, "acc_norm": 0.703030303030303, "acc_norm_stderr": 0.0356796977226805 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7272727272727273, "acc_stderr": 0.03173071239071724, "acc_norm": 0.7272727272727273, "acc_norm_stderr": 0.03173071239071724 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8134715025906736, "acc_stderr": 0.02811209121011748, "acc_norm": 0.8134715025906736, "acc_norm_stderr": 0.02811209121011748 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.541025641025641, "acc_stderr": 0.025265525491284295, "acc_norm": 0.541025641025641, "acc_norm_stderr": 0.025265525491284295 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3148148148148148, "acc_stderr": 0.02831753349606647, "acc_norm": 0.3148148148148148, "acc_norm_stderr": 0.02831753349606647 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.5798319327731093, "acc_stderr": 0.03206183783236152, "acc_norm": 0.5798319327731093, "acc_norm_stderr": 0.03206183783236152 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.2913907284768212, "acc_stderr": 0.03710185726119995, "acc_norm": 0.2913907284768212, "acc_norm_stderr": 0.03710185726119995 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.7577981651376147, "acc_stderr": 0.018368176306598618, "acc_norm": 0.7577981651376147, "acc_norm_stderr": 0.018368176306598618 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4074074074074074, "acc_stderr": 0.03350991604696042, 
"acc_norm": 0.4074074074074074, "acc_norm_stderr": 0.03350991604696042 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.7745098039215687, "acc_stderr": 0.029331162294251735, "acc_norm": 0.7745098039215687, "acc_norm_stderr": 0.029331162294251735 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7848101265822784, "acc_stderr": 0.026750826994676173, "acc_norm": 0.7848101265822784, "acc_norm_stderr": 0.026750826994676173 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6771300448430493, "acc_stderr": 0.03138147637575499, "acc_norm": 0.6771300448430493, "acc_norm_stderr": 0.03138147637575499 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.648854961832061, "acc_stderr": 0.04186445163013751, "acc_norm": 0.648854961832061, "acc_norm_stderr": 0.04186445163013751 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7355371900826446, "acc_stderr": 0.04026187527591207, "acc_norm": 0.7355371900826446, "acc_norm_stderr": 0.04026187527591207 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7314814814814815, "acc_stderr": 0.042844679680521934, "acc_norm": 0.7314814814814815, "acc_norm_stderr": 0.042844679680521934 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.6993865030674846, "acc_stderr": 0.03602511318806771, "acc_norm": 0.6993865030674846, "acc_norm_stderr": 0.03602511318806771 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.38392857142857145, "acc_stderr": 0.04616143075028547, "acc_norm": 0.38392857142857145, "acc_norm_stderr": 0.04616143075028547 }, "harness|hendrycksTest-management|5": { "acc": 0.7669902912621359, "acc_stderr": 0.041858325989283136, "acc_norm": 0.7669902912621359, "acc_norm_stderr": 0.041858325989283136 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8461538461538461, "acc_stderr": 0.023636873317489288, "acc_norm": 0.8461538461538461, "acc_norm_stderr": 0.023636873317489288 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.65, "acc_stderr": 0.0479372485441102, 
"acc_norm": 0.65, "acc_norm_stderr": 0.0479372485441102 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.7726692209450831, "acc_stderr": 0.014987270640946009, "acc_norm": 0.7726692209450831, "acc_norm_stderr": 0.014987270640946009 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.630057803468208, "acc_stderr": 0.02599247202930639, "acc_norm": 0.630057803468208, "acc_norm_stderr": 0.02599247202930639 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.43910614525139663, "acc_stderr": 0.016598022120580428, "acc_norm": 0.43910614525139663, "acc_norm_stderr": 0.016598022120580428 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.6339869281045751, "acc_stderr": 0.02758281141515961, "acc_norm": 0.6339869281045751, "acc_norm_stderr": 0.02758281141515961 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.6366559485530546, "acc_stderr": 0.027316847674192714, "acc_norm": 0.6366559485530546, "acc_norm_stderr": 0.027316847674192714 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.6388888888888888, "acc_stderr": 0.026725868809100793, "acc_norm": 0.6388888888888888, "acc_norm_stderr": 0.026725868809100793 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.40425531914893614, "acc_stderr": 0.029275532159704725, "acc_norm": 0.40425531914893614, "acc_norm_stderr": 0.029275532159704725 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.409387222946545, "acc_stderr": 0.012558780895570752, "acc_norm": 0.409387222946545, "acc_norm_stderr": 0.012558780895570752 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.5147058823529411, "acc_stderr": 0.03035969707904612, "acc_norm": 0.5147058823529411, "acc_norm_stderr": 0.03035969707904612 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.5686274509803921, "acc_stderr": 0.020036393768352638, "acc_norm": 0.5686274509803921, "acc_norm_stderr": 0.020036393768352638 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6636363636363637, "acc_stderr": 0.04525393596302506, 
"acc_norm": 0.6636363636363637, "acc_norm_stderr": 0.04525393596302506 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.6326530612244898, "acc_stderr": 0.030862144921087558, "acc_norm": 0.6326530612244898, "acc_norm_stderr": 0.030862144921087558 }, "harness|hendrycksTest-sociology|5": { "acc": 0.7661691542288557, "acc_stderr": 0.029929415408348384, "acc_norm": 0.7661691542288557, "acc_norm_stderr": 0.029929415408348384 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.85, "acc_stderr": 0.0358870281282637, "acc_norm": 0.85, "acc_norm_stderr": 0.0358870281282637 }, "harness|hendrycksTest-virology|5": { "acc": 0.4939759036144578, "acc_stderr": 0.03892212195333045, "acc_norm": 0.4939759036144578, "acc_norm_stderr": 0.03892212195333045 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8128654970760234, "acc_stderr": 0.02991312723236804, "acc_norm": 0.8128654970760234, "acc_norm_stderr": 0.02991312723236804 }, "harness|truthfulqa:mc|0": { "mc1": 0.3084455324357405, "mc1_stderr": 0.01616803938315687, "mc2": 0.433720225618646, "mc2_stderr": 0.014788704504997708 }, "harness|winogrande|5": { "acc": 0.7734806629834254, "acc_stderr": 0.011764149054698332 }, "harness|gsm8k|5": { "acc": 0.3457164518574678, "acc_stderr": 0.013100422990441583 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. 
-->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard-old
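Summary of original information

Dataset overview

This dataset was created automatically during the evaluation run of the model hywu/Camelidae-8x13B on the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a separate split in every configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the latest results.
  • An additional configuration, "results", stores all the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.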
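The naming scheme described in the list above can be sketched in a few lines. This is a minimal illustration, not the leaderboard's own code: the helper names `task_key_to_config_name` and `timestamp_to_split_name` are hypothetical, and the character substitutions are inferred from the config and split names that appear in the card's YAML.

```python
import re


def task_key_to_config_name(task_key: str) -> str:
    """Map a results key such as 'harness|truthfulqa:mc|0' to a config name.

    Assumption (inferred from the card's YAML): '|', ':' and '-' all become '_'.
    """
    return re.sub(r"[|:-]", "_", task_key)


def timestamp_to_split_name(run_timestamp: str) -> str:
    """Map a run timestamp to its split name.

    Assumption (inferred from the card's YAML): '-' and ':' become '_',
    while the '.' before the fractional seconds is kept.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


print(task_key_to_config_name("harness|hendrycksTest-college_physics|5"))
# harness_hendrycksTest_college_physics_5
print(timestamp_to_split_name("2024-01-10T19:10:01.237565"))
# 2024_01_10T19_10_01.237565
```

Both outputs match names that appear verbatim in the card, which is the only evidence for the mapping; other runs or tasks may follow a different convention.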

Data loading example

```python
from datasets import load_dataset

# The card's configs define two splits: the run timestamp and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_hywu__Camelidae-8x13B",
    "harness_winogrande_5",
    split="latest",
)
```

Latest results

The latest results come from run 2024-01-10T19:10:01.237565. The full per-task metrics are listed under "Latest results" in the dataset card above.

Collected summary
Dataset introduction
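Construction
In large language model evaluation, the Open LLM Leaderboard provides a standardized platform for quantifying model performance. This dataset was produced by the automated evaluation of hywu/Camelidae-8x13B on the leaderboard. It consists of 63 configurations, each corresponding to one evaluated task, covering established benchmarks such as ARC Challenge, HellaSwag, and GSM8K. The dataset was created from a single run; each run's results are stored as a separate split named after the run's timestamp, while the 'train' split always points to the latest results. An additional configuration named 'results' aggregates all metrics from the run and is used to compute and display the model's overall performance on the leaderboard.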
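The relationship between task keys in the results and configuration names can be sketched in a few lines. This is an illustration rather than the leaderboard's own code: the helper `task_to_config` is hypothetical, and the rule it applies (replace `|`, `:` and `-` with `_`) is an assumption inferred from the card's single example, where the 5-shot WinoGrande task is loaded as `harness_winogrande_5`.

```python
import re


def task_to_config(task_key: str) -> str:
    """Map a results key such as 'harness|winogrande|5' to a config name.

    Assumption: every '|', ':' and '-' in the key becomes '_' in the
    configuration name. Only 'harness_winogrande_5' is confirmed by the card.
    """
    return re.sub(r"[|:\-]", "_", task_key)


for key in (
    "harness|winogrande|5",
    "harness|arc:challenge|25",
    "harness|hendrycksTest-anatomy|5",
):
    print(key, "->", task_to_config(key))
```

If a derived name does not match a configuration in the repository, the configuration list shown on the dataset page is the authoritative source.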
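Features
The dataset is distinctly structured: it systematically organizes fine-grained evaluation results across diverse tasks. Each of the 63 configurations corresponds to a task and contains timestamp-named splits, which keeps the evaluation history traceable and the latest results immediately accessible. Notably, the dataset stores per-task raw metrics (such as accuracy and its standard error) and also provides overall aggregates through the 'results' configuration, including average accuracy and normalized accuracy across tasks and the mc1 and mc2 scores for tasks such as TruthfulQA. This design lets researchers examine model capability from the macro to the micro level, supporting comparison across tasks as well as detailed analysis of specific domains.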
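To make the pairing of accuracy and standard error concrete, here is a small sketch of how such a standard error can be computed from 0/1 scores. It assumes the standard error of the mean with the sample (n - 1) variance, and it assumes the abstract algebra task has 100 questions; neither detail is stated in the card. Under those assumptions the result lands very close to the `acc_stderr` reported for that task.

```python
import math


def acc_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) scores.

    The sample variance of 0/1 scores is acc * (1 - acc) * n / (n - 1);
    dividing by n and taking the square root gives the expression below.
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Assumption: 100 questions in the abstract_algebra test set.
estimate = acc_stderr(0.32, 100)
reported = 0.046882617226215034  # acc_stderr from the results above
print(estimate, reported)
```

Because the standard error shrinks roughly with the square root of the number of questions, small tasks carry wide uncertainty, which is worth keeping in mind when comparing per-task accuracies.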
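Usage
Researchers can load a specific task and split with the Hugging Face Datasets library. For example, calling `load_dataset` with the configuration name 'harness_winogrande_5' and the split 'train' returns the latest evaluation details for the WinoGrande task. Earlier evaluations can be loaded through the split named after the corresponding timestamp. For aggregate analysis, loading the 'results' configuration returns all aggregated metrics for the run, including each task's accuracy and standard error, so the model's overall performance across benchmarks can be assessed efficiently. This mechanism provides a solid data foundation for iterating on models and comparing versions.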
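The loading pattern described above might look like the following sketch. The repository id is copied from the card's own snippet; the page header lists the dataset under `open-llm-leaderboard-old`, so the id may need adjusting. The helpers `timestamp_to_split`, `load_args`, and `load_details` are hypothetical names introduced here, and the split-naming rule (replace `-` and `:` in the run timestamp with `_`) is an assumption, not something the card states.

```python
REPO_ID = "open-llm-leaderboard/details_hywu__Camelidae-8x13B"  # as written in the card


def timestamp_to_split(timestamp: str) -> str:
    # Assumption: split names replace '-' and ':' in the run timestamp with '_'.
    return timestamp.replace("-", "_").replace(":", "_")


def load_args(config: str = "results", split: str = "train") -> dict:
    # Keyword arguments for datasets.load_dataset (path, name, split).
    return {"path": REPO_ID, "name": config, "split": split}


def load_details(config: str = "results", split: str = "train"):
    # Imported lazily so the helpers above work without the library installed.
    # Calling this function needs `pip install datasets` and network access.
    from datasets import load_dataset

    return load_dataset(**load_args(config, split))


print(load_args("harness_winogrande_5"))
print(timestamp_to_split("2024-01-10T19:10:01.237565"))
```

Calling `load_details("harness_winogrande_5")` would then fetch the latest WinoGrande details, and `load_details("results")` the aggregated metrics; passing a timestamp-derived split name instead of 'train' targets a specific run, provided the naming assumption holds.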
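Background and challenges
Background overview
As large language model (LLM) capabilities advance rapidly, evaluating models systematically and fairly across multiple dimensions has become a core concern for both academia and industry. The Open LLM Leaderboard was launched by the Hugging Face team in 2023 to give the community a benchmark for comparing models under a standardized evaluation framework. This dataset records the results of the evaluation run of hywu/Camelidae-8x13B on January 10, 2024, covering 63 task configurations that range from commonsense reasoning (e.g., ARC-Challenge, HellaSwag) to mathematical problem solving (GSM8K), and from multidisciplinary knowledge (all 57 MMLU subjects) to adversarial truthfulness (TruthfulQA). As an open evaluation resource, the dataset gives developers a fine-grained snapshot of model performance and promotes transparency and reproducibility in LLM evaluation, making it a useful reference for subsequent model optimization and selection.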
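Current challenges
The domain problem this dataset addresses is that LLM evaluation has long suffered from narrow task coverage, inconsistent metrics, and irreproducible results. Specific challenges include: 1) Comprehensive evaluation coverage: a model must demonstrate ability across tasks ranging from basic reasoning to specialized knowledge, but task difficulty varies widely. For example, Camelidae-8x13B reaches 81.3% accuracy on high school government and politics but only 25.5% on college physics, highlighting the model's weakness in complex reasoning. 2) Standardization during construction: the dataset must integrate evaluation tasks from different sources (such as HendrycksTest and Winogrande), keep the data format and sampling strategy consistent across configurations, and handle the version management that arises from multiple runs. 3) Interpretability of results: given the sprawling results of 63 configurations, extracting a meaningful picture of model capability from aggregate metrics (such as the 57.3% average accuracy) remains an ongoing challenge.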
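The spread between strong and weak tasks can be checked directly from the reported numbers. The sketch below uses a handful of `acc` values copied from the results excerpt; the selection of tasks is arbitrary and only for illustration, so it says nothing about tasks outside this subset.

```python
# Accuracy values copied from the results excerpt (hendrycksTest tasks, 5-shot).
acc = {
    "high_school_government_and_politics": 0.8134715025906736,
    "high_school_geography": 0.7272727272727273,
    "computer_security": 0.7,
    "abstract_algebra": 0.32,
    "college_physics": 0.2549019607843137,
}

# Rank the tasks from highest to lowest accuracy.
ranked = sorted(acc.items(), key=lambda item: item[1], reverse=True)
best, worst = ranked[0], ranked[-1]
gap = best[1] - worst[1]

for name, value in ranked:
    print(f"{name:40s} {value:.1%}")
print(f"gap between {best[0]} and {worst[0]}: {gap:.1%}")
```

A gap of more than 55 percentage points within a single benchmark family illustrates why a single aggregate figure hides much of the model's capability profile.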
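Common scenarios
Typical use cases
The dataset comes from a systematic evaluation of the large language model Camelidae-8x13B on the Open LLM Leaderboard and covers 63 fine-grained task configurations, including ARC Challenge, HellaSwag, GSM8K, Winogrande, and the MMLU benchmark spanning 57 subjects. Researchers can use it to reproduce the model's performance in commonsense reasoning, mathematical problem solving, knowledge question answering, and other dimensions. It is particularly suited to comparing large language models of different sizes or architectures under a standardized evaluation framework. Its structured design supports loading results by task type, providing fine-grained data for analyzing model capabilities.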
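Derived and related work
The dataset has driven several lines of model analysis and improvement work. Based on its fine-grained evaluation results, researchers have developed task-level performance attribution methods to locate the causes of Camelidae-8x13B's weak performance in scientific areas such as abstract algebra (32.0%) and college physics (25.5%). The dataset has also prompted research on fine-tuning strategies for specific weak tasks, for example raising GSM8K accuracy through training on additional mathematical corpora. Its structured evaluation framework has likewise been adopted by later work as a standardized template for building multidimensional capability profiles of large language models.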
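Recent research on the dataset
Latest research directions
In large language model evaluation, the Open LLM Leaderboard has become an important benchmark for measuring overall model capability. This dataset records the detailed evaluation results of hywu/Camelidae-8x13B on 63 tasks, spanning commonsense reasoning, mathematical reasoning, knowledge question answering, and other dimensions. Current research focuses on using fine-grained, multi-task evaluation frameworks to reveal how a model's performance differs across capability dimensions and where its limits lie. The model performs well on tasks such as HellaSwag and Winogrande but shows weaknesses on GSM8K mathematical reasoning and on some specialized knowledge tests, which points to room for improvement in logical reasoning and in the transfer of specialized knowledge. The release of this dataset gives the community a reproducible evaluation baseline and advances transparent, standardized comparison of model performance, which matters for building more robust and trustworthy general-purpose AI systems.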
The content above was collected and summarized by 遇见数据集.