
open-llm-leaderboard-old/details_DreadPoor__IamSoTired-7B-slerp

Hugging Face | Updated 2024-02-21 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_DreadPoor__IamSoTired-7B-slerp
Resource description:
--- pretty_name: Evaluation run of DreadPoor/IamSoTired-7B-slerp dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [DreadPoor/IamSoTired-7B-slerp](https://huggingface.co/DreadPoor/IamSoTired-7B-slerp)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_DreadPoor__IamSoTired-7B-slerp\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-02-21T05:09:14.682836](https://huggingface.co/datasets/open-llm-leaderboard/details_DreadPoor__IamSoTired-7B-slerp/blob/main/results_2024-02-21T05-09-14.682836.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6527488946903386,\n\ \ \"acc_stderr\": 0.032060653196688445,\n \"acc_norm\": 0.6531026089924603,\n\ \ \"acc_norm_stderr\": 0.032718608980275114,\n \"mc1\": 0.4773561811505508,\n\ \ \"mc1_stderr\": 0.01748554225848965,\n \"mc2\": 0.63747395012218,\n\ \ \"mc2_stderr\": 0.015395518429717168\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6706484641638225,\n \"acc_stderr\": 0.013734057652635474,\n\ \ \"acc_norm\": 0.6988054607508533,\n \"acc_norm_stderr\": 0.013406741767847638\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6984664409480184,\n\ \ \"acc_stderr\": 0.00457985908450079,\n \"acc_norm\": 0.871539533957379,\n\ \ \"acc_norm_stderr\": 0.0033391798350182918\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.047609522856952365,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.047609522856952365\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6592592592592592,\n\ \ \"acc_stderr\": 0.040943762699967926,\n \"acc_norm\": 0.6592592592592592,\n\ \ \"acc_norm_stderr\": 0.040943762699967926\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.6907894736842105,\n \"acc_stderr\": 0.037610708698674805,\n\ \ \"acc_norm\": 0.6907894736842105,\n \"acc_norm_stderr\": 0.037610708698674805\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.62,\n\ \ \"acc_stderr\": 0.048783173121456316,\n \"acc_norm\": 0.62,\n \ \ \"acc_norm_stderr\": 0.048783173121456316\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7245283018867924,\n \"acc_stderr\": 0.027495663683724057,\n\ \ \"acc_norm\": 0.7245283018867924,\n \"acc_norm_stderr\": 0.027495663683724057\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.75,\n\ \ \"acc_stderr\": 0.03621034121889507,\n \"acc_norm\": 0.75,\n \ \ \"acc_norm_stderr\": 0.03621034121889507\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.44,\n \"acc_stderr\": 0.04988876515698589,\n \ \ \"acc_norm\": 0.44,\n \"acc_norm_stderr\": 0.04988876515698589\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.5,\n \"acc_stderr\": 0.050251890762960605,\n \"acc_norm\": 0.5,\n\ \ \"acc_norm_stderr\": 0.050251890762960605\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.33,\n \"acc_stderr\": 0.04725815626252604,\n \ \ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.04725815626252604\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6878612716763006,\n\ \ \"acc_stderr\": 0.03533133389323657,\n \"acc_norm\": 0.6878612716763006,\n\ \ \"acc_norm_stderr\": 0.03533133389323657\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.47058823529411764,\n \"acc_stderr\": 0.04966570903978529,\n\ \ \"acc_norm\": 0.47058823529411764,\n \"acc_norm_stderr\": 0.04966570903978529\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.76,\n \"acc_stderr\": 0.04292346959909283,\n \"acc_norm\": 0.76,\n\ \ \"acc_norm_stderr\": 0.04292346959909283\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5957446808510638,\n \"acc_stderr\": 0.03208115750788684,\n\ \ \"acc_norm\": 0.5957446808510638,\n \"acc_norm_stderr\": 0.03208115750788684\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5,\n\ \ \"acc_stderr\": 0.047036043419179864,\n \"acc_norm\": 0.5,\n \ \ \"acc_norm_stderr\": 0.047036043419179864\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5517241379310345,\n \"acc_stderr\": 0.04144311810878152,\n\ \ \"acc_norm\": 0.5517241379310345,\n \"acc_norm_stderr\": 0.04144311810878152\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.41005291005291006,\n \"acc_stderr\": 0.025331202438944433,\n \"\ acc_norm\": 0.41005291005291006,\n \"acc_norm_stderr\": 0.025331202438944433\n\ \ 
},\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4603174603174603,\n\ \ \"acc_stderr\": 0.04458029125470973,\n \"acc_norm\": 0.4603174603174603,\n\ \ \"acc_norm_stderr\": 0.04458029125470973\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7806451612903226,\n\ \ \"acc_stderr\": 0.023540799358723295,\n \"acc_norm\": 0.7806451612903226,\n\ \ \"acc_norm_stderr\": 0.023540799358723295\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.5123152709359606,\n \"acc_stderr\": 0.035169204442208966,\n\ \ \"acc_norm\": 0.5123152709359606,\n \"acc_norm_stderr\": 0.035169204442208966\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.71,\n \"acc_stderr\": 0.045604802157206845,\n \"acc_norm\"\ : 0.71,\n \"acc_norm_stderr\": 0.045604802157206845\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7757575757575758,\n \"acc_stderr\": 0.03256866661681102,\n\ \ \"acc_norm\": 0.7757575757575758,\n \"acc_norm_stderr\": 0.03256866661681102\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7777777777777778,\n \"acc_stderr\": 0.02962022787479048,\n \"\ acc_norm\": 0.7777777777777778,\n \"acc_norm_stderr\": 0.02962022787479048\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.9015544041450777,\n \"acc_stderr\": 0.021500249576033484,\n\ \ \"acc_norm\": 0.9015544041450777,\n \"acc_norm_stderr\": 0.021500249576033484\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6717948717948717,\n \"acc_stderr\": 0.023807633198657266,\n\ \ \"acc_norm\": 0.6717948717948717,\n \"acc_norm_stderr\": 0.023807633198657266\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 
0.34074074074074073,\n \"acc_stderr\": 0.028897748741131147,\n \ \ \"acc_norm\": 0.34074074074074073,\n \"acc_norm_stderr\": 0.028897748741131147\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6932773109243697,\n \"acc_stderr\": 0.029953823891887037,\n\ \ \"acc_norm\": 0.6932773109243697,\n \"acc_norm_stderr\": 0.029953823891887037\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.3509933774834437,\n \"acc_stderr\": 0.03896981964257375,\n \"\ acc_norm\": 0.3509933774834437,\n \"acc_norm_stderr\": 0.03896981964257375\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8477064220183487,\n \"acc_stderr\": 0.015405084393157074,\n \"\ acc_norm\": 0.8477064220183487,\n \"acc_norm_stderr\": 0.015405084393157074\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5138888888888888,\n \"acc_stderr\": 0.034086558679777494,\n \"\ acc_norm\": 0.5138888888888888,\n \"acc_norm_stderr\": 0.034086558679777494\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8382352941176471,\n \"acc_stderr\": 0.025845017986926917,\n \"\ acc_norm\": 0.8382352941176471,\n \"acc_norm_stderr\": 0.025845017986926917\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.810126582278481,\n \"acc_stderr\": 0.025530100460233494,\n \ \ \"acc_norm\": 0.810126582278481,\n \"acc_norm_stderr\": 0.025530100460233494\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.695067264573991,\n\ \ \"acc_stderr\": 0.030898610882477515,\n \"acc_norm\": 0.695067264573991,\n\ \ \"acc_norm_stderr\": 0.030898610882477515\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7786259541984732,\n \"acc_stderr\": 0.036412970813137276,\n\ \ \"acc_norm\": 0.7786259541984732,\n \"acc_norm_stderr\": 0.036412970813137276\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7851239669421488,\n \"acc_stderr\": 
0.037494924487096966,\n \"\ acc_norm\": 0.7851239669421488,\n \"acc_norm_stderr\": 0.037494924487096966\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7962962962962963,\n\ \ \"acc_stderr\": 0.03893542518824847,\n \"acc_norm\": 0.7962962962962963,\n\ \ \"acc_norm_stderr\": 0.03893542518824847\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7668711656441718,\n \"acc_stderr\": 0.0332201579577674,\n\ \ \"acc_norm\": 0.7668711656441718,\n \"acc_norm_stderr\": 0.0332201579577674\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.4732142857142857,\n\ \ \"acc_stderr\": 0.047389751192741546,\n \"acc_norm\": 0.4732142857142857,\n\ \ \"acc_norm_stderr\": 0.047389751192741546\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7864077669902912,\n \"acc_stderr\": 0.040580420156460344,\n\ \ \"acc_norm\": 0.7864077669902912,\n \"acc_norm_stderr\": 0.040580420156460344\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8632478632478633,\n\ \ \"acc_stderr\": 0.022509033937077802,\n \"acc_norm\": 0.8632478632478633,\n\ \ \"acc_norm_stderr\": 0.022509033937077802\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.72,\n \"acc_stderr\": 0.04512608598542128,\n \ \ \"acc_norm\": 0.72,\n \"acc_norm_stderr\": 0.04512608598542128\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8339719029374202,\n\ \ \"acc_stderr\": 0.0133064782430663,\n \"acc_norm\": 0.8339719029374202,\n\ \ \"acc_norm_stderr\": 0.0133064782430663\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7456647398843931,\n \"acc_stderr\": 0.023445826276545543,\n\ \ \"acc_norm\": 0.7456647398843931,\n \"acc_norm_stderr\": 0.023445826276545543\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.40782122905027934,\n\ \ \"acc_stderr\": 0.016435865260914746,\n \"acc_norm\": 0.40782122905027934,\n\ \ \"acc_norm_stderr\": 0.016435865260914746\n },\n 
\"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7287581699346405,\n \"acc_stderr\": 0.025457756696667878,\n\ \ \"acc_norm\": 0.7287581699346405,\n \"acc_norm_stderr\": 0.025457756696667878\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7202572347266881,\n\ \ \"acc_stderr\": 0.02549425935069491,\n \"acc_norm\": 0.7202572347266881,\n\ \ \"acc_norm_stderr\": 0.02549425935069491\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.75,\n \"acc_stderr\": 0.02409347123262133,\n \ \ \"acc_norm\": 0.75,\n \"acc_norm_stderr\": 0.02409347123262133\n \ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"acc\"\ : 0.4929078014184397,\n \"acc_stderr\": 0.02982449855912901,\n \"\ acc_norm\": 0.4929078014184397,\n \"acc_norm_stderr\": 0.02982449855912901\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.46088657105606257,\n\ \ \"acc_stderr\": 0.012731102790504519,\n \"acc_norm\": 0.46088657105606257,\n\ \ \"acc_norm_stderr\": 0.012731102790504519\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6838235294117647,\n \"acc_stderr\": 0.028245687391462937,\n\ \ \"acc_norm\": 0.6838235294117647,\n \"acc_norm_stderr\": 0.028245687391462937\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6633986928104575,\n \"acc_stderr\": 0.01911721391149515,\n \ \ \"acc_norm\": 0.6633986928104575,\n \"acc_norm_stderr\": 0.01911721391149515\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6818181818181818,\n\ \ \"acc_stderr\": 0.04461272175910509,\n \"acc_norm\": 0.6818181818181818,\n\ \ \"acc_norm_stderr\": 0.04461272175910509\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7346938775510204,\n \"acc_stderr\": 0.028263889943784593,\n\ \ \"acc_norm\": 0.7346938775510204,\n \"acc_norm_stderr\": 0.028263889943784593\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.845771144278607,\n\ \ \"acc_stderr\": 
0.025538433368578334,\n \"acc_norm\": 0.845771144278607,\n\ \ \"acc_norm_stderr\": 0.025538433368578334\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.85,\n \"acc_stderr\": 0.0358870281282637,\n \ \ \"acc_norm\": 0.85,\n \"acc_norm_stderr\": 0.0358870281282637\n },\n\ \ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.536144578313253,\n\ \ \"acc_stderr\": 0.038823108508905954,\n \"acc_norm\": 0.536144578313253,\n\ \ \"acc_norm_stderr\": 0.038823108508905954\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8362573099415205,\n \"acc_stderr\": 0.028380919596145866,\n\ \ \"acc_norm\": 0.8362573099415205,\n \"acc_norm_stderr\": 0.028380919596145866\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.4773561811505508,\n\ \ \"mc1_stderr\": 0.01748554225848965,\n \"mc2\": 0.63747395012218,\n\ \ \"mc2_stderr\": 0.015395518429717168\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.823993685872139,\n \"acc_stderr\": 0.010703090882320705\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6618650492797574,\n \ \ \"acc_stderr\": 0.013030829145172208\n }\n}\n```" repo_url: https://huggingface.co/DreadPoor/IamSoTired-7B-slerp leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|arc:challenge|25_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-02-21T05-09-14.682836.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|gsm8k|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hellaswag|10_2024-02-21T05-09-14.682836.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-21T05-09-14.682836.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-management|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-21T05-09-14.682836.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-02-21T05-09-14.682836.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-21T05-09-14.682836.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-management|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-02-21T05-09-14.682836.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-02-21T05-09-14.682836.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-21T05-09-14.682836.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-international_law|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-management|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-marketing|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-sociology|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-virology|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-02-21T05-09-14.682836.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|truthfulqa:mc|0_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-02-21T05-09-14.682836.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_02_21T05_09_14.682836 path: - '**/details_harness|winogrande|5_2024-02-21T05-09-14.682836.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-02-21T05-09-14.682836.parquet' - config_name: results data_files: - split: 
2024_02_21T05_09_14.682836 path: - results_2024-02-21T05-09-14.682836.parquet - split: latest path: - results_2024-02-21T05-09-14.682836.parquet
---

# Dataset Card for Evaluation run of DreadPoor/IamSoTired-7B-slerp

Dataset automatically created during the evaluation run of model [DreadPoor/IamSoTired-7B-slerp](https://huggingface.co/DreadPoor/IamSoTired-7B-slerp) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_DreadPoor__IamSoTired-7B-slerp",
	"harness_winogrande_5",
	split="train")
```

## Latest results

These are the [latest results from run 2024-02-21T05:09:14.682836](https://huggingface.co/datasets/open-llm-leaderboard/details_DreadPoor__IamSoTired-7B-slerp/blob/main/results_2024-02-21T05-09-14.682836.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6527488946903386, "acc_stderr": 0.032060653196688445, "acc_norm": 0.6531026089924603, "acc_norm_stderr": 0.032718608980275114, "mc1": 0.4773561811505508, "mc1_stderr": 0.01748554225848965, "mc2": 0.63747395012218, "mc2_stderr": 0.015395518429717168 }, "harness|arc:challenge|25": { "acc": 0.6706484641638225, "acc_stderr": 0.013734057652635474, "acc_norm": 0.6988054607508533, "acc_norm_stderr": 0.013406741767847638 }, "harness|hellaswag|10": { "acc": 0.6984664409480184, "acc_stderr": 0.00457985908450079, "acc_norm": 0.871539533957379, "acc_norm_stderr": 0.0033391798350182918 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.34, "acc_stderr": 0.047609522856952365, "acc_norm": 0.34, "acc_norm_stderr": 0.047609522856952365 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6592592592592592, "acc_stderr": 0.040943762699967926, "acc_norm": 0.6592592592592592, "acc_norm_stderr": 0.040943762699967926 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.6907894736842105, "acc_stderr": 0.037610708698674805, "acc_norm": 0.6907894736842105, "acc_norm_stderr": 0.037610708698674805 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.62, "acc_stderr": 0.048783173121456316, "acc_norm": 0.62, "acc_norm_stderr": 0.048783173121456316 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7245283018867924, "acc_stderr": 0.027495663683724057, "acc_norm": 0.7245283018867924, "acc_norm_stderr": 0.027495663683724057 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.75, "acc_stderr": 0.03621034121889507, "acc_norm": 0.75, "acc_norm_stderr": 0.03621034121889507 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.44, "acc_stderr": 0.04988876515698589, "acc_norm": 0.44, "acc_norm_stderr": 0.04988876515698589 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.5, "acc_stderr": 0.050251890762960605, "acc_norm": 0.5, "acc_norm_stderr": 0.050251890762960605 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.33, "acc_stderr": 0.04725815626252604, "acc_norm": 0.33, "acc_norm_stderr": 0.04725815626252604 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6878612716763006, "acc_stderr": 0.03533133389323657, "acc_norm": 0.6878612716763006, "acc_norm_stderr": 0.03533133389323657 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.47058823529411764, "acc_stderr": 0.04966570903978529, "acc_norm": 0.47058823529411764, "acc_norm_stderr": 0.04966570903978529 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.76, "acc_stderr": 0.04292346959909283, "acc_norm": 0.76, "acc_norm_stderr": 0.04292346959909283 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5957446808510638, "acc_stderr": 0.03208115750788684, "acc_norm": 0.5957446808510638, "acc_norm_stderr": 0.03208115750788684 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.5, "acc_stderr": 0.047036043419179864, "acc_norm": 0.5, "acc_norm_stderr": 0.047036043419179864 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5517241379310345, "acc_stderr": 0.04144311810878152, "acc_norm": 0.5517241379310345, "acc_norm_stderr": 0.04144311810878152 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.41005291005291006, "acc_stderr": 0.025331202438944433, "acc_norm": 0.41005291005291006, "acc_norm_stderr": 0.025331202438944433 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4603174603174603, "acc_stderr": 0.04458029125470973, "acc_norm": 0.4603174603174603, "acc_norm_stderr": 0.04458029125470973 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7806451612903226, "acc_stderr": 0.023540799358723295, "acc_norm": 0.7806451612903226, "acc_norm_stderr": 0.023540799358723295 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5123152709359606, "acc_stderr": 
0.035169204442208966, "acc_norm": 0.5123152709359606, "acc_norm_stderr": 0.035169204442208966 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.71, "acc_stderr": 0.045604802157206845, "acc_norm": 0.71, "acc_norm_stderr": 0.045604802157206845 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7757575757575758, "acc_stderr": 0.03256866661681102, "acc_norm": 0.7757575757575758, "acc_norm_stderr": 0.03256866661681102 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7777777777777778, "acc_stderr": 0.02962022787479048, "acc_norm": 0.7777777777777778, "acc_norm_stderr": 0.02962022787479048 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.9015544041450777, "acc_stderr": 0.021500249576033484, "acc_norm": 0.9015544041450777, "acc_norm_stderr": 0.021500249576033484 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6717948717948717, "acc_stderr": 0.023807633198657266, "acc_norm": 0.6717948717948717, "acc_norm_stderr": 0.023807633198657266 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.34074074074074073, "acc_stderr": 0.028897748741131147, "acc_norm": 0.34074074074074073, "acc_norm_stderr": 0.028897748741131147 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6932773109243697, "acc_stderr": 0.029953823891887037, "acc_norm": 0.6932773109243697, "acc_norm_stderr": 0.029953823891887037 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3509933774834437, "acc_stderr": 0.03896981964257375, "acc_norm": 0.3509933774834437, "acc_norm_stderr": 0.03896981964257375 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8477064220183487, "acc_stderr": 0.015405084393157074, "acc_norm": 0.8477064220183487, "acc_norm_stderr": 0.015405084393157074 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5138888888888888, "acc_stderr": 0.034086558679777494, "acc_norm": 0.5138888888888888, "acc_norm_stderr": 0.034086558679777494 }, 
"harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8382352941176471, "acc_stderr": 0.025845017986926917, "acc_norm": 0.8382352941176471, "acc_norm_stderr": 0.025845017986926917 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.810126582278481, "acc_stderr": 0.025530100460233494, "acc_norm": 0.810126582278481, "acc_norm_stderr": 0.025530100460233494 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.695067264573991, "acc_stderr": 0.030898610882477515, "acc_norm": 0.695067264573991, "acc_norm_stderr": 0.030898610882477515 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7786259541984732, "acc_stderr": 0.036412970813137276, "acc_norm": 0.7786259541984732, "acc_norm_stderr": 0.036412970813137276 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7851239669421488, "acc_stderr": 0.037494924487096966, "acc_norm": 0.7851239669421488, "acc_norm_stderr": 0.037494924487096966 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7962962962962963, "acc_stderr": 0.03893542518824847, "acc_norm": 0.7962962962962963, "acc_norm_stderr": 0.03893542518824847 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7668711656441718, "acc_stderr": 0.0332201579577674, "acc_norm": 0.7668711656441718, "acc_norm_stderr": 0.0332201579577674 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.4732142857142857, "acc_stderr": 0.047389751192741546, "acc_norm": 0.4732142857142857, "acc_norm_stderr": 0.047389751192741546 }, "harness|hendrycksTest-management|5": { "acc": 0.7864077669902912, "acc_stderr": 0.040580420156460344, "acc_norm": 0.7864077669902912, "acc_norm_stderr": 0.040580420156460344 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8632478632478633, "acc_stderr": 0.022509033937077802, "acc_norm": 0.8632478632478633, "acc_norm_stderr": 0.022509033937077802 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.72, "acc_stderr": 0.04512608598542128, "acc_norm": 0.72, "acc_norm_stderr": 0.04512608598542128 }, 
"harness|hendrycksTest-miscellaneous|5": { "acc": 0.8339719029374202, "acc_stderr": 0.0133064782430663, "acc_norm": 0.8339719029374202, "acc_norm_stderr": 0.0133064782430663 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7456647398843931, "acc_stderr": 0.023445826276545543, "acc_norm": 0.7456647398843931, "acc_norm_stderr": 0.023445826276545543 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.40782122905027934, "acc_stderr": 0.016435865260914746, "acc_norm": 0.40782122905027934, "acc_norm_stderr": 0.016435865260914746 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7287581699346405, "acc_stderr": 0.025457756696667878, "acc_norm": 0.7287581699346405, "acc_norm_stderr": 0.025457756696667878 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7202572347266881, "acc_stderr": 0.02549425935069491, "acc_norm": 0.7202572347266881, "acc_norm_stderr": 0.02549425935069491 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.75, "acc_stderr": 0.02409347123262133, "acc_norm": 0.75, "acc_norm_stderr": 0.02409347123262133 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.4929078014184397, "acc_stderr": 0.02982449855912901, "acc_norm": 0.4929078014184397, "acc_norm_stderr": 0.02982449855912901 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.46088657105606257, "acc_stderr": 0.012731102790504519, "acc_norm": 0.46088657105606257, "acc_norm_stderr": 0.012731102790504519 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6838235294117647, "acc_stderr": 0.028245687391462937, "acc_norm": 0.6838235294117647, "acc_norm_stderr": 0.028245687391462937 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6633986928104575, "acc_stderr": 0.01911721391149515, "acc_norm": 0.6633986928104575, "acc_norm_stderr": 0.01911721391149515 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6818181818181818, "acc_stderr": 0.04461272175910509, "acc_norm": 0.6818181818181818, "acc_norm_stderr": 0.04461272175910509 }, 
"harness|hendrycksTest-security_studies|5": { "acc": 0.7346938775510204, "acc_stderr": 0.028263889943784593, "acc_norm": 0.7346938775510204, "acc_norm_stderr": 0.028263889943784593 }, "harness|hendrycksTest-sociology|5": { "acc": 0.845771144278607, "acc_stderr": 0.025538433368578334, "acc_norm": 0.845771144278607, "acc_norm_stderr": 0.025538433368578334 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.85, "acc_stderr": 0.0358870281282637, "acc_norm": 0.85, "acc_norm_stderr": 0.0358870281282637 }, "harness|hendrycksTest-virology|5": { "acc": 0.536144578313253, "acc_stderr": 0.038823108508905954, "acc_norm": 0.536144578313253, "acc_norm_stderr": 0.038823108508905954 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8362573099415205, "acc_stderr": 0.028380919596145866, "acc_norm": 0.8362573099415205, "acc_norm_stderr": 0.028380919596145866 }, "harness|truthfulqa:mc|0": { "mc1": 0.4773561811505508, "mc1_stderr": 0.01748554225848965, "mc2": 0.63747395012218, "mc2_stderr": 0.015395518429717168 }, "harness|winogrande|5": { "acc": 0.823993685872139, "acc_stderr": 0.010703090882320705 }, "harness|gsm8k|5": { "acc": 0.6618650492797574, "acc_stderr": 0.013030829145172208 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Dataset Structure

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Data Collection and Processing

[More Information Needed]

#### Who are the source data producers?

[More Information Needed]

### Annotations [optional]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

#### Personal and Sensitive Information

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard-old
Summary of original information

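Dataset overview

This dataset was created automatically during the evaluation run of the model DreadPoor/IamSoTired-7B-slerp for the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional "results" configuration stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.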
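The naming scheme described in the list above can be sketched in a few lines. This is a minimal sketch whose rules are inferred only from the configuration, split, and file names visible in the card metadata (for example `harness_truthfulqa_mc_0` and `2024_02_21T05_09_14.682836`); the function names `config_name`, `split_name`, and `details_file` are hypothetical helpers, not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|winogrande|5' to a configuration name.

    Assumption: every '|', ':' and '-' in the key becomes '_', which matches
    the configuration names listed in the card metadata.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to its split name ('-' and ':' become '_')."""
    return re.sub(r"[-:]", "_", run_timestamp)


def details_file(task_key: str, run_timestamp: str) -> str:
    """Build the Parquet file name for one task (':' in the timestamp becomes '-')."""
    return f"details_{task_key}_{run_timestamp.replace(':', '-')}.parquet"


if __name__ == "__main__":
    run = "2024-02-21T05:09:14.682836"
    print(config_name("harness|hendrycksTest-college_physics|5"))
    print(split_name(run))
    print(details_file("harness|winogrande|5", run))
```

Keeping the three mappings separate mirrors the metadata, where the same timestamp appears in two different spellings: underscores in split names and hyphens in file names.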

Data loading example

    from datasets import load_dataset

    data = load_dataset(
        "open-llm-leaderboard/details_DreadPoor__IamSoTired-7B-slerp",
        "harness_winogrande_5",
        split="train",
    )

Latest results

The latest results come from the run of 2024-02-21T05:09:14.682836. The full per-task metrics are listed in the "Latest results" section of the dataset card above.

Collected Summary
Dataset Introduction
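Construction
This dataset is a by-product of evaluation on the Open LLM Leaderboard, a widely used benchmark platform for large language models. It was built specifically to record the evaluation results of the DreadPoor/IamSoTired-7B-slerp model and was generated automatically from a single run. It contains 63 configurations, each corresponding to one evaluated task, such as ARC Challenge, HellaSwag, or GSM8K. The results of each evaluation run are stored as a dedicated split named after the run's timestamp, and the "train" split always points to the latest results. A separate configuration named "results" aggregates the metrics of all runs and is used to display aggregated performance on the leaderboard. The data is stored in Parquet format for efficient access.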
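The relationship between the keys of the results JSON (for example `harness|arc:challenge|25`) and the configuration names can be sketched as follows. This is a minimal illustration under an assumption: that a configuration name is obtained by replacing every run of non-alphanumeric characters in the key with an underscore. That rule is inferred only from the names `harness_winogrande_5` and `harness_arc_challenge_25` cited on this page, and `config_name_from_key` is a hypothetical helper, not part of any library.

```python
import re


def config_name_from_key(results_key: str) -> str:
    """Map a results key to a configuration name (assumed convention)."""
    # Collapse each run of characters outside [0-9A-Za-z] into one underscore.
    return re.sub(r"[^0-9A-Za-z]+", "_", results_key).strip("_")


print(config_name_from_key("harness|arc:challenge|25"))
print(config_name_from_key("harness|hendrycksTest-abstract_algebra|5"))
```

If the real naming rule differs for some tasks, the helper should be replaced by an explicit lookup of the configuration names published with the dataset.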
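Features
The dataset's core features are its structured evaluation records and transparent metric reporting. It covers a broad range of natural language understanding and reasoning tasks, from commonsense reasoning (Winogrande) to math problem solving (GSM8K) and multi-domain knowledge tests (the HendrycksTest series), and so reflects the model's performance across diverse scenarios. Each task reports accuracy together with its standard error, as well as the normalized metric, which makes the results easier to interpret reliably. Versioning is handled through timestamp-named splits, so changes in model performance can be tracked over time. All results are also summarized in JSON, including TruthfulQA-specific metrics such as mc1 and mc2, giving researchers material for analyzing the model's truthfulness and accuracy in depth.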
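As a worked example of how an accuracy and its standard error can be read together, the sketch below computes an approximate 95% interval as `acc ± 1.96 * stderr`. The normal approximation is an assumption made here for illustration; the source does not state how the reported standard errors are meant to be used, and `normal_interval` is a hypothetical helper.

```python
from typing import Tuple


def normal_interval(acc: float, stderr: float, z: float = 1.96) -> Tuple[float, float]:
    """Return (low, high) for acc ± z * stderr, clipped to [0, 1]."""
    half_width = z * stderr
    return max(0.0, acc - half_width), min(1.0, acc + half_width)


# acc_norm and acc_norm_stderr copied from the "harness|arc:challenge|25" entry.
low, high = normal_interval(0.6988054607508533, 0.013406741767847638)
print(f"ARC-Challenge acc_norm: [{low:.3f}, {high:.3f}]")
```

For this entry the interval is roughly 0.673 to 0.725, so differences of a point or two on ARC-Challenge fall within the reported uncertainty.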
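Usage
Researchers can load the dataset with the Hugging Face Datasets library. Calling `load_dataset` with the dataset name and a task configuration (such as `harness_winogrande_5`) returns the evaluation details for that task. Passing `split="train"` gives the latest results, while a timestamp-named split gives the data from a specific earlier run. For an overall view, load the `results` configuration to get the aggregated metrics of all tasks and compare the model's performance across benchmarks. The dataset is suited to reproducing model results, comparing models, and automating analysis of the evaluation process, and it gives the open-source community a standardized source of evaluation data.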
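The loading steps described above can be sketched as follows, based on the snippet in the dataset card. Two points are assumptions: the repository id is copied from the card (`open-llm-leaderboard/...`), while the hosting page lists the dataset under `open-llm-leaderboard-old/...`, so which id currently resolves has not been verified here; and `load_details` is a hypothetical wrapper introduced for illustration.

```python
from typing import Any

# Repository id as written in the dataset card; the hosting page lists the same
# dataset under "open-llm-leaderboard-old/...", so which id resolves is unverified.
CARD_REPO_ID = "open-llm-leaderboard/details_DreadPoor__IamSoTired-7B-slerp"


def load_details(config: str = "harness_winogrande_5",
                 split: str = "train",
                 repo_id: str = CARD_REPO_ID) -> Any:
    """Load one configuration of the details dataset (needs network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # "train" points at the latest run; a timestamp-named split selects one run.
    return load_dataset(repo_id, config, split=split)
```

Calling `load_details()` would fetch the latest Winogrande details, and `load_details("results")` would fetch the aggregated metrics; both calls require network access and the `datasets` package.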
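Background and Challenges
Background Overview
As large language models (LLMs) develop rapidly, systematically evaluating a model's overall ability across diverse tasks has become a core challenge. The Open LLM Leaderboard, launched by the Hugging Face team in 2023, aims to provide a public, reproducible evaluation benchmark covering reasoning, commonsense, mathematics, and multi-domain knowledge. DreadPoor/IamSoTired-7B-slerp is one of the evaluated models, and its results are collected in a dedicated dataset, created under the lead of Clementine et al., that stores detailed performance metrics for 63 task configurations. The dataset records the model's performance on established benchmarks such as ARC, HellaSwag, and GSM8K, and also includes the 57 MMLU subtasks, which range from abstract algebra to virology. It gives researchers a resource for comparing models side by side and tracking how they change across iterations, and it supports transparent evaluation in the open-source LLM ecosystem.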
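Current Challenges
The problem this dataset addresses is providing a standardized, multi-dimensional evaluation framework for LLMs, since a single traditional benchmark cannot fully reflect a model's real capabilities. Specific challenges include: (1) the model performs weakly on logic-intensive MMLU tasks such as abstract algebra (34%) and college mathematics (33%), which points to the limits of 7B-parameter models in complex reasoning and specialized domain knowledge; (2) building the dataset requires aggregating results from 63 heterogeneous tasks, keeping metrics comparable across task types (for example, generative GSM8K versus discriminative Winogrande), and maintaining run versions that accumulate over time (such as the 2024-02-21 evaluation), all of which place strict demands on data consistency; (3) given model bias and hallucination (for example, an MC1 accuracy of only 47.7% on TruthfulQA), designing evaluation sets that capture such risks remains difficult.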
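The weak spots mentioned in point (1) can be located mechanically by sorting per-task accuracies. The sketch below uses only a handful of values copied from the results JSON excerpt on this page, so it is not a ranking over all 57 MMLU subtasks; `weakest_tasks` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Accuracies copied from the results JSON excerpt on this page (a subset only).
ACCURACY: Dict[str, float] = {
    "abstract_algebra": 0.34,
    "college_mathematics": 0.33,
    "global_facts": 0.31,
    "high_school_mathematics": 0.34074074074074073,
    "college_biology": 0.75,
    "high_school_government_and_politics": 0.9015544041450777,
}


def weakest_tasks(scores: Dict[str, float], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n lowest-scoring tasks, lowest first."""
    return sorted(scores.items(), key=lambda item: item[1])[:n]


for task, acc in weakest_tasks(ACCURACY):
    print(f"{task}: {acc:.1%}")
```

Within this subset, `global_facts` (31%) scores lower than the two mathematics-heavy tasks cited above, which suggests the weak areas are not limited to logic-intensive subjects.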
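Common Scenarios
Classic Use Cases
This dataset comes from the Open LLM Leaderboard's automated evaluation of the DreadPoor/IamSoTired-7B-slerp model. It covers 63 configurations corresponding to established benchmark tasks: ARC-Challenge, HellaSwag, MMLU (57 subject subsets, including abstract algebra, anatomy, and astronomy), TruthfulQA, Winogrande, and GSM8K. Its classic use is as a standardized, fine-grained framework for analyzing model performance. By loading a specific task configuration (such as harness_winogrande_5) together with a timestamp-named split, researchers can reproduce the model's results on each capability precisely, which supports model comparison, robustness checks, and ablation studies.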
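Practical Applications
In practice, the dataset can serve as a tool for model selection and for quality assessment before deployment. By analyzing the fine-grained results under configurations such as harness_arc_challenge_25, developers can quickly locate a model's weaknesses in complex reasoning or in specific subject areas, and use that to guide the choice of fine-tuning strategy or model-merging scheme. Its structured data format (Parquet) also makes it easy to integrate into automated evaluation pipelines, supports performance monitoring across model versions in continuous integration, and reduces the time spent reproducing evaluation results by hand.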
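One way such monitoring could look is a check that flags tasks where a new model version falls clearly below the recorded baseline. The sketch below is illustrative only: the baseline pairs are copied from the results JSON on this page, the candidate accuracies are invented placeholders rather than real measurements, and the rule "more than `k` standard errors below baseline" is an assumption, not a method described by the source.

```python
from typing import Dict, List, Tuple

# (acc, acc_stderr) pairs copied from the results JSON excerpt on this page.
BASELINE: Dict[str, Tuple[float, float]] = {
    "arc:challenge": (0.6706484641638225, 0.013734057652635474),
    "hellaswag": (0.6984664409480184, 0.00457985908450079),
}

# Hypothetical accuracies for a new model version; NOT real measurements.
CANDIDATE: Dict[str, float] = {"arc:challenge": 0.66, "hellaswag": 0.65}


def find_regressions(baseline: Dict[str, Tuple[float, float]],
                     candidate: Dict[str, float],
                     k: float = 2.0) -> List[str]:
    """List tasks whose candidate accuracy is more than k stderr below baseline."""
    flagged = []
    for task, (acc, stderr) in baseline.items():
        # Tasks missing from the candidate run are skipped rather than flagged.
        if task in candidate and candidate[task] < acc - k * stderr:
            flagged.append(task)
    return flagged


print(find_regressions(BASELINE, CANDIDATE))
```

With these placeholder numbers only `hellaswag` is flagged: the ARC-Challenge drop stays within two standard errors of the baseline, while the HellaSwag drop does not.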
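Derived and Related Work
The dataset has given rise to several pieces of work on evaluation methodology. For example, building on its multi-task configuration structure, researchers have developed tools that visualize a model's capability profile, producing radar charts that show its strengths and weaknesses in reasoning, knowledge, mathematics, and other dimensions. Its timestamp-split mechanism has also informed the design of dynamic evaluation frameworks, making it possible to track performance degradation across training stages or quantized versions of a model. These derived works reinforce the value of evaluation data as a record of a model's development, and give the community a complete analysis path from the leaderboard overview down to per-task detail.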
The content above was collected and summarized by 遇见数据集.