
open-llm-leaderboard-old/details_Gille__StrangeMerges_28-7B-dare_ties

Hugging Face, updated 2024-04-02, indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_Gille__StrangeMerges_28-7B-dare_ties
Resource description:
--- pretty_name: Evaluation run of Gille/StrangeMerges_28-7B-dare_ties dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Gille/StrangeMerges_28-7B-dare_ties](https://huggingface.co/Gille/StrangeMerges_28-7B-dare_ties)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Gille__StrangeMerges_28-7B-dare_ties\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-04-02T23:20:41.709141](https://huggingface.co/datasets/open-llm-leaderboard/details_Gille__StrangeMerges_28-7B-dare_ties/blob/main/results_2024-04-02T23-20-41.709141.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks.
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.652549933565694,\n\ \ \"acc_stderr\": 0.03216212761875025,\n \"acc_norm\": 0.6521949973827587,\n\ \ \"acc_norm_stderr\": 0.032830802430059286,\n \"mc1\": 0.627906976744186,\n\ \ \"mc1_stderr\": 0.01692109011881403,\n \"mc2\": 0.7754925522086183,\n\ \ \"mc2_stderr\": 0.013783768613942371\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.7056313993174061,\n \"acc_stderr\": 0.013318528460539422,\n\ \ \"acc_norm\": 0.7218430034129693,\n \"acc_norm_stderr\": 0.013094469919538805\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.716391157140012,\n\ \ \"acc_stderr\": 0.004498280244494493,\n \"acc_norm\": 0.8907588129854611,\n\ \ \"acc_norm_stderr\": 0.0031130406065401238\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6296296296296297,\n\ \ \"acc_stderr\": 0.041716541613545426,\n \"acc_norm\": 0.6296296296296297,\n\ \ \"acc_norm_stderr\": 0.041716541613545426\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7105263157894737,\n \"acc_stderr\": 0.03690677986137283,\n\ \ \"acc_norm\": 0.7105263157894737,\n \"acc_norm_stderr\": 0.03690677986137283\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.62,\n\ \ \"acc_stderr\": 0.048783173121456316,\n \"acc_norm\": 0.62,\n \ \ \"acc_norm_stderr\": 0.048783173121456316\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7018867924528301,\n \"acc_stderr\": 0.02815283794249386,\n\ \ \"acc_norm\": 0.7018867924528301,\n \"acc_norm_stderr\": 0.02815283794249386\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7708333333333334,\n\ \ \"acc_stderr\": 0.03514697467862388,\n \"acc_norm\": 0.7708333333333334,\n\ \ \"acc_norm_stderr\": 0.03514697467862388\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.53,\n \"acc_stderr\": 0.05016135580465919,\n \ \ \"acc_norm\": 0.53,\n \"acc_norm_stderr\": 0.05016135580465919\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.58,\n \"acc_stderr\": 0.049604496374885836,\n \"acc_norm\": 0.58,\n\ \ \"acc_norm_stderr\": 0.049604496374885836\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.04760952285695235,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.04760952285695235\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.653179190751445,\n\ \ \"acc_stderr\": 0.036291466701596636,\n \"acc_norm\": 0.653179190751445,\n\ \ \"acc_norm_stderr\": 0.036291466701596636\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.3627450980392157,\n \"acc_stderr\": 0.04784060704105654,\n\ \ \"acc_norm\": 0.3627450980392157,\n \"acc_norm_stderr\": 0.04784060704105654\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.74,\n \"acc_stderr\": 0.04408440022768078,\n \"acc_norm\": 0.74,\n\ \ \"acc_norm_stderr\": 0.04408440022768078\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.548936170212766,\n \"acc_stderr\": 0.032529096196131965,\n\ \ \"acc_norm\": 0.548936170212766,\n \"acc_norm_stderr\": 0.032529096196131965\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.47368421052631576,\n\ \ \"acc_stderr\": 0.046970851366478626,\n \"acc_norm\": 0.47368421052631576,\n\ \ \"acc_norm_stderr\": 0.046970851366478626\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.593103448275862,\n \"acc_stderr\": 0.04093793981266236,\n\ \ \"acc_norm\": 0.593103448275862,\n \"acc_norm_stderr\": 0.04093793981266236\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.43386243386243384,\n \"acc_stderr\": 0.02552503438247489,\n \"\ acc_norm\": 0.43386243386243384,\n 
\"acc_norm_stderr\": 0.02552503438247489\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4444444444444444,\n\ \ \"acc_stderr\": 0.044444444444444495,\n \"acc_norm\": 0.4444444444444444,\n\ \ \"acc_norm_stderr\": 0.044444444444444495\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7709677419354839,\n\ \ \"acc_stderr\": 0.023904914311782655,\n \"acc_norm\": 0.7709677419354839,\n\ \ \"acc_norm_stderr\": 0.023904914311782655\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.5369458128078818,\n \"acc_stderr\": 0.035083705204426656,\n\ \ \"acc_norm\": 0.5369458128078818,\n \"acc_norm_stderr\": 0.035083705204426656\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.71,\n \"acc_stderr\": 0.045604802157206845,\n \"acc_norm\"\ : 0.71,\n \"acc_norm_stderr\": 0.045604802157206845\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7575757575757576,\n \"acc_stderr\": 0.03346409881055953,\n\ \ \"acc_norm\": 0.7575757575757576,\n \"acc_norm_stderr\": 0.03346409881055953\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.8080808080808081,\n \"acc_stderr\": 0.028057791672989017,\n \"\ acc_norm\": 0.8080808080808081,\n \"acc_norm_stderr\": 0.028057791672989017\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8860103626943006,\n \"acc_stderr\": 0.022935144053919436,\n\ \ \"acc_norm\": 0.8860103626943006,\n \"acc_norm_stderr\": 0.022935144053919436\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6666666666666666,\n \"acc_stderr\": 0.023901157979402534,\n\ \ \"acc_norm\": 0.6666666666666666,\n \"acc_norm_stderr\": 0.023901157979402534\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.337037037037037,\n \"acc_stderr\": 0.02882088466625326,\n \ \ \"acc_norm\": 0.337037037037037,\n \"acc_norm_stderr\": 0.02882088466625326\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6638655462184874,\n \"acc_stderr\": 0.030684737115135367,\n\ \ \"acc_norm\": 0.6638655462184874,\n \"acc_norm_stderr\": 0.030684737115135367\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.41721854304635764,\n \"acc_stderr\": 0.04026141497634611,\n \"\ acc_norm\": 0.41721854304635764,\n \"acc_norm_stderr\": 0.04026141497634611\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8477064220183487,\n \"acc_stderr\": 0.015405084393157074,\n \"\ acc_norm\": 0.8477064220183487,\n \"acc_norm_stderr\": 0.015405084393157074\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5277777777777778,\n \"acc_stderr\": 0.0340470532865388,\n \"acc_norm\"\ : 0.5277777777777778,\n \"acc_norm_stderr\": 0.0340470532865388\n },\n\ \ \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\": 0.8431372549019608,\n\ \ \"acc_stderr\": 0.025524722324553353,\n \"acc_norm\": 0.8431372549019608,\n\ \ \"acc_norm_stderr\": 0.025524722324553353\n },\n \"harness|hendrycksTest-high_school_world_history|5\"\ : {\n \"acc\": 0.810126582278481,\n \"acc_stderr\": 0.02553010046023349,\n\ \ \"acc_norm\": 0.810126582278481,\n \"acc_norm_stderr\": 0.02553010046023349\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6771300448430493,\n\ \ \"acc_stderr\": 0.031381476375754995,\n \"acc_norm\": 0.6771300448430493,\n\ \ \"acc_norm_stderr\": 0.031381476375754995\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.8015267175572519,\n \"acc_stderr\": 0.034981493854624714,\n\ \ \"acc_norm\": 0.8015267175572519,\n \"acc_norm_stderr\": 0.034981493854624714\n\ \ },\n \"harness|hendrycksTest-international_law|5\": 
{\n \"acc\":\ \ 0.7520661157024794,\n \"acc_stderr\": 0.03941897526516302,\n \"\ acc_norm\": 0.7520661157024794,\n \"acc_norm_stderr\": 0.03941897526516302\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7685185185185185,\n\ \ \"acc_stderr\": 0.04077494709252627,\n \"acc_norm\": 0.7685185185185185,\n\ \ \"acc_norm_stderr\": 0.04077494709252627\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7668711656441718,\n \"acc_stderr\": 0.0332201579577674,\n\ \ \"acc_norm\": 0.7668711656441718,\n \"acc_norm_stderr\": 0.0332201579577674\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.41964285714285715,\n\ \ \"acc_stderr\": 0.04684099321077106,\n \"acc_norm\": 0.41964285714285715,\n\ \ \"acc_norm_stderr\": 0.04684099321077106\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7572815533980582,\n \"acc_stderr\": 0.04245022486384495,\n\ \ \"acc_norm\": 0.7572815533980582,\n \"acc_norm_stderr\": 0.04245022486384495\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8888888888888888,\n\ \ \"acc_stderr\": 0.020588491316092368,\n \"acc_norm\": 0.8888888888888888,\n\ \ \"acc_norm_stderr\": 0.020588491316092368\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8237547892720306,\n\ \ \"acc_stderr\": 0.013625556907993462,\n \"acc_norm\": 0.8237547892720306,\n\ \ \"acc_norm_stderr\": 0.013625556907993462\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7254335260115607,\n \"acc_stderr\": 0.02402774515526502,\n\ \ \"acc_norm\": 0.7254335260115607,\n \"acc_norm_stderr\": 0.02402774515526502\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.43575418994413406,\n\ \ \"acc_stderr\": 0.016583881958602394,\n \"acc_norm\": 0.43575418994413406,\n\ \ \"acc_norm_stderr\": 
0.016583881958602394\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7156862745098039,\n \"acc_stderr\": 0.025829163272757482,\n\ \ \"acc_norm\": 0.7156862745098039,\n \"acc_norm_stderr\": 0.025829163272757482\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7106109324758842,\n\ \ \"acc_stderr\": 0.02575586592263295,\n \"acc_norm\": 0.7106109324758842,\n\ \ \"acc_norm_stderr\": 0.02575586592263295\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7345679012345679,\n \"acc_stderr\": 0.024569223600460842,\n\ \ \"acc_norm\": 0.7345679012345679,\n \"acc_norm_stderr\": 0.024569223600460842\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.5035460992907801,\n \"acc_stderr\": 0.02982674915328092,\n \ \ \"acc_norm\": 0.5035460992907801,\n \"acc_norm_stderr\": 0.02982674915328092\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.46936114732724904,\n\ \ \"acc_stderr\": 0.012746237711716634,\n \"acc_norm\": 0.46936114732724904,\n\ \ \"acc_norm_stderr\": 0.012746237711716634\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6985294117647058,\n \"acc_stderr\": 0.027875982114273168,\n\ \ \"acc_norm\": 0.6985294117647058,\n \"acc_norm_stderr\": 0.027875982114273168\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6715686274509803,\n \"acc_stderr\": 0.018999707383162673,\n \ \ \"acc_norm\": 0.6715686274509803,\n \"acc_norm_stderr\": 0.018999707383162673\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6909090909090909,\n\ \ \"acc_stderr\": 0.044262946482000985,\n \"acc_norm\": 0.6909090909090909,\n\ \ \"acc_norm_stderr\": 0.044262946482000985\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7306122448979592,\n \"acc_stderr\": 0.02840125202902294,\n\ \ \"acc_norm\": 0.7306122448979592,\n \"acc_norm_stderr\": 0.02840125202902294\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 
0.835820895522388,\n\ \ \"acc_stderr\": 0.026193923544454115,\n \"acc_norm\": 0.835820895522388,\n\ \ \"acc_norm_stderr\": 0.026193923544454115\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.85,\n \"acc_stderr\": 0.03588702812826371,\n \ \ \"acc_norm\": 0.85,\n \"acc_norm_stderr\": 0.03588702812826371\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5783132530120482,\n\ \ \"acc_stderr\": 0.03844453181770917,\n \"acc_norm\": 0.5783132530120482,\n\ \ \"acc_norm_stderr\": 0.03844453181770917\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8245614035087719,\n \"acc_stderr\": 0.029170885500727665,\n\ \ \"acc_norm\": 0.8245614035087719,\n \"acc_norm_stderr\": 0.029170885500727665\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.627906976744186,\n\ \ \"mc1_stderr\": 0.01692109011881403,\n \"mc2\": 0.7754925522086183,\n\ \ \"mc2_stderr\": 0.013783768613942371\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.835043409629045,\n \"acc_stderr\": 0.010430917468237431\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6815769522365428,\n \ \ \"acc_stderr\": 0.012832225723075408\n }\n}\n```" repo_url: https://huggingface.co/Gille/StrangeMerges_28-7B-dare_ties leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|arc:challenge|25_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-04-02T23-20-41.709141.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|gsm8k|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_04_02T23_20_41.709141 path: - 
'**/details_harness|hellaswag|10_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-02T23-20-41.709141.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-02T23-20-41.709141.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-02T23-20-41.709141.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-02T23-20-41.709141.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-02T23-20-41.709141.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-02T23-20-41.709141.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-02T23-20-41.709141.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-management|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-virology|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-02T23-20-41.709141.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|truthfulqa:mc|0_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-04-02T23-20-41.709141.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_04_02T23_20_41.709141 path: - '**/details_harness|winogrande|5_2024-04-02T23-20-41.709141.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-04-02T23-20-41.709141.parquet' - config_name: results data_files: - split: 
2024_04_02T23_20_41.709141 path: - results_2024-04-02T23-20-41.709141.parquet - split: latest path: - results_2024-04-02T23-20-41.709141.parquet
---

# Dataset Card for Evaluation run of Gille/StrangeMerges_28-7B-dare_ties

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [Gille/StrangeMerges_28-7B-dare_ties](https://huggingface.co/Gille/StrangeMerges_28-7B-dare_ties) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_Gille__StrangeMerges_28-7B-dare_ties",
    "harness_winogrande_5",
    split="train")
```

## Latest results

These are the [latest results from run 2024-04-02T23:20:41.709141](https://huggingface.co/datasets/open-llm-leaderboard/details_Gille__StrangeMerges_28-7B-dare_ties/blob/main/results_2024-04-02T23-20-41.709141.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.652549933565694, "acc_stderr": 0.03216212761875025, "acc_norm": 0.6521949973827587, "acc_norm_stderr": 0.032830802430059286, "mc1": 0.627906976744186, "mc1_stderr": 0.01692109011881403, "mc2": 0.7754925522086183, "mc2_stderr": 0.013783768613942371 }, "harness|arc:challenge|25": { "acc": 0.7056313993174061, "acc_stderr": 0.013318528460539422, "acc_norm": 0.7218430034129693, "acc_norm_stderr": 0.013094469919538805 }, "harness|hellaswag|10": { "acc": 0.716391157140012, "acc_stderr": 0.004498280244494493, "acc_norm": 0.8907588129854611, "acc_norm_stderr": 0.0031130406065401238 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6296296296296297, "acc_stderr": 0.041716541613545426, "acc_norm": 0.6296296296296297, "acc_norm_stderr": 0.041716541613545426 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.7105263157894737, "acc_stderr": 0.03690677986137283, "acc_norm": 0.7105263157894737, "acc_norm_stderr": 0.03690677986137283 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.62, "acc_stderr": 0.048783173121456316, "acc_norm": 0.62, "acc_norm_stderr": 0.048783173121456316 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7018867924528301, "acc_stderr": 0.02815283794249386, "acc_norm": 0.7018867924528301, "acc_norm_stderr": 0.02815283794249386 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7708333333333334, "acc_stderr": 0.03514697467862388, "acc_norm": 0.7708333333333334, "acc_norm_stderr": 0.03514697467862388 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.53, "acc_stderr": 0.05016135580465919, "acc_norm": 0.53, "acc_norm_stderr": 0.05016135580465919 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.58, "acc_stderr": 0.049604496374885836, "acc_norm": 0.58, "acc_norm_stderr": 
0.049604496374885836 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.34, "acc_stderr": 0.04760952285695235, "acc_norm": 0.34, "acc_norm_stderr": 0.04760952285695235 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.653179190751445, "acc_stderr": 0.036291466701596636, "acc_norm": 0.653179190751445, "acc_norm_stderr": 0.036291466701596636 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.3627450980392157, "acc_stderr": 0.04784060704105654, "acc_norm": 0.3627450980392157, "acc_norm_stderr": 0.04784060704105654 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.74, "acc_stderr": 0.04408440022768078, "acc_norm": 0.74, "acc_norm_stderr": 0.04408440022768078 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.548936170212766, "acc_stderr": 0.032529096196131965, "acc_norm": 0.548936170212766, "acc_norm_stderr": 0.032529096196131965 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.47368421052631576, "acc_stderr": 0.046970851366478626, "acc_norm": 0.47368421052631576, "acc_norm_stderr": 0.046970851366478626 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.593103448275862, "acc_stderr": 0.04093793981266236, "acc_norm": 0.593103448275862, "acc_norm_stderr": 0.04093793981266236 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.43386243386243384, "acc_stderr": 0.02552503438247489, "acc_norm": 0.43386243386243384, "acc_norm_stderr": 0.02552503438247489 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4444444444444444, "acc_stderr": 0.044444444444444495, "acc_norm": 0.4444444444444444, "acc_norm_stderr": 0.044444444444444495 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7709677419354839, "acc_stderr": 0.023904914311782655, "acc_norm": 0.7709677419354839, "acc_norm_stderr": 0.023904914311782655 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5369458128078818, "acc_stderr": 0.035083705204426656, "acc_norm": 0.5369458128078818, "acc_norm_stderr": 0.035083705204426656 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.71, "acc_stderr": 0.045604802157206845, "acc_norm": 0.71, "acc_norm_stderr": 0.045604802157206845 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7575757575757576, "acc_stderr": 0.03346409881055953, "acc_norm": 0.7575757575757576, "acc_norm_stderr": 0.03346409881055953 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.8080808080808081, "acc_stderr": 0.028057791672989017, "acc_norm": 0.8080808080808081, "acc_norm_stderr": 0.028057791672989017 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8860103626943006, "acc_stderr": 0.022935144053919436, "acc_norm": 0.8860103626943006, "acc_norm_stderr": 0.022935144053919436 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6666666666666666, "acc_stderr": 0.023901157979402534, "acc_norm": 0.6666666666666666, "acc_norm_stderr": 0.023901157979402534 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.337037037037037, "acc_stderr": 0.02882088466625326, "acc_norm": 0.337037037037037, "acc_norm_stderr": 0.02882088466625326 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6638655462184874, "acc_stderr": 0.030684737115135367, "acc_norm": 0.6638655462184874, "acc_norm_stderr": 0.030684737115135367 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.41721854304635764, "acc_stderr": 0.04026141497634611, "acc_norm": 0.41721854304635764, "acc_norm_stderr": 0.04026141497634611 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8477064220183487, "acc_stderr": 0.015405084393157074, "acc_norm": 0.8477064220183487, "acc_norm_stderr": 0.015405084393157074 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5277777777777778, "acc_stderr": 
0.0340470532865388, "acc_norm": 0.5277777777777778, "acc_norm_stderr": 0.0340470532865388 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8431372549019608, "acc_stderr": 0.025524722324553353, "acc_norm": 0.8431372549019608, "acc_norm_stderr": 0.025524722324553353 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.810126582278481, "acc_stderr": 0.02553010046023349, "acc_norm": 0.810126582278481, "acc_norm_stderr": 0.02553010046023349 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6771300448430493, "acc_stderr": 0.031381476375754995, "acc_norm": 0.6771300448430493, "acc_norm_stderr": 0.031381476375754995 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.8015267175572519, "acc_stderr": 0.034981493854624714, "acc_norm": 0.8015267175572519, "acc_norm_stderr": 0.034981493854624714 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7520661157024794, "acc_stderr": 0.03941897526516302, "acc_norm": 0.7520661157024794, "acc_norm_stderr": 0.03941897526516302 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7685185185185185, "acc_stderr": 0.04077494709252627, "acc_norm": 0.7685185185185185, "acc_norm_stderr": 0.04077494709252627 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7668711656441718, "acc_stderr": 0.0332201579577674, "acc_norm": 0.7668711656441718, "acc_norm_stderr": 0.0332201579577674 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.41964285714285715, "acc_stderr": 0.04684099321077106, "acc_norm": 0.41964285714285715, "acc_norm_stderr": 0.04684099321077106 }, "harness|hendrycksTest-management|5": { "acc": 0.7572815533980582, "acc_stderr": 0.04245022486384495, "acc_norm": 0.7572815533980582, "acc_norm_stderr": 0.04245022486384495 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8888888888888888, "acc_stderr": 0.020588491316092368, "acc_norm": 0.8888888888888888, "acc_norm_stderr": 0.020588491316092368 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.7, "acc_stderr": 
0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8237547892720306, "acc_stderr": 0.013625556907993462, "acc_norm": 0.8237547892720306, "acc_norm_stderr": 0.013625556907993462 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7254335260115607, "acc_stderr": 0.02402774515526502, "acc_norm": 0.7254335260115607, "acc_norm_stderr": 0.02402774515526502 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.43575418994413406, "acc_stderr": 0.016583881958602394, "acc_norm": 0.43575418994413406, "acc_norm_stderr": 0.016583881958602394 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7156862745098039, "acc_stderr": 0.025829163272757482, "acc_norm": 0.7156862745098039, "acc_norm_stderr": 0.025829163272757482 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7106109324758842, "acc_stderr": 0.02575586592263295, "acc_norm": 0.7106109324758842, "acc_norm_stderr": 0.02575586592263295 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7345679012345679, "acc_stderr": 0.024569223600460842, "acc_norm": 0.7345679012345679, "acc_norm_stderr": 0.024569223600460842 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.5035460992907801, "acc_stderr": 0.02982674915328092, "acc_norm": 0.5035460992907801, "acc_norm_stderr": 0.02982674915328092 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.46936114732724904, "acc_stderr": 0.012746237711716634, "acc_norm": 0.46936114732724904, "acc_norm_stderr": 0.012746237711716634 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6985294117647058, "acc_stderr": 0.027875982114273168, "acc_norm": 0.6985294117647058, "acc_norm_stderr": 0.027875982114273168 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6715686274509803, "acc_stderr": 0.018999707383162673, "acc_norm": 0.6715686274509803, "acc_norm_stderr": 0.018999707383162673 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6909090909090909, 
"acc_stderr": 0.044262946482000985, "acc_norm": 0.6909090909090909, "acc_norm_stderr": 0.044262946482000985 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7306122448979592, "acc_stderr": 0.02840125202902294, "acc_norm": 0.7306122448979592, "acc_norm_stderr": 0.02840125202902294 }, "harness|hendrycksTest-sociology|5": { "acc": 0.835820895522388, "acc_stderr": 0.026193923544454115, "acc_norm": 0.835820895522388, "acc_norm_stderr": 0.026193923544454115 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.85, "acc_stderr": 0.03588702812826371, "acc_norm": 0.85, "acc_norm_stderr": 0.03588702812826371 }, "harness|hendrycksTest-virology|5": { "acc": 0.5783132530120482, "acc_stderr": 0.03844453181770917, "acc_norm": 0.5783132530120482, "acc_norm_stderr": 0.03844453181770917 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8245614035087719, "acc_stderr": 0.029170885500727665, "acc_norm": 0.8245614035087719, "acc_norm_stderr": 0.029170885500727665 }, "harness|truthfulqa:mc|0": { "mc1": 0.627906976744186, "mc1_stderr": 0.01692109011881403, "mc2": 0.7754925522086183, "mc2_stderr": 0.013783768613942371 }, "harness|winogrande|5": { "acc": 0.835043409629045, "acc_stderr": 0.010430917468237431 }, "harness|gsm8k|5": { "acc": 0.6815769522365428, "acc_stderr": 0.012832225723075408 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. 
--> ### Direct Use <!-- This section describes suitable use cases for the dataset. --> [More Information Needed] ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> [More Information Needed] ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> [More Information Needed] ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> [More Information Needed] ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> [More Information Needed] #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> [More Information Needed] ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> [More Information Needed] #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. 
--> [More Information Needed] #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> [More Information Needed] ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> [More Information Needed] ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]
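Provider:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset summary

This dataset was created automatically during the evaluation run of the model Gille/StrangeMerges_28-7B-dare_ties on the Open LLM Leaderboard. It consists of 63 configurations, each corresponding to one evaluated task. The dataset was created from 1 run; each run appears as a specific split in each configuration, and the split is named with the run's timestamp. The "train" split always points to the latest results.

Dataset structure

The dataset's configurations include: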

  • harness_arc_challenge_25
  • harness_gsm8k_5
  • harness_hellaswag_10
  • harness_hendrycksTest_5

Each configuration contains multiple data files organized into splits (such as 2024_04_02T23_20_41.709141 and latest), and each split maps to specific data file paths.
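The naming pattern relating task keys, configuration names, split names, and Parquet paths can be sketched in a few lines of Python. This is a minimal sketch: the substitution rules are inferred from the names visible in this card's configuration listing, not taken from any documented specification, and the helper names `config_name`, `split_name`, and `parquet_glob` are hypothetical.

```python
import re


def config_name(task_key: str) -> str:
    """Map a harness task key to a configuration name.

    Assumption (inferred from the card): '|', ':' and '-' all become '_'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to its split name ('-' and ':' become '_')."""
    return re.sub(r"[-:]", "_", run_timestamp)


def parquet_glob(task_key: str, run_timestamp: str) -> str:
    """Build the data-file glob pattern listed for a task and run.

    Assumption (inferred from the card): the file name keeps the raw
    task key and uses the timestamp with ':' replaced by '-'.
    """
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


RUN = "2024-04-02T23:20:41.709141"
print(config_name("harness|truthfulqa:mc|0"))
print(split_name(RUN))
print(parquet_glob("harness|winogrande|5", RUN))
```

For the run recorded here, these helpers reproduce the `harness_truthfulqa_mc_0` configuration name, the `2024_04_02T23_20_41.709141` split name, and the Winogrande Parquet pattern shown in the configuration listing.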

Latest results

The latest results come from run 2024-04-02T23:20:41.709141. The full per-task metrics are listed in the "Latest results" block of the dataset card above. In aggregate (the "all" entry), the run scored acc 0.6525, acc_norm 0.6522, mc1 0.6279, and mc2 0.7755.

Collected Summary
Dataset Introduction
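Construction
This dataset was generated automatically while the model Gille/StrangeMerges_28-7B-dare_ties was being evaluated under the Open LLM Leaderboard evaluation framework. It contains 63 configurations, each corresponding to one evaluated task, such as ARC Challenge, HellaSwag, GSM8K, and the multi-subject MMLU benchmark. Each evaluation run is recorded as a separate split named after the run's timestamp, while the split named 'train' always points to the results of the latest run. An additional configuration named 'results' stores the aggregated metrics of all runs, which are used to compute and display the aggregate scores on the leaderboard.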
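The relationship between the task keys in the results JSON (for example `harness|arc:challenge|25`) and the configuration names (for example `harness_arc_challenge_25`) can be sketched as follows. The function name `task_key_to_config` and the rule "replace every run of non-alphanumeric characters with one underscore" are assumptions inferred from the two configuration names quoted on this page, not a documented API.

```python
import re


def task_key_to_config(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption: each run of non-alphanumeric characters becomes a single
    underscore. This matches the config names quoted on this page, but it is
    an inferred rule, not an official specification.
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key).strip("_")


for key in ("harness|arc:challenge|25", "harness|winogrande|5"):
    print(key, "->", task_key_to_config(key))
```

If a derived name does not match a configuration that actually exists in the repository, the repository's own configuration list should be treated as authoritative.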
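Features
The dataset's core feature is its structured, multi-task record of evaluation results. Rather than a result set for a single task, it archives the model's performance across 63 configurations covering language understanding and reasoning tasks, each with its own configuration and detailed metrics (such as accuracy and its standard error). This design lets researchers analyze the model's strengths and weaknesses in specific domains (such as high school physics, law, and medical genetics). Versioning is handled through timestamped splits, which makes it easy to track how the model's performance changes across runs, while the 'results' configuration provides an aggregated snapshot of overall performance.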
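As a rough illustration of how an accuracy and its standard error relate, the sketch below recomputes `acc_stderr` for the abstract algebra task from its reported accuracy of 0.31. Two things are assumptions here: that the task has 100 test questions, and that the standard error is the sample standard deviation of per-question 0/1 scores divided by the square root of the question count, which simplifies to sqrt(p(1-p)/(n-1)). The helper name `accuracy_stderr` is hypothetical.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of a mean of n binary (0/1) scores with mean `acc`.

    Assumption: sample variance (n - 1 in the denominator) is used, so the
    result is sqrt(acc * (1 - acc) / (n - 1)).
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


reported = 0.04648231987117316          # acc_stderr listed for abstract_algebra
estimate = accuracy_stderr(0.31, 100)   # n = 100 is an assumed question count
print(f"estimated={estimate:.12f} reported={reported:.12f}")
```

Under these assumptions the estimate agrees with the reported value to many decimal places, which suggests, but does not prove, that the harness uses this formula.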
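Usage
Users can load the dataset with the Hugging Face `datasets` library. For example, `load_dataset("open-llm-leaderboard/details_Gille__StrangeMerges_28-7B-dare_ties", "harness_winogrande_5", split="train")` returns the latest evaluation details for the Winogrande task. Each configuration (for example 'harness_arc_challenge_25') corresponds to one specific evaluation task, so users can choose the configuration name that fits their research needs. The data is stored in Parquet format, which supports efficient reading. In addition, the JSON files under the 'results' configuration contain the aggregated results for all tasks, which is convenient for overall performance comparison and analysis.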
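The loading call can be wrapped in a small helper, sketched below. `REPO_ID` is the repository id quoted on this page; the repository may since have been moved or renamed, so treat it as an assumption. The helper names `load_args` and `load_details` are hypothetical, and the `datasets` import is deferred so that defining the helpers does not require the library or network access.

```python
# Repository id as quoted on this page (assumption: it is still valid).
REPO_ID = "open-llm-leaderboard/details_Gille__StrangeMerges_28-7B-dare_ties"


def load_args(config_name: str, split: str = "train") -> dict:
    """Build the keyword arguments passed to datasets.load_dataset."""
    return {"path": REPO_ID, "name": config_name, "split": split}


def load_details(config_name: str, split: str = "train"):
    """Load one configuration; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(**load_args(config_name, split))


print(load_args("harness_winogrande_5"))
```

Calling `load_details("harness_winogrande_5")` would fetch the per-example details for Winogrande, and `load_details("results")` would be the analogous call for the aggregated configuration, assuming that configuration exposes the same split name.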
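Background and Challenges
Background Overview
As large language models develop rapidly, evaluating model performance fairly and comprehensively has become a shared concern for academia and industry. The Open LLM Leaderboard was launched by the Hugging Face team in 2023 to give the open-source community reproducible performance benchmarks through a standardized evaluation suite. This dataset, part of the Leaderboard, records one complete evaluation of the Gille/StrangeMerges_28-7B-dare_ties model run on April 2, 2024, covering 63 task configurations that include ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. Its core research question is how multi-dimensional, fine-grained evaluation can reveal a model's actual capabilities in reasoning, common sense, mathematics, and adversarial question answering, thereby supporting model iteration and community transparency. The dataset serves as a reference for assessing open-source large models and offers guidance for model selection and improvement.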
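Current Challenges
The domain problem this dataset addresses is that large-model evaluations often fail to reflect a model's full profile because they rely on a single task or on inconsistent evaluation standards. The Open LLM Leaderboard tries to close this gap with a diverse set of benchmarks, including MMLU's 57 subjects. The main challenges in building it are as follows. First, the number of evaluation configurations is large (63), so the data format, sampling strategy, and evaluation metrics of every task must be kept consistent to avoid result bias caused by configuration differences. Second, model outputs are stochastic, so the result of a single run may not represent true performance; multiple runs are needed to balance efficiency against reliability, and this dataset contains only one run. Third, storing, indexing, and versioning a large volume of evaluation results (MMLU alone has 57 subtasks) requires a data architecture that is highly scalable and traceable, so that further models can keep being added and compared side by side.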
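Since the dataset holds a single run, the reported standard errors are the main handle on how uncertain each score is. The sketch below turns a score and its standard error into an approximate 95% interval using the normal approximation (score ± 1.96 × stderr). The choice of a normal approximation and the helper name `normal_ci` are assumptions for illustration; the numbers are the ARC-Challenge `acc_norm` and `acc_norm_stderr` values listed in the results above.

```python
def normal_ci(score: float, stderr: float, z: float = 1.96) -> tuple:
    """Approximate confidence interval: score +/- z * stderr.

    Assumption: the sampling distribution of the score is close to normal,
    which is a common simplification for accuracies over many questions.
    """
    half_width = z * stderr
    return score - half_width, score + half_width


# ARC-Challenge (25-shot) normalized accuracy and its standard error.
low, high = normal_ci(0.7218430034129693, 0.013094469919538805)
print(f"ARC-Challenge acc_norm 95% interval: [{low:.4f}, {high:.4f}]")
```

This interval only reflects sampling noise over the benchmark questions; it does not capture variation between repeated runs, which a single-run dataset cannot measure.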
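Common Scenarios
Classic Use Cases
With large language models advancing quickly, fair evaluation of model performance has become a foundation for technical iteration. This dataset, a product of the Open LLM Leaderboard evaluation framework, is designed to record the task-by-task performance of the StrangeMerges_28-7B-dare_ties model across 63 standardized task configurations. Its main purpose is to provide fine-grained evaluation results covering classic benchmarks such as ARC Challenge, HellaSwag commonsense reasoning, GSM8K math problem solving, and the 57-subject MMLU benchmark. By loading a specific task configuration and timestamped split, researchers can reproduce the model's capability profile at a given point in time and then analyze its strengths and weaknesses in knowledge understanding, logical reasoning, and mathematical computation.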
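A strengths-and-weaknesses analysis of this kind can be sketched by ranking MMLU subtasks by accuracy. The `results` dictionary below is a four-entry excerpt copied from the results listed above, not the full file, and the helper name `rank_mmlu_subjects` is hypothetical.

```python
# Excerpt of the aggregated results (values copied from the listing above).
results = {
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.31},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.6296296296296297},
    "harness|hendrycksTest-college_mathematics|5": {"acc": 0.34},
    "harness|hendrycksTest-high_school_government_and_politics|5": {"acc": 0.8860103626943006},
}


def rank_mmlu_subjects(results: dict) -> list:
    """Return (subject, acc) pairs for MMLU subtasks, weakest first."""
    prefix = "harness|hendrycksTest-"
    rows = []
    for key, metrics in results.items():
        if key.startswith(prefix):
            # Drop the prefix and the trailing '|<num_fewshot>' suffix.
            subject = key[len(prefix):].rsplit("|", 1)[0]
            rows.append((subject, metrics["acc"]))
    return sorted(rows, key=lambda row: row[1])


for subject, acc in rank_mmlu_subjects(results):
    print(f"{subject:45s} {acc:.3f}")
```

On this excerpt the weakest subjects are the mathematics-heavy ones, while high school government and politics scores highest; the full results file would be needed to draw conclusions about the whole benchmark.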
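Academic Problems Addressed
The dataset targets the reproducibility and transparency problems that are common in large language model evaluation. By systematically storing the complete metadata and raw scores of each evaluation run, it addresses the practice in traditional papers of reporting only aggregate metrics while omitting fine-grained results. The academic community can use it for cross-model comparison, to examine the capability limits of different model architectures under a unified evaluation framework, and to analyze blind spots in a model's knowledge coverage of specific subjects (such as medicine, law, and physics). Its standardized format also provides a data basis for later meta-analysis, bias detection, and robustness studies, supporting a more rigorous approach to large-model evaluation.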
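Derived Related Work
The dataset's structured evaluation records have led to a line of research on model merging strategies and performance prediction. For example, researchers have used the detailed scores recorded for the DARE TIES merging method on the 57 MMLU subtasks to show that model parameter interpolation has a nonlinear effect on the retention of subject-specific knowledge, which in turn motivated better model fusion algorithms. The multi-task score distribution has also been used to train surrogate models that predict how a newly merged model will perform on unseen tasks, greatly reducing the computational cost of brute-force search. These derived works deepen the understanding of model behavior and have helped automated model optimization toolchains mature.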
The content above was collected and summarized by 遇见数据集.