
open-llm-leaderboard-old/details_cloudyu__Mixtral_7Bx4_MOE_24B

Hugging Face | Updated 2023-12-23 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_cloudyu__Mixtral_7Bx4_MOE_24B
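
The card that follows keys its results as `harness|<task>|<n_shots>`, while its configurations and splits use underscore-separated names. The sketch below shows one way to derive the latter from the former. The helper names `config_name` and `split_name` are hypothetical, and the substitution rule is only inferred from the names visible in this card (for example `harness|arc:challenge|25` versus `harness_arc_challenge_25`), not taken from any documented API.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a results key such as 'harness|arc:challenge|25' into a config name.

    Assumption: every character other than a letter, digit, or underscore
    is replaced by '_', which matches the config names listed in this card.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the per-run split name used in this card.

    Assumption: dashes and colons become underscores, while the dot before
    the fractional seconds is kept.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")
```

The derived names can then be passed to `datasets.load_dataset(repo_id, config_name(key), split="latest")`, where `repo_id` is whichever repository path currently hosts the details. The configuration list in this card names its splits by timestamp and `latest`, so `latest` is likely the safer choice if `train` is not found.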
Resource summary:
--- pretty_name: Evaluation run of cloudyu/Mixtral_7Bx4_MOE_24B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [cloudyu/Mixtral_7Bx4_MOE_24B](https://huggingface.co/cloudyu/Mixtral_7Bx4_MOE_24B)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_cloudyu__Mixtral_7Bx4_MOE_24B\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-12-23T18:05:51.243288](https://huggingface.co/datasets/open-llm-leaderboard/details_cloudyu__Mixtral_7Bx4_MOE_24B/blob/main/results_2023-12-23T18-05-51.243288.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6322199879229019,\n\ \ \"acc_stderr\": 0.03229738563088343,\n \"acc_norm\": 0.6337436892396372,\n\ \ \"acc_norm_stderr\": 0.03294310301937023,\n \"mc1\": 0.423500611995104,\n\ \ \"mc1_stderr\": 0.017297421448534727,\n \"mc2\": 0.5978275429044729,\n\ \ \"mc2_stderr\": 0.015733742788933292\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6143344709897611,\n \"acc_stderr\": 0.014224250973257187,\n\ \ \"acc_norm\": 0.6535836177474402,\n \"acc_norm_stderr\": 0.013905011180063232\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6683927504481179,\n\ \ \"acc_stderr\": 0.004698285350019217,\n \"acc_norm\": 0.852320254929297,\n\ \ \"acc_norm_stderr\": 0.0035405716545956313\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6148148148148148,\n\ \ \"acc_stderr\": 0.04203921040156279,\n \"acc_norm\": 0.6148148148148148,\n\ \ \"acc_norm_stderr\": 0.04203921040156279\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.6578947368421053,\n \"acc_stderr\": 0.03860731599316092,\n\ \ \"acc_norm\": 0.6578947368421053,\n \"acc_norm_stderr\": 0.03860731599316092\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.59,\n\ \ \"acc_stderr\": 0.049431107042371025,\n \"acc_norm\": 0.59,\n \ \ \"acc_norm_stderr\": 0.049431107042371025\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.6981132075471698,\n \"acc_stderr\": 0.028254200344438655,\n\ \ \"acc_norm\": 0.6981132075471698,\n \"acc_norm_stderr\": 0.028254200344438655\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7361111111111112,\n\ \ \"acc_stderr\": 0.03685651095897532,\n \"acc_norm\": 0.7361111111111112,\n\ \ \"acc_norm_stderr\": 0.03685651095897532\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.43,\n \"acc_stderr\": 0.04975698519562428,\n \ \ \"acc_norm\": 0.43,\n \"acc_norm_stderr\": 0.04975698519562428\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.53,\n \"acc_stderr\": 0.050161355804659205,\n \"acc_norm\": 0.53,\n\ \ \"acc_norm_stderr\": 0.050161355804659205\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6416184971098265,\n\ \ \"acc_stderr\": 0.036563436533531585,\n \"acc_norm\": 0.6416184971098265,\n\ \ \"acc_norm_stderr\": 0.036563436533531585\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.37254901960784315,\n \"acc_stderr\": 0.04810840148082636,\n\ \ \"acc_norm\": 0.37254901960784315,\n \"acc_norm_stderr\": 0.04810840148082636\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.74,\n \"acc_stderr\": 0.04408440022768078,\n \"acc_norm\": 0.74,\n\ \ \"acc_norm_stderr\": 0.04408440022768078\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5787234042553191,\n \"acc_stderr\": 0.03227834510146268,\n\ \ \"acc_norm\": 0.5787234042553191,\n \"acc_norm_stderr\": 0.03227834510146268\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.4824561403508772,\n\ \ \"acc_stderr\": 0.04700708033551038,\n \"acc_norm\": 0.4824561403508772,\n\ \ \"acc_norm_stderr\": 0.04700708033551038\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5241379310344828,\n \"acc_stderr\": 0.0416180850350153,\n\ \ \"acc_norm\": 0.5241379310344828,\n \"acc_norm_stderr\": 0.0416180850350153\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.42063492063492064,\n \"acc_stderr\": 0.02542483508692401,\n \"\ acc_norm\": 0.42063492063492064,\n 
\"acc_norm_stderr\": 0.02542483508692401\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.38095238095238093,\n\ \ \"acc_stderr\": 0.043435254289490965,\n \"acc_norm\": 0.38095238095238093,\n\ \ \"acc_norm_stderr\": 0.043435254289490965\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7645161290322581,\n\ \ \"acc_stderr\": 0.024137632429337714,\n \"acc_norm\": 0.7645161290322581,\n\ \ \"acc_norm_stderr\": 0.024137632429337714\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.4975369458128079,\n \"acc_stderr\": 0.03517945038691063,\n\ \ \"acc_norm\": 0.4975369458128079,\n \"acc_norm_stderr\": 0.03517945038691063\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \"acc_norm\"\ : 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7696969696969697,\n \"acc_stderr\": 0.0328766675860349,\n\ \ \"acc_norm\": 0.7696969696969697,\n \"acc_norm_stderr\": 0.0328766675860349\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7575757575757576,\n \"acc_stderr\": 0.030532892233932022,\n \"\ acc_norm\": 0.7575757575757576,\n \"acc_norm_stderr\": 0.030532892233932022\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8808290155440415,\n \"acc_stderr\": 0.023381935348121437,\n\ \ \"acc_norm\": 0.8808290155440415,\n \"acc_norm_stderr\": 0.023381935348121437\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6435897435897436,\n \"acc_stderr\": 0.02428314052946731,\n \ \ \"acc_norm\": 0.6435897435897436,\n \"acc_norm_stderr\": 0.02428314052946731\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.32222222222222224,\n \"acc_stderr\": 0.02849346509102859,\n \ \ \"acc_norm\": 0.32222222222222224,\n \"acc_norm_stderr\": 0.02849346509102859\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6722689075630253,\n \"acc_stderr\": 0.03048991141767323,\n \ \ \"acc_norm\": 0.6722689075630253,\n \"acc_norm_stderr\": 0.03048991141767323\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.31788079470198677,\n \"acc_stderr\": 0.038020397601079024,\n \"\ acc_norm\": 0.31788079470198677,\n \"acc_norm_stderr\": 0.038020397601079024\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8293577981651377,\n \"acc_stderr\": 0.016129271025099857,\n \"\ acc_norm\": 0.8293577981651377,\n \"acc_norm_stderr\": 0.016129271025099857\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5092592592592593,\n \"acc_stderr\": 0.034093869469927006,\n \"\ acc_norm\": 0.5092592592592593,\n \"acc_norm_stderr\": 0.034093869469927006\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.7990196078431373,\n \"acc_stderr\": 0.028125972265654373,\n \"\ acc_norm\": 0.7990196078431373,\n \"acc_norm_stderr\": 0.028125972265654373\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7890295358649789,\n \"acc_stderr\": 0.02655837250266192,\n \ \ \"acc_norm\": 0.7890295358649789,\n \"acc_norm_stderr\": 0.02655837250266192\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6860986547085202,\n\ \ \"acc_stderr\": 0.031146796482972465,\n \"acc_norm\": 0.6860986547085202,\n\ \ \"acc_norm_stderr\": 0.031146796482972465\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7633587786259542,\n \"acc_stderr\": 0.03727673575596913,\n\ \ \"acc_norm\": 0.7633587786259542,\n \"acc_norm_stderr\": 0.03727673575596913\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.8181818181818182,\n \"acc_stderr\": 0.03520893951097653,\n \"\ acc_norm\": 0.8181818181818182,\n \"acc_norm_stderr\": 0.03520893951097653\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7685185185185185,\n\ \ \"acc_stderr\": 0.04077494709252626,\n \"acc_norm\": 0.7685185185185185,\n\ \ \"acc_norm_stderr\": 0.04077494709252626\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7791411042944786,\n \"acc_stderr\": 0.03259177392742179,\n\ \ \"acc_norm\": 0.7791411042944786,\n \"acc_norm_stderr\": 0.03259177392742179\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.44642857142857145,\n\ \ \"acc_stderr\": 0.04718471485219588,\n \"acc_norm\": 0.44642857142857145,\n\ \ \"acc_norm_stderr\": 0.04718471485219588\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7961165048543689,\n \"acc_stderr\": 0.039891398595317706,\n\ \ \"acc_norm\": 0.7961165048543689,\n \"acc_norm_stderr\": 0.039891398595317706\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8931623931623932,\n\ \ \"acc_stderr\": 0.020237149008990932,\n \"acc_norm\": 0.8931623931623932,\n\ \ \"acc_norm_stderr\": 0.020237149008990932\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.69,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.69,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8288633461047255,\n\ \ \"acc_stderr\": 0.0134682016140663,\n \"acc_norm\": 0.8288633461047255,\n\ \ \"acc_norm_stderr\": 0.0134682016140663\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7167630057803468,\n \"acc_stderr\": 0.024257901705323374,\n\ \ \"acc_norm\": 0.7167630057803468,\n \"acc_norm_stderr\": 0.024257901705323374\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.35977653631284917,\n\ \ \"acc_stderr\": 0.016051419760310267,\n \"acc_norm\": 
0.35977653631284917,\n\ \ \"acc_norm_stderr\": 0.016051419760310267\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.6993464052287581,\n \"acc_stderr\": 0.026256053835718964,\n\ \ \"acc_norm\": 0.6993464052287581,\n \"acc_norm_stderr\": 0.026256053835718964\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.6977491961414791,\n\ \ \"acc_stderr\": 0.02608270069539966,\n \"acc_norm\": 0.6977491961414791,\n\ \ \"acc_norm_stderr\": 0.02608270069539966\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7067901234567902,\n \"acc_stderr\": 0.02532988817190093,\n\ \ \"acc_norm\": 0.7067901234567902,\n \"acc_norm_stderr\": 0.02532988817190093\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.45390070921985815,\n \"acc_stderr\": 0.02970045324729146,\n \ \ \"acc_norm\": 0.45390070921985815,\n \"acc_norm_stderr\": 0.02970045324729146\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.45371577574967403,\n\ \ \"acc_stderr\": 0.012715404841277743,\n \"acc_norm\": 0.45371577574967403,\n\ \ \"acc_norm_stderr\": 0.012715404841277743\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6544117647058824,\n \"acc_stderr\": 0.028888193103988633,\n\ \ \"acc_norm\": 0.6544117647058824,\n \"acc_norm_stderr\": 0.028888193103988633\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6437908496732027,\n \"acc_stderr\": 0.0193733324207245,\n \ \ \"acc_norm\": 0.6437908496732027,\n \"acc_norm_stderr\": 0.0193733324207245\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6727272727272727,\n\ \ \"acc_stderr\": 0.0449429086625209,\n \"acc_norm\": 0.6727272727272727,\n\ \ \"acc_norm_stderr\": 0.0449429086625209\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.746938775510204,\n \"acc_stderr\": 0.027833023871399673,\n\ \ \"acc_norm\": 0.746938775510204,\n \"acc_norm_stderr\": 0.027833023871399673\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.845771144278607,\n\ \ \"acc_stderr\": 0.02553843336857833,\n \"acc_norm\": 0.845771144278607,\n\ \ \"acc_norm_stderr\": 0.02553843336857833\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.85,\n \"acc_stderr\": 0.035887028128263686,\n \ \ \"acc_norm\": 0.85,\n \"acc_norm_stderr\": 0.035887028128263686\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5180722891566265,\n\ \ \"acc_stderr\": 0.03889951252827216,\n \"acc_norm\": 0.5180722891566265,\n\ \ \"acc_norm_stderr\": 0.03889951252827216\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8245614035087719,\n \"acc_stderr\": 0.029170885500727665,\n\ \ \"acc_norm\": 0.8245614035087719,\n \"acc_norm_stderr\": 0.029170885500727665\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.423500611995104,\n\ \ \"mc1_stderr\": 0.017297421448534727,\n \"mc2\": 0.5978275429044729,\n\ \ \"mc2_stderr\": 0.015733742788933292\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7805840568271507,\n \"acc_stderr\": 0.01163126836060778\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6171341925701289,\n \ \ \"acc_stderr\": 0.013389223491820474\n }\n}\n```" repo_url: https://huggingface.co/cloudyu/Mixtral_7Bx4_MOE_24B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|arc:challenge|25_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-12-23T18-05-51.243288.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|gsm8k|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_12_23T18_05_51.243288 path: - 
'**/details_harness|hellaswag|10_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-23T18-05-51.243288.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-management|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-12-23T18-05-51.243288.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-23T18-05-51.243288.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-23T18-05-51.243288.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-management|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-12-23T18-05-51.243288.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-12-23T18-05-51.243288.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-college_physics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-23T18-05-51.243288.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-international_law|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-management|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-marketing|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-sociology|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-virology|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-12-23T18-05-51.243288.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|truthfulqa:mc|0_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-12-23T18-05-51.243288.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_12_23T18_05_51.243288 path: - '**/details_harness|winogrande|5_2023-12-23T18-05-51.243288.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-12-23T18-05-51.243288.parquet' - config_name: results data_files: - split: 
2023_12_23T18_05_51.243288 path: - results_2023-12-23T18-05-51.243288.parquet - split: latest path: - results_2023-12-23T18-05-51.243288.parquet
---

# Dataset Card for Evaluation run of cloudyu/Mixtral_7Bx4_MOE_24B

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [cloudyu/Mixtral_7Bx4_MOE_24B](https://huggingface.co/cloudyu/Mixtral_7Bx4_MOE_24B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# Each configuration in the metadata above defines a timestamped split and a "latest" split.
data = load_dataset("open-llm-leaderboard/details_cloudyu__Mixtral_7Bx4_MOE_24B",
    "harness_winogrande_5",
    split="latest")
```

## Latest results

These are the [latest results from run 2023-12-23T18:05:51.243288](https://huggingface.co/datasets/open-llm-leaderboard/details_cloudyu__Mixtral_7Bx4_MOE_24B/blob/main/results_2023-12-23T18-05-51.243288.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6322199879229019, "acc_stderr": 0.03229738563088343, "acc_norm": 0.6337436892396372, "acc_norm_stderr": 0.03294310301937023, "mc1": 0.423500611995104, "mc1_stderr": 0.017297421448534727, "mc2": 0.5978275429044729, "mc2_stderr": 0.015733742788933292 }, "harness|arc:challenge|25": { "acc": 0.6143344709897611, "acc_stderr": 0.014224250973257187, "acc_norm": 0.6535836177474402, "acc_norm_stderr": 0.013905011180063232 }, "harness|hellaswag|10": { "acc": 0.6683927504481179, "acc_stderr": 0.004698285350019217, "acc_norm": 0.852320254929297, "acc_norm_stderr": 0.0035405716545956313 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6148148148148148, "acc_stderr": 0.04203921040156279, "acc_norm": 0.6148148148148148, "acc_norm_stderr": 0.04203921040156279 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.6578947368421053, "acc_stderr": 0.03860731599316092, "acc_norm": 0.6578947368421053, "acc_norm_stderr": 0.03860731599316092 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.59, "acc_stderr": 0.049431107042371025, "acc_norm": 0.59, "acc_norm_stderr": 0.049431107042371025 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6981132075471698, "acc_stderr": 0.028254200344438655, "acc_norm": 0.6981132075471698, "acc_norm_stderr": 0.028254200344438655 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7361111111111112, "acc_stderr": 0.03685651095897532, "acc_norm": 0.7361111111111112, "acc_norm_stderr": 0.03685651095897532 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.43, "acc_stderr": 0.04975698519562428, "acc_norm": 0.43, "acc_norm_stderr": 0.04975698519562428 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.53, "acc_stderr": 0.050161355804659205, "acc_norm": 0.53, 
"acc_norm_stderr": 0.050161355804659205 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6416184971098265, "acc_stderr": 0.036563436533531585, "acc_norm": 0.6416184971098265, "acc_norm_stderr": 0.036563436533531585 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.37254901960784315, "acc_stderr": 0.04810840148082636, "acc_norm": 0.37254901960784315, "acc_norm_stderr": 0.04810840148082636 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.74, "acc_stderr": 0.04408440022768078, "acc_norm": 0.74, "acc_norm_stderr": 0.04408440022768078 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5787234042553191, "acc_stderr": 0.03227834510146268, "acc_norm": 0.5787234042553191, "acc_norm_stderr": 0.03227834510146268 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.4824561403508772, "acc_stderr": 0.04700708033551038, "acc_norm": 0.4824561403508772, "acc_norm_stderr": 0.04700708033551038 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5241379310344828, "acc_stderr": 0.0416180850350153, "acc_norm": 0.5241379310344828, "acc_norm_stderr": 0.0416180850350153 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.42063492063492064, "acc_stderr": 0.02542483508692401, "acc_norm": 0.42063492063492064, "acc_norm_stderr": 0.02542483508692401 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.38095238095238093, "acc_stderr": 0.043435254289490965, "acc_norm": 0.38095238095238093, "acc_norm_stderr": 0.043435254289490965 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7645161290322581, "acc_stderr": 0.024137632429337714, "acc_norm": 0.7645161290322581, "acc_norm_stderr": 0.024137632429337714 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.4975369458128079, "acc_stderr": 0.03517945038691063, "acc_norm": 0.4975369458128079, "acc_norm_stderr": 0.03517945038691063 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7696969696969697, "acc_stderr": 0.0328766675860349, "acc_norm": 0.7696969696969697, "acc_norm_stderr": 0.0328766675860349 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7575757575757576, "acc_stderr": 0.030532892233932022, "acc_norm": 0.7575757575757576, "acc_norm_stderr": 0.030532892233932022 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8808290155440415, "acc_stderr": 0.023381935348121437, "acc_norm": 0.8808290155440415, "acc_norm_stderr": 0.023381935348121437 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6435897435897436, "acc_stderr": 0.02428314052946731, "acc_norm": 0.6435897435897436, "acc_norm_stderr": 0.02428314052946731 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.32222222222222224, "acc_stderr": 0.02849346509102859, "acc_norm": 0.32222222222222224, "acc_norm_stderr": 0.02849346509102859 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6722689075630253, "acc_stderr": 0.03048991141767323, "acc_norm": 0.6722689075630253, "acc_norm_stderr": 0.03048991141767323 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.31788079470198677, "acc_stderr": 0.038020397601079024, "acc_norm": 0.31788079470198677, "acc_norm_stderr": 0.038020397601079024 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8293577981651377, "acc_stderr": 0.016129271025099857, "acc_norm": 0.8293577981651377, "acc_norm_stderr": 0.016129271025099857 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5092592592592593, "acc_stderr": 
0.034093869469927006, "acc_norm": 0.5092592592592593, "acc_norm_stderr": 0.034093869469927006 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.7990196078431373, "acc_stderr": 0.028125972265654373, "acc_norm": 0.7990196078431373, "acc_norm_stderr": 0.028125972265654373 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7890295358649789, "acc_stderr": 0.02655837250266192, "acc_norm": 0.7890295358649789, "acc_norm_stderr": 0.02655837250266192 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6860986547085202, "acc_stderr": 0.031146796482972465, "acc_norm": 0.6860986547085202, "acc_norm_stderr": 0.031146796482972465 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7633587786259542, "acc_stderr": 0.03727673575596913, "acc_norm": 0.7633587786259542, "acc_norm_stderr": 0.03727673575596913 }, "harness|hendrycksTest-international_law|5": { "acc": 0.8181818181818182, "acc_stderr": 0.03520893951097653, "acc_norm": 0.8181818181818182, "acc_norm_stderr": 0.03520893951097653 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7685185185185185, "acc_stderr": 0.04077494709252626, "acc_norm": 0.7685185185185185, "acc_norm_stderr": 0.04077494709252626 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7791411042944786, "acc_stderr": 0.03259177392742179, "acc_norm": 0.7791411042944786, "acc_norm_stderr": 0.03259177392742179 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.44642857142857145, "acc_stderr": 0.04718471485219588, "acc_norm": 0.44642857142857145, "acc_norm_stderr": 0.04718471485219588 }, "harness|hendrycksTest-management|5": { "acc": 0.7961165048543689, "acc_stderr": 0.039891398595317706, "acc_norm": 0.7961165048543689, "acc_norm_stderr": 0.039891398595317706 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8931623931623932, "acc_stderr": 0.020237149008990932, "acc_norm": 0.8931623931623932, "acc_norm_stderr": 0.020237149008990932 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.69, "acc_stderr": 
0.04648231987117316, "acc_norm": 0.69, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8288633461047255, "acc_stderr": 0.0134682016140663, "acc_norm": 0.8288633461047255, "acc_norm_stderr": 0.0134682016140663 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7167630057803468, "acc_stderr": 0.024257901705323374, "acc_norm": 0.7167630057803468, "acc_norm_stderr": 0.024257901705323374 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.35977653631284917, "acc_stderr": 0.016051419760310267, "acc_norm": 0.35977653631284917, "acc_norm_stderr": 0.016051419760310267 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.6993464052287581, "acc_stderr": 0.026256053835718964, "acc_norm": 0.6993464052287581, "acc_norm_stderr": 0.026256053835718964 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.6977491961414791, "acc_stderr": 0.02608270069539966, "acc_norm": 0.6977491961414791, "acc_norm_stderr": 0.02608270069539966 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7067901234567902, "acc_stderr": 0.02532988817190093, "acc_norm": 0.7067901234567902, "acc_norm_stderr": 0.02532988817190093 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.45390070921985815, "acc_stderr": 0.02970045324729146, "acc_norm": 0.45390070921985815, "acc_norm_stderr": 0.02970045324729146 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.45371577574967403, "acc_stderr": 0.012715404841277743, "acc_norm": 0.45371577574967403, "acc_norm_stderr": 0.012715404841277743 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6544117647058824, "acc_stderr": 0.028888193103988633, "acc_norm": 0.6544117647058824, "acc_norm_stderr": 0.028888193103988633 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6437908496732027, "acc_stderr": 0.0193733324207245, "acc_norm": 0.6437908496732027, "acc_norm_stderr": 0.0193733324207245 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6727272727272727, "acc_stderr": 
0.0449429086625209, "acc_norm": 0.6727272727272727, "acc_norm_stderr": 0.0449429086625209 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.746938775510204, "acc_stderr": 0.027833023871399673, "acc_norm": 0.746938775510204, "acc_norm_stderr": 0.027833023871399673 }, "harness|hendrycksTest-sociology|5": { "acc": 0.845771144278607, "acc_stderr": 0.02553843336857833, "acc_norm": 0.845771144278607, "acc_norm_stderr": 0.02553843336857833 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.85, "acc_stderr": 0.035887028128263686, "acc_norm": 0.85, "acc_norm_stderr": 0.035887028128263686 }, "harness|hendrycksTest-virology|5": { "acc": 0.5180722891566265, "acc_stderr": 0.03889951252827216, "acc_norm": 0.5180722891566265, "acc_norm_stderr": 0.03889951252827216 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8245614035087719, "acc_stderr": 0.029170885500727665, "acc_norm": 0.8245614035087719, "acc_norm_stderr": 0.029170885500727665 }, "harness|truthfulqa:mc|0": { "mc1": 0.423500611995104, "mc1_stderr": 0.017297421448534727, "mc2": 0.5978275429044729, "mc2_stderr": 0.015733742788933292 }, "harness|winogrande|5": { "acc": 0.7805840568271507, "acc_stderr": 0.01163126836060778 }, "harness|gsm8k|5": { "acc": 0.6171341925701289, "acc_stderr": 0.013389223491820474 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. 
-->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard-old
Summary of Original Information

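Dataset Overview

Dataset Summary

This dataset was created automatically during the evaluation run of the model cloudyu/Mixtral_7Bx4_MOE_24B for the Open LLM Leaderboard.

Dataset Composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional "results" configuration stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.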
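
The correspondence between the task keys in the results and the configuration names in the card metadata can be sketched as follows. This is a minimal illustration, not part of the leaderboard tooling: the helper name `task_key_to_config_name` is hypothetical, and the rule it applies (replace every `|`, `:` and `-` with `_`) is an assumption inferred from the configuration names listed in the metadata.

```python
import re


def task_key_to_config_name(task_key: str) -> str:
    """Map a results key such as 'harness|winogrande|5' to a config name.

    Assumption: every '|', ':' and '-' in the task key becomes '_' in the
    configuration name, as the names in the card metadata suggest.
    """
    return re.sub(r"[|:\-]", "_", task_key)


# Keys taken from the results; expected names taken from the card metadata.
print(task_key_to_config_name("harness|winogrande|5"))
# harness_winogrande_5
print(task_key_to_config_name("harness|hendrycksTest-college_physics|5"))
# harness_hendrycksTest_college_physics_5
print(task_key_to_config_name("harness|truthfulqa:mc|0"))
# harness_truthfulqa_mc_0
```

The resulting string is what would be passed as the configuration argument when loading a single task.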

Data Loading Example

    from datasets import load_dataset

    # The configurations define a timestamped split and a "latest" split.
    data = load_dataset("open-llm-leaderboard/details_cloudyu__Mixtral_7Bx4_MOE_24B",
                        "harness_winogrande_5",
                        split="latest")

Latest Results

The latest results come from run 2023-12-23T18:05:51.243288. The full per-task metrics appear under "Latest results" in the dataset card above; the aggregate ("all") values are acc 0.6322199879229019, acc_norm 0.6337436892396372, mc1 0.423500611995104, and mc2 0.5978275429044729.

Collected Summary
Dataset Introduction
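Construction
This dataset was generated automatically while the model cloudyu/Mixtral_7Bx4_MOE_24B was being evaluated on the Open LLM Leaderboard. It consists of 63 configurations, each corresponding to one evaluated task. The data come from a single run. Within every configuration, each run is stored as a split named after the run's timestamp, and the 'train' split always points to the latest results. An additional configuration named 'results' stores the aggregated results of the run, which are used to compute and display the aggregate metrics on the leaderboard.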
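The dataset card identifies its single run by the timestamp 2023-12-23T18:05:51.243288 and links to an aggregated file named results_2023-12-23T18-05-51.243288.json. The sketch below derives that file name from the timestamp. The helper name `results_filename` is hypothetical, and the rule (replace colons with hyphens) is inferred from this one example rather than from documented behavior.

```python
def results_filename(run_timestamp: str) -> str:
    """Build the aggregated-results file name for a run timestamp.

    Assumption: the only transformation is ':' -> '-', as seen in the
    single file name linked from the dataset card.
    """
    return f"results_{run_timestamp.replace(':', '-')}.json"


print(results_filename("2023-12-23T18:05:51.243288"))
# prints: results_2023-12-23T18-05-51.243288.json
```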
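Usage
The data can be loaded with the Hugging Face Datasets library. For example, to load the detailed evaluation results for the WinoGrande task, call load_dataset with the dataset name 'open-llm-leaderboard/details_cloudyu__Mixtral_7Bx4_MOE_24B', the task configuration 'harness_winogrande_5', and the split 'train'. Loading the 'results' configuration returns the aggregated metrics for all tasks, which is convenient for analyzing overall performance.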
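A minimal sketch of the loading call described above, assuming the `datasets` package is installed and the repository is reachable. The helper names `load_args` and `load_details` are hypothetical. The hosting page lists the repository under open-llm-leaderboard-old, so `REPO_ID` may need adjusting if the name used in the card no longer resolves.

```python
# Repository ID exactly as written in the dataset card (may have moved).
REPO_ID = "open-llm-leaderboard/details_cloudyu__Mixtral_7Bx4_MOE_24B"


def load_args(config: str = "harness_winogrande_5", split: str = "train") -> dict:
    """Collect the keyword arguments for datasets.load_dataset."""
    return {"path": REPO_ID, "name": config, "split": split}


def load_details(config: str = "harness_winogrande_5", split: str = "train"):
    """Download one configuration; 'train' points to the latest run."""
    # Imported lazily so the helpers above work without the package.
    from datasets import load_dataset

    return load_dataset(**load_args(config, split))
```

Calling `load_details()` fetches the WinoGrande details, and `load_details("results")` fetches the aggregated metrics. Both calls need network access.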
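Background and Challenges
Background Overview
Evaluation is a key part of progress in large language models (LLMs) and natural language processing. Against this background, the Hugging Face community launched the Open LLM Leaderboard in 2023 to compare open-source models fairly and transparently on standardized benchmarks. This dataset is the automatically generated evaluation record for the Mixtral_7Bx4_MOE_24B model released by cloudyu. Mixtral_7Bx4_MOE_24B uses a mixture-of-experts (MoE) architecture, whose design goal is to approach the quality of a dense model while activating fewer parameters. The dataset records one complete evaluation run from 23 December 2023 and covers 63 task configurations, including ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. Together these show how the model performs on commonsense reasoning, knowledge understanding, mathematical problem solving, and factual consistency. The dataset gives the research community a fine-grained view of the model's performance and serves as a reference for later work on MoE architectures and evaluation standards.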
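Current Challenges
The core challenge this dataset addresses is how to measure, comprehensively and fairly, the overall ability of a mixture-of-experts model across many complex natural language understanding tasks. At the domain level, MoE models have an advantage in inference efficiency, but their sparse activation may cause performance to vary between tasks. For example, on MMLU subsets that require deep reasoning, accuracy is only 30% on abstract algebra and 31% on college mathematics, while on knowledge-intensive tasks such as high school government and politics and marketing it exceeds 88%. This imbalance makes it considerably harder to assess how well the model generalizes. At the construction level, the automatically generated dataset has to handle evaluation results from 63 independent configurations, each with its own evaluation settings (such as the number of few-shot examples) and metrics (such as acc, acc_norm, and mc1). Keeping scoring consistent across tasks and avoiding skew is a further technical challenge. In addition, evaluation results are stored in timestamp-named splits, so researchers must be able to combine data from multiple runs efficiently to track the effect of model iterations, which places high demands on version management and reproducibility.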
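The unevenness across MMLU subsets can be made concrete with a small ranking. The sketch below uses four entries copied from the results listed in this card and sorts them by accuracy. The function names are hypothetical, and the interval is a rough normal approximation (accuracy plus or minus 1.96 standard errors), which is an assumption of this example rather than something the leaderboard reports.

```python
# Four entries copied from the run's results (acc and acc_stderr only).
RESULTS_EXCERPT = {
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.3, "acc_stderr": 0.046056618647183814},
    "harness|hendrycksTest-college_mathematics|5": {
        "acc": 0.31, "acc_stderr": 0.04648231987117316},
    "harness|hendrycksTest-college_biology|5": {
        "acc": 0.7361111111111112, "acc_stderr": 0.03685651095897532},
    "harness|hendrycksTest-high_school_government_and_politics|5": {
        "acc": 0.8808290155440415, "acc_stderr": 0.023381935348121437},
}


def subject(task_key: str) -> str:
    """'harness|hendrycksTest-anatomy|5' -> 'anatomy'."""
    return task_key.split("|")[1].split("-", 1)[1]


def rank_by_accuracy(results: dict) -> list:
    """Return (subject, acc, acc_stderr) tuples, weakest subject first."""
    rows = [(subject(k), v["acc"], v["acc_stderr"]) for k, v in results.items()]
    return sorted(rows, key=lambda row: row[1])


def normal_interval(acc: float, stderr: float, z: float = 1.96) -> tuple:
    """Approximate interval under a normal assumption (illustrative only)."""
    return (acc - z * stderr, acc + z * stderr)


for name, acc, stderr in rank_by_accuracy(RESULTS_EXCERPT):
    low, high = normal_interval(acc, stderr)
    print(f"{name:40s} acc={acc:.3f}  approx. range {low:.3f} to {high:.3f}")
```

Because the standard errors on the small subsets are around 0.05, a difference of a few points between two subjects is within noise, while the gap between abstract algebra and government and politics is not.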
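Common Scenarios
Classic Use Cases
At the intersection of natural language processing and large language model evaluation, this dataset is an Open LLM Leaderboard evaluation record holding detailed performance data for Mixtral_7Bx4_MOE_24B on 63 task configurations. Its classic use is as a standardized multi-task benchmark record for researchers, covering ARC-Challenge, HellaSwag commonsense reasoning, GSM8K math word problems, and the MMLU test spanning 57 subjects, so that a model's reasoning, knowledge, and language understanding can be measured systematically. Loading a specific configuration such as 'harness_winogrande_5' reproduces the model's results on the pronoun disambiguation task and provides the data needed for side-by-side model comparison and capability analysis.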
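Academic Problems Addressed
The main value of this dataset is that it addresses the lack of fine-grained, reproducible benchmarks in large language model evaluation. Traditional evaluation often relies on a single metric or a handful of tasks, which makes it hard to reflect a model's real level on complex cognitive tasks. By building on an evaluation framework (the LM Evaluation Harness), this dataset provides structured results across commonsense reasoning, mathematical reasoning, factual knowledge, and ethical judgment, so researchers can pinpoint where a model is weak and where it is strong. It contributes to standardizing model evaluation and gives quantifiable feedback for later model optimization, architecture changes, and adjustments to training strategy, which has influenced how the open language model community conducts research.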
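Practical Applications
In practice, this dataset supports model selection and deployment decisions. Engineers and product teams can use the model's scores on specific tasks, such as 61.7% accuracy on GSM8K or 85.2% normalized accuracy on HellaSwag, to judge whether the model suits scenarios such as tutoring, customer service, or knowledge question answering. The fine-grained results also help identify how reliable the model is in specific domains (such as law or medicine), which helps avoid potential bias in high-risk applications such as finance and healthcare. The structured evaluation data can also be integrated into an automated model monitoring pipeline to track how model iterations affect downstream tasks and to keep deployed performance stable.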
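One way to use such scores in a selection or monitoring step is a simple threshold check. This is a sketch only: the score keys, the function name `failing_tasks`, and the minimum values are hypothetical choices for illustration, while the three scores themselves are copied from the results listed in this card.

```python
# Scores copied from the run's results; the key names are made up here.
SCORES = {
    "hellaswag_acc_norm": 0.852320254929297,
    "arc_challenge_acc_norm": 0.6535836177474402,
    "mmlu_abstract_algebra_acc": 0.3,
}

# Hypothetical minimum scores that a deployment might require.
MINIMUMS = {
    "hellaswag_acc_norm": 0.80,
    "mmlu_abstract_algebra_acc": 0.50,
}


def failing_tasks(scores: dict, minimums: dict) -> list:
    """Return the tasks whose score is missing or below its minimum."""
    failed = []
    for task, minimum in minimums.items():
        score = scores.get(task)
        if score is None or score < minimum:
            failed.append(task)
    return sorted(failed)


print(failing_tasks(SCORES, MINIMUMS))
# prints: ['mmlu_abstract_algebra_acc']
```

A missing score counts as a failure here, so a task that was skipped in a later run is flagged rather than silently passing.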
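Recent Research on the Dataset
Latest Research Directions
Current frontier work in large language model evaluation focuses on validating the effectiveness of mixture-of-experts (MoE) models and on standardized evaluation. This open-llm-leaderboard-old dataset records the multi-task evaluation results of cloudyu/Mixtral_7Bx4_MOE_24B on the Open LLM Leaderboard, covering ARC-Challenge, HellaSwag, MMLU (57 subject subsets), TruthfulQA, Winogrande, and GSM8K. Through its 63 configurations and a standardized evaluation procedure, the dataset shows how an MoE architecture performs overall on reasoning, commonsense understanding, mathematics, and knowledge question answering, and it provides empirical evidence for comparing sparsely activated models with traditional dense models. The automation and reproducibility of the evaluation framework have encouraged the community to look more closely at the trade-off between parameter efficiency and task generalization in MoE models. In particular, it offers an evaluation reference for the surge of research on MoE models (such as Mixtral 8x7B) since late 2023, and it is useful guidance for designing and deploying efficient large models.
The content above was collected and summarized by 遇见数据集.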