open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1

Source: Hugging Face. Updated: 2024-04-18. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1
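The configuration names, split names, and parquet paths in the metadata for this dataset appear to follow a regular pattern derived from the harness task key and the run timestamp. The sketch below reproduces that pattern as inferred from the listed entries; it is not a documented API, and the helper names `config_name`, `split_name`, and `parquet_glob` are hypothetical.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption inferred from the metadata: every character that is not a
    letter or digit is replaced by an underscore.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the timestamped split name.

    Assumption: '-' and ':' become '_', while the 'T' and the fractional
    seconds are kept unchanged.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_glob(task_key: str, run_timestamp: str) -> str:
    """Build the glob pattern listed under data_files for one task and run.

    Assumption: the file name keeps the raw task key and uses the timestamp
    with ':' replaced by '-'.
    """
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


if __name__ == "__main__":
    run = "2024-04-18T04:08:50.327748"
    for key in ("harness|arc:challenge|25", "harness|hendrycksTest-anatomy|5"):
        print(config_name(key), split_name(run), parquet_glob(key, run))
```

The resulting config and split names are the values one would pass as the second argument and the `split` argument of `datasets.load_dataset`. Aggregate configurations such as `harness_hendrycksTest_5` bundle several tasks and have no single task key, so this mapping does not cover them.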
Resource description:
--- pretty_name: Evaluation run of mistralai/Mixtral-8x22B-v0.1 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can, for instance, do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-04-18T04:08:50.327748](https://huggingface.co/datasets/open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1/blob/main/results_2024-04-18T04-08-50.327748.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.7754391186630896,\n\ \ \"acc_stderr\": 0.027791214665058565,\n \"acc_norm\": 0.7785933169200626,\n\ \ \"acc_norm_stderr\": 0.028326105199808844,\n \"mc1\": 0.3329253365973072,\n\ \ \"mc1_stderr\": 0.016497402382012055,\n \"mc2\": 0.5095160399804991,\n\ \ \"mc2_stderr\": 0.014553872488484169\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6672354948805461,\n \"acc_stderr\": 0.0137698630461923,\n\ \ \"acc_norm\": 0.7064846416382252,\n \"acc_norm_stderr\": 0.013307250444941122\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.7044413463453495,\n\ \ \"acc_stderr\": 0.00455360940574723,\n \"acc_norm\": 0.8873730332603067,\n\ \ \"acc_norm_stderr\": 0.0031549016391045916\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.55,\n \"acc_stderr\": 0.05,\n \"acc_norm\"\ : 0.55,\n \"acc_norm_stderr\": 0.05\n },\n \"harness|hendrycksTest-anatomy|5\"\ : {\n \"acc\": 0.762962962962963,\n \"acc_stderr\": 0.03673731683969506,\n\ \ \"acc_norm\": 0.762962962962963,\n \"acc_norm_stderr\": 0.03673731683969506\n\ \ },\n \"harness|hendrycksTest-astronomy|5\": {\n \"acc\": 0.868421052631579,\n\ \ \"acc_stderr\": 0.027508689533549905,\n \"acc_norm\": 0.868421052631579,\n\ \ \"acc_norm_stderr\": 0.027508689533549905\n },\n \"harness|hendrycksTest-business_ethics|5\"\ : {\n \"acc\": 0.73,\n \"acc_stderr\": 0.044619604333847394,\n \ \ \"acc_norm\": 0.73,\n \"acc_norm_stderr\": 0.044619604333847394\n \ \ },\n \"harness|hendrycksTest-clinical_knowledge|5\": {\n \"acc\":\ \ 0.8264150943396227,\n \"acc_stderr\": 0.02331058302600625,\n \"\ acc_norm\": 0.8264150943396227,\n \"acc_norm_stderr\": 0.02331058302600625\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.8958333333333334,\n\ \ \"acc_stderr\": 0.025545239210256917,\n \"acc_norm\": 0.8958333333333334,\n\ \ \"acc_norm_stderr\": 0.025545239210256917\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.6,\n \"acc_stderr\": 0.049236596391733084,\n \ \ \"acc_norm\": 0.6,\n \"acc_norm_stderr\": 0.049236596391733084\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.71,\n \"acc_stderr\": 0.04560480215720684,\n \"acc_norm\": 0.71,\n\ \ \"acc_norm_stderr\": 0.04560480215720684\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.48,\n \"acc_stderr\": 0.050211673156867795,\n \ \ \"acc_norm\": 0.48,\n \"acc_norm_stderr\": 0.050211673156867795\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.791907514450867,\n\ \ \"acc_stderr\": 0.030952890217749877,\n \"acc_norm\": 0.791907514450867,\n\ \ \"acc_norm_stderr\": 0.030952890217749877\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.5294117647058824,\n \"acc_stderr\": 0.049665709039785295,\n\ \ \"acc_norm\": 0.5294117647058824,\n \"acc_norm_stderr\": 0.049665709039785295\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.83,\n \"acc_stderr\": 0.03775251680686371,\n \"acc_norm\": 0.83,\n\ \ \"acc_norm_stderr\": 0.03775251680686371\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.8170212765957446,\n \"acc_stderr\": 0.02527604100044995,\n\ \ \"acc_norm\": 0.8170212765957446,\n \"acc_norm_stderr\": 0.02527604100044995\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.6754385964912281,\n\ \ \"acc_stderr\": 0.04404556157374768,\n \"acc_norm\": 0.6754385964912281,\n\ \ \"acc_norm_stderr\": 0.04404556157374768\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.7586206896551724,\n \"acc_stderr\": 0.03565998174135302,\n\ \ \"acc_norm\": 0.7586206896551724,\n \"acc_norm_stderr\": 0.03565998174135302\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.6164021164021164,\n \"acc_stderr\": 0.0250437573185202,\n \"acc_norm\"\ : 0.6164021164021164,\n \"acc_norm_stderr\": 
0.0250437573185202\n },\n\ \ \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.6031746031746031,\n\ \ \"acc_stderr\": 0.0437588849272706,\n \"acc_norm\": 0.6031746031746031,\n\ \ \"acc_norm_stderr\": 0.0437588849272706\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.53,\n \"acc_stderr\": 0.050161355804659205,\n \ \ \"acc_norm\": 0.53,\n \"acc_norm_stderr\": 0.050161355804659205\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\ : 0.9032258064516129,\n \"acc_stderr\": 0.016818943416345197,\n \"\ acc_norm\": 0.9032258064516129,\n \"acc_norm_stderr\": 0.016818943416345197\n\ \ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\ : 0.6699507389162561,\n \"acc_stderr\": 0.03308530426228258,\n \"\ acc_norm\": 0.6699507389162561,\n \"acc_norm_stderr\": 0.03308530426228258\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.87,\n \"acc_stderr\": 0.03379976689896309,\n \"acc_norm\"\ : 0.87,\n \"acc_norm_stderr\": 0.03379976689896309\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.8545454545454545,\n \"acc_stderr\": 0.027530196355066584,\n\ \ \"acc_norm\": 0.8545454545454545,\n \"acc_norm_stderr\": 0.027530196355066584\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.9141414141414141,\n \"acc_stderr\": 0.01996022556317289,\n \"\ acc_norm\": 0.9141414141414141,\n \"acc_norm_stderr\": 0.01996022556317289\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.9689119170984456,\n \"acc_stderr\": 0.012525310625527041,\n\ \ \"acc_norm\": 0.9689119170984456,\n \"acc_norm_stderr\": 0.012525310625527041\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.8,\n \"acc_stderr\": 0.020280805062535722,\n \"acc_norm\"\ : 0.8,\n \"acc_norm_stderr\": 0.020280805062535722\n },\n \"harness|hendrycksTest-high_school_mathematics|5\"\ : {\n \"acc\": 
0.45555555555555555,\n \"acc_stderr\": 0.030364862504824435,\n\ \ \"acc_norm\": 0.45555555555555555,\n \"acc_norm_stderr\": 0.030364862504824435\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.8697478991596639,\n \"acc_stderr\": 0.0218632584948521,\n \ \ \"acc_norm\": 0.8697478991596639,\n \"acc_norm_stderr\": 0.0218632584948521\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.5364238410596026,\n \"acc_stderr\": 0.04071636065944217,\n \"\ acc_norm\": 0.5364238410596026,\n \"acc_norm_stderr\": 0.04071636065944217\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.9229357798165138,\n \"acc_stderr\": 0.011434381698911096,\n \"\ acc_norm\": 0.9229357798165138,\n \"acc_norm_stderr\": 0.011434381698911096\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.6805555555555556,\n \"acc_stderr\": 0.03179876342176852,\n \"\ acc_norm\": 0.6805555555555556,\n \"acc_norm_stderr\": 0.03179876342176852\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8921568627450981,\n \"acc_stderr\": 0.021770522281368398,\n \"\ acc_norm\": 0.8921568627450981,\n \"acc_norm_stderr\": 0.021770522281368398\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.9071729957805907,\n \"acc_stderr\": 0.01888975055095671,\n \ \ \"acc_norm\": 0.9071729957805907,\n \"acc_norm_stderr\": 0.01888975055095671\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.7982062780269058,\n\ \ \"acc_stderr\": 0.026936111912802263,\n \"acc_norm\": 0.7982062780269058,\n\ \ \"acc_norm_stderr\": 0.026936111912802263\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.9007633587786259,\n \"acc_stderr\": 0.02622223517147737,\n\ \ \"acc_norm\": 0.9007633587786259,\n \"acc_norm_stderr\": 0.02622223517147737\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.9008264462809917,\n \"acc_stderr\": 
0.02728524631275895,\n \"\ acc_norm\": 0.9008264462809917,\n \"acc_norm_stderr\": 0.02728524631275895\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.8425925925925926,\n\ \ \"acc_stderr\": 0.03520703990517963,\n \"acc_norm\": 0.8425925925925926,\n\ \ \"acc_norm_stderr\": 0.03520703990517963\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.8834355828220859,\n \"acc_stderr\": 0.025212327210507108,\n\ \ \"acc_norm\": 0.8834355828220859,\n \"acc_norm_stderr\": 0.025212327210507108\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.625,\n\ \ \"acc_stderr\": 0.04595091388086298,\n \"acc_norm\": 0.625,\n \ \ \"acc_norm_stderr\": 0.04595091388086298\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.8737864077669902,\n \"acc_stderr\": 0.03288180278808628,\n\ \ \"acc_norm\": 0.8737864077669902,\n \"acc_norm_stderr\": 0.03288180278808628\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.9188034188034188,\n\ \ \"acc_stderr\": 0.017893784904018516,\n \"acc_norm\": 0.9188034188034188,\n\ \ \"acc_norm_stderr\": 0.017893784904018516\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.85,\n \"acc_stderr\": 0.035887028128263714,\n \ \ \"acc_norm\": 0.85,\n \"acc_norm_stderr\": 0.035887028128263714\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.9016602809706258,\n\ \ \"acc_stderr\": 0.010648356301876338,\n \"acc_norm\": 0.9016602809706258,\n\ \ \"acc_norm_stderr\": 0.010648356301876338\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.8323699421965318,\n \"acc_stderr\": 0.020110579919734847,\n\ \ \"acc_norm\": 0.8323699421965318,\n \"acc_norm_stderr\": 0.020110579919734847\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.6558659217877095,\n\ \ \"acc_stderr\": 0.015889221313307094,\n \"acc_norm\": 0.6558659217877095,\n\ \ \"acc_norm_stderr\": 0.015889221313307094\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n 
\"acc\": 0.869281045751634,\n \"acc_stderr\": 0.01930187362421528,\n\ \ \"acc_norm\": 0.869281045751634,\n \"acc_norm_stderr\": 0.01930187362421528\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.8520900321543409,\n\ \ \"acc_stderr\": 0.020163253806284125,\n \"acc_norm\": 0.8520900321543409,\n\ \ \"acc_norm_stderr\": 0.020163253806284125\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.8703703703703703,\n \"acc_stderr\": 0.01868972572106206,\n\ \ \"acc_norm\": 0.8703703703703703,\n \"acc_norm_stderr\": 0.01868972572106206\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.6134751773049646,\n \"acc_stderr\": 0.029049190342543465,\n \ \ \"acc_norm\": 0.6134751773049646,\n \"acc_norm_stderr\": 0.029049190342543465\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.6127770534550195,\n\ \ \"acc_stderr\": 0.012441155326854933,\n \"acc_norm\": 0.6127770534550195,\n\ \ \"acc_norm_stderr\": 0.012441155326854933\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.875,\n \"acc_stderr\": 0.020089743302935947,\n \ \ \"acc_norm\": 0.875,\n \"acc_norm_stderr\": 0.020089743302935947\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.8316993464052288,\n \"acc_stderr\": 0.015135803338693386,\n \ \ \"acc_norm\": 0.8316993464052288,\n \"acc_norm_stderr\": 0.015135803338693386\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.7727272727272727,\n\ \ \"acc_stderr\": 0.04013964554072775,\n \"acc_norm\": 0.7727272727272727,\n\ \ \"acc_norm_stderr\": 0.04013964554072775\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.8734693877551021,\n \"acc_stderr\": 0.021282700626140575,\n\ \ \"acc_norm\": 0.8734693877551021,\n \"acc_norm_stderr\": 0.021282700626140575\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.9253731343283582,\n\ \ \"acc_stderr\": 0.01858193969849061,\n \"acc_norm\": 0.9253731343283582,\n\ \ 
\"acc_norm_stderr\": 0.01858193969849061\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.94,\n \"acc_stderr\": 0.023868325657594145,\n \ \ \"acc_norm\": 0.94,\n \"acc_norm_stderr\": 0.023868325657594145\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5903614457831325,\n\ \ \"acc_stderr\": 0.038284011150790206,\n \"acc_norm\": 0.5903614457831325,\n\ \ \"acc_norm_stderr\": 0.038284011150790206\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.9122807017543859,\n \"acc_stderr\": 0.021696383943889223,\n\ \ \"acc_norm\": 0.9122807017543859,\n \"acc_norm_stderr\": 0.021696383943889223\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3329253365973072,\n\ \ \"mc1_stderr\": 0.016497402382012055,\n \"mc2\": 0.5095160399804991,\n\ \ \"mc2_stderr\": 0.014553872488484169\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.8500394632991318,\n \"acc_stderr\": 0.010034394804580809\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.7369219105382866,\n \ \ \"acc_stderr\": 0.01212817260737593\n }\n}\n```" repo_url: https://huggingface.co/mistralai/Mixtral-8x22B-v0.1 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|arc:challenge|25_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-04-18T04-08-50.327748.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|gsm8k|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hellaswag|10_2024-04-18T04-08-50.327748.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-18T04-08-50.327748.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-18T04-08-50.327748.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-18T04-08-50.327748.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-18T04-08-50.327748.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-18T04-08-50.327748.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-18T04-08-50.327748.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-18T04-08-50.327748.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-management|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-virology|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-18T04-08-50.327748.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|truthfulqa:mc|0_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-04-18T04-08-50.327748.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_04_18T04_08_50.327748 path: - '**/details_harness|winogrande|5_2024-04-18T04-08-50.327748.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-04-18T04-08-50.327748.parquet' - config_name: results data_files: - split: 
2024_04_18T04_08_50.327748 path: - results_2024-04-18T04-08-50.327748.parquet - split: latest path: - results_2024-04-18T04-08-50.327748.parquet ---

# Dataset Card for Evaluation run of mistralai/Mixtral-8x22B-v0.1

Dataset automatically created during the evaluation run of model [mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can, for instance, do the following:

```python
from datasets import load_dataset

# The configurations in this card define a timestamped split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-04-18T04:08:50.327748](https://huggingface.co/datasets/open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1/blob/main/results_2024-04-18T04-08-50.327748.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.7754391186630896, "acc_stderr": 0.027791214665058565, "acc_norm": 0.7785933169200626, "acc_norm_stderr": 0.028326105199808844, "mc1": 0.3329253365973072, "mc1_stderr": 0.016497402382012055, "mc2": 0.5095160399804991, "mc2_stderr": 0.014553872488484169 }, "harness|arc:challenge|25": { "acc": 0.6672354948805461, "acc_stderr": 0.0137698630461923, "acc_norm": 0.7064846416382252, "acc_norm_stderr": 0.013307250444941122 }, "harness|hellaswag|10": { "acc": 0.7044413463453495, "acc_stderr": 0.00455360940574723, "acc_norm": 0.8873730332603067, "acc_norm_stderr": 0.0031549016391045916 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.55, "acc_stderr": 0.05, "acc_norm": 0.55, "acc_norm_stderr": 0.05 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.762962962962963, "acc_stderr": 0.03673731683969506, "acc_norm": 0.762962962962963, "acc_norm_stderr": 0.03673731683969506 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.868421052631579, "acc_stderr": 0.027508689533549905, "acc_norm": 0.868421052631579, "acc_norm_stderr": 0.027508689533549905 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.73, "acc_stderr": 0.044619604333847394, "acc_norm": 0.73, "acc_norm_stderr": 0.044619604333847394 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.8264150943396227, "acc_stderr": 0.02331058302600625, "acc_norm": 0.8264150943396227, "acc_norm_stderr": 0.02331058302600625 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.8958333333333334, "acc_stderr": 0.025545239210256917, "acc_norm": 0.8958333333333334, "acc_norm_stderr": 0.025545239210256917 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.6, "acc_stderr": 0.049236596391733084, "acc_norm": 0.6, "acc_norm_stderr": 0.049236596391733084 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.71, "acc_stderr": 0.04560480215720684, "acc_norm": 0.71, "acc_norm_stderr": 0.04560480215720684 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.48, "acc_stderr": 0.050211673156867795, "acc_norm": 0.48, "acc_norm_stderr": 0.050211673156867795 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.791907514450867, "acc_stderr": 0.030952890217749877, "acc_norm": 0.791907514450867, "acc_norm_stderr": 0.030952890217749877 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.5294117647058824, "acc_stderr": 0.049665709039785295, "acc_norm": 0.5294117647058824, "acc_norm_stderr": 0.049665709039785295 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.83, "acc_stderr": 0.03775251680686371, "acc_norm": 0.83, "acc_norm_stderr": 0.03775251680686371 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.8170212765957446, "acc_stderr": 0.02527604100044995, "acc_norm": 0.8170212765957446, "acc_norm_stderr": 0.02527604100044995 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.6754385964912281, "acc_stderr": 0.04404556157374768, "acc_norm": 0.6754385964912281, "acc_norm_stderr": 0.04404556157374768 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.7586206896551724, "acc_stderr": 0.03565998174135302, "acc_norm": 0.7586206896551724, "acc_norm_stderr": 0.03565998174135302 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.6164021164021164, "acc_stderr": 0.0250437573185202, "acc_norm": 0.6164021164021164, "acc_norm_stderr": 0.0250437573185202 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.6031746031746031, "acc_stderr": 0.0437588849272706, "acc_norm": 0.6031746031746031, "acc_norm_stderr": 0.0437588849272706 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.53, "acc_stderr": 0.050161355804659205, "acc_norm": 0.53, "acc_norm_stderr": 0.050161355804659205 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.9032258064516129, "acc_stderr": 0.016818943416345197, "acc_norm": 0.9032258064516129, "acc_norm_stderr": 0.016818943416345197 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.6699507389162561, "acc_stderr": 0.03308530426228258, "acc_norm": 0.6699507389162561, "acc_norm_stderr": 0.03308530426228258 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.87, "acc_stderr": 0.03379976689896309, "acc_norm": 0.87, "acc_norm_stderr": 0.03379976689896309 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.8545454545454545, "acc_stderr": 0.027530196355066584, "acc_norm": 0.8545454545454545, "acc_norm_stderr": 0.027530196355066584 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.9141414141414141, "acc_stderr": 0.01996022556317289, "acc_norm": 0.9141414141414141, "acc_norm_stderr": 0.01996022556317289 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.9689119170984456, "acc_stderr": 0.012525310625527041, "acc_norm": 0.9689119170984456, "acc_norm_stderr": 0.012525310625527041 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.8, "acc_stderr": 0.020280805062535722, "acc_norm": 0.8, "acc_norm_stderr": 0.020280805062535722 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.45555555555555555, "acc_stderr": 0.030364862504824435, "acc_norm": 0.45555555555555555, "acc_norm_stderr": 0.030364862504824435 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.8697478991596639, "acc_stderr": 0.0218632584948521, "acc_norm": 0.8697478991596639, "acc_norm_stderr": 0.0218632584948521 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.5364238410596026, "acc_stderr": 0.04071636065944217, "acc_norm": 0.5364238410596026, "acc_norm_stderr": 0.04071636065944217 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.9229357798165138, "acc_stderr": 0.011434381698911096, "acc_norm": 0.9229357798165138, "acc_norm_stderr": 0.011434381698911096 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.6805555555555556, "acc_stderr": 0.03179876342176852, "acc_norm": 0.6805555555555556, "acc_norm_stderr": 0.03179876342176852 }, 
"harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8921568627450981, "acc_stderr": 0.021770522281368398, "acc_norm": 0.8921568627450981, "acc_norm_stderr": 0.021770522281368398 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.9071729957805907, "acc_stderr": 0.01888975055095671, "acc_norm": 0.9071729957805907, "acc_norm_stderr": 0.01888975055095671 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.7982062780269058, "acc_stderr": 0.026936111912802263, "acc_norm": 0.7982062780269058, "acc_norm_stderr": 0.026936111912802263 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.9007633587786259, "acc_stderr": 0.02622223517147737, "acc_norm": 0.9007633587786259, "acc_norm_stderr": 0.02622223517147737 }, "harness|hendrycksTest-international_law|5": { "acc": 0.9008264462809917, "acc_stderr": 0.02728524631275895, "acc_norm": 0.9008264462809917, "acc_norm_stderr": 0.02728524631275895 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.8425925925925926, "acc_stderr": 0.03520703990517963, "acc_norm": 0.8425925925925926, "acc_norm_stderr": 0.03520703990517963 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.8834355828220859, "acc_stderr": 0.025212327210507108, "acc_norm": 0.8834355828220859, "acc_norm_stderr": 0.025212327210507108 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.625, "acc_stderr": 0.04595091388086298, "acc_norm": 0.625, "acc_norm_stderr": 0.04595091388086298 }, "harness|hendrycksTest-management|5": { "acc": 0.8737864077669902, "acc_stderr": 0.03288180278808628, "acc_norm": 0.8737864077669902, "acc_norm_stderr": 0.03288180278808628 }, "harness|hendrycksTest-marketing|5": { "acc": 0.9188034188034188, "acc_stderr": 0.017893784904018516, "acc_norm": 0.9188034188034188, "acc_norm_stderr": 0.017893784904018516 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.85, "acc_stderr": 0.035887028128263714, "acc_norm": 0.85, "acc_norm_stderr": 0.035887028128263714 }, "harness|hendrycksTest-miscellaneous|5": { 
"acc": 0.9016602809706258, "acc_stderr": 0.010648356301876338, "acc_norm": 0.9016602809706258, "acc_norm_stderr": 0.010648356301876338 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.8323699421965318, "acc_stderr": 0.020110579919734847, "acc_norm": 0.8323699421965318, "acc_norm_stderr": 0.020110579919734847 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.6558659217877095, "acc_stderr": 0.015889221313307094, "acc_norm": 0.6558659217877095, "acc_norm_stderr": 0.015889221313307094 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.869281045751634, "acc_stderr": 0.01930187362421528, "acc_norm": 0.869281045751634, "acc_norm_stderr": 0.01930187362421528 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.8520900321543409, "acc_stderr": 0.020163253806284125, "acc_norm": 0.8520900321543409, "acc_norm_stderr": 0.020163253806284125 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.8703703703703703, "acc_stderr": 0.01868972572106206, "acc_norm": 0.8703703703703703, "acc_norm_stderr": 0.01868972572106206 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.6134751773049646, "acc_stderr": 0.029049190342543465, "acc_norm": 0.6134751773049646, "acc_norm_stderr": 0.029049190342543465 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.6127770534550195, "acc_stderr": 0.012441155326854933, "acc_norm": 0.6127770534550195, "acc_norm_stderr": 0.012441155326854933 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.875, "acc_stderr": 0.020089743302935947, "acc_norm": 0.875, "acc_norm_stderr": 0.020089743302935947 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.8316993464052288, "acc_stderr": 0.015135803338693386, "acc_norm": 0.8316993464052288, "acc_norm_stderr": 0.015135803338693386 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.7727272727272727, "acc_stderr": 0.04013964554072775, "acc_norm": 0.7727272727272727, "acc_norm_stderr": 0.04013964554072775 }, "harness|hendrycksTest-security_studies|5": { "acc": 
0.8734693877551021, "acc_stderr": 0.021282700626140575, "acc_norm": 0.8734693877551021, "acc_norm_stderr": 0.021282700626140575 }, "harness|hendrycksTest-sociology|5": { "acc": 0.9253731343283582, "acc_stderr": 0.01858193969849061, "acc_norm": 0.9253731343283582, "acc_norm_stderr": 0.01858193969849061 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.94, "acc_stderr": 0.023868325657594145, "acc_norm": 0.94, "acc_norm_stderr": 0.023868325657594145 }, "harness|hendrycksTest-virology|5": { "acc": 0.5903614457831325, "acc_stderr": 0.038284011150790206, "acc_norm": 0.5903614457831325, "acc_norm_stderr": 0.038284011150790206 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.9122807017543859, "acc_stderr": 0.021696383943889223, "acc_norm": 0.9122807017543859, "acc_norm_stderr": 0.021696383943889223 }, "harness|truthfulqa:mc|0": { "mc1": 0.3329253365973072, "mc1_stderr": 0.016497402382012055, "mc2": 0.5095160399804991, "mc2_stderr": 0.014553872488484169 }, "harness|winogrande|5": { "acc": 0.8500394632991318, "acc_stderr": 0.010034394804580809 }, "harness|gsm8k|5": { "acc": 0.7369219105382866, "acc_stderr": 0.01212817260737593 } } ```
Provider:
open-llm-leaderboard
Summary of original information

Dataset overview

Dataset name

  • pretty_name: Evaluation run of mistralai/Mixtral-8x22B-v0.1

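Dataset description

  • dataset_summary: This dataset was created automatically during the evaluation run of the model mistralai/Mixtral-8x22B-v0.1 on the Open LLM Leaderboard.
  • Composition: 63 configurations, each corresponding to one evaluated task.
  • Creation: the dataset was created from 1 run. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • Special configuration: an additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics.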

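Dataset loading example

```python
from datasets import load_dataset

# The configurations listed in this card define a timestamped split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1",
    "harness_winogrande_5",
    split="latest",
)
```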
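
The configuration name passed to `load_dataset`, the timestamped split name, and the parquet file patterns all appear to be derived mechanically from the harness task identifier and the run timestamp. The following is a minimal sketch of that mapping, inferred only from the names visible in this card. The helper names are hypothetical, and the substitution rule is an assumption rather than documented leaderboard behaviour.

```python
import re

RUN_TIMESTAMP = "2024-04-18T04:08:50.327748"  # run timestamp quoted in this card


def config_name(task_key: str) -> str:
    """Assumed rule: '|', ':' and '-' in the task identifier become '_'."""
    return re.sub(r"[|:-]", "_", task_key)


def file_stamp(run_timestamp: str) -> str:
    """Assumed rule: file names use the timestamp with ':' replaced by '-'."""
    return run_timestamp.replace(":", "-")


def split_name(run_timestamp: str) -> str:
    """Assumed rule: split names use the file stamp with '-' replaced by '_'."""
    return file_stamp(run_timestamp).replace("-", "_")


def parquet_glob(task_key: str, run_timestamp: str) -> str:
    """Glob pattern for the per-task details file, as it appears in the card's YAML."""
    return f"**/details_{task_key}_{file_stamp(run_timestamp)}.parquet"


if __name__ == "__main__":
    for key in ("harness|arc:challenge|25", "harness|hendrycksTest-college_physics|5"):
        print(config_name(key), split_name(RUN_TIMESTAMP), parquet_glob(key, RUN_TIMESTAMP))
```

This can help when iterating over many tasks, since the configuration name can be computed from the keys of the results dictionary instead of being typed by hand. Any name produced this way should still be checked against the configuration list before use.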

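Latest results

  • Source: results from the run at 2024-04-18T04:08:50.327748.
  • Content: evaluation results for multiple tasks, such as accuracy (acc) and the standard error of the accuracy (acc_stderr).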
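
Because each score is reported together with its standard error, an approximate confidence interval can be attached to it. The sketch below assumes a normal approximation (value ± 1.96 × stderr for roughly 95% coverage), which the card itself does not prescribe. The function name is hypothetical, and the inputs are the winogrande figures from the results above.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval: value ± z * stderr (an assumption, not the card's method)."""
    half_width = z * stderr
    return value - half_width, value + half_width


# Figures copied from the "harness|winogrande|5" entry of the latest results.
winogrande_acc = 0.8500394632991318
winogrande_stderr = 0.010034394804580809

low, high = confidence_interval(winogrande_acc, winogrande_stderr)
print(f"winogrande acc: {winogrande_acc:.4f} (approx. 95% CI {low:.4f} to {high:.4f})")
```

For scores close to 0 or 1, or for tasks with few examples, this approximation can produce bounds outside the [0, 1] range, so it is best read as a rough indication of uncertainty.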

Dataset configuration details

Configuration list

  1. config_name: harness_arc_challenge_25

    • data_files:
      • split: 2024_04_18T04_08_50.327748
        • path: '**/details_harness|arc:challenge|25_2024-04-18T04-08-50.327748.parquet'
      • split: latest
        • path: '**/details_harness|arc:challenge|25_2024-04-18T04-08-50.327748.parquet'
  2. config_name: harness_gsm8k_5

    • data_files:
      • split: 2024_04_18T04_08_50.327748
        • path: '**/details_harness|gsm8k|5_2024-04-18T04-08-50.327748.parquet'
      • split: latest
        • path: '**/details_harness|gsm8k|5_2024-04-18T04-08-50.327748.parquet'
  3. config_name: harness_hellaswag_10

    • data_files:
      • split: 2024_04_18T04_08_50.327748
        • path: '**/details_harness|hellaswag|10_2024-04-18T04-08-50.327748.parquet'
      • split: latest
        • path: '**/details_harness|hellaswag|10_2024-04-18T04-08-50.327748.parquet'
  4. config_name: harness_hendrycksTest_5

    • data_files:
      • split: 2024_04_18T04_08_50.327748
        • path:
          • '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-anatomy|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-astronomy|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-business_ethics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-college_biology|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-college_medicine|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-college_physics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-computer_security|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-econometrics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-formal_logic|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-global_facts|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-human_aging|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-international_law|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-machine_learning|5_2024-04-18T04-08-50.327748.parquet'
          • '**/details_harness|hendrycksTest-management|5_2024-04-18T04-08-50.327748.parquet'
          • /details_harness|hendrycksTest-marketing|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-medical_genetics|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-miscellaneous|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-moral_disputes|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-moral_scenarios|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-nutrition|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-philosophy|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-prehistory|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-professional_accounting|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-professional_law|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-professional_medicine|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-professional_psychology|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-public_relations|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-security_studies|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-sociology|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-us_foreign_policy|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-virology|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-world_religions|5_2024-04-18T04-08-50.327748.parquet
      • split: latest
        • path:
          • /details_harness|hendrycksTest-abstract_algebra|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-anatomy|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-astronomy|5_2024-04-18T04-08-50.327748.parquet
          • /details_harness|hendrycksTest-business_ethics|5_2024-04-18T04-08-50.327748.parquet
          • **/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-18T04-08-50
Collected summary
Dataset introduction
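Construction
The Open LLM Leaderboard provides a standardized framework for evaluating large language models. This dataset was generated automatically by Hugging Face's Open LLM Leaderboard during its evaluation run of mistralai/Mixtral-8x22B-v0.1. It contains 63 configurations, each corresponding to one evaluated task, such as ARC Challenge, HellaSwag, or GSM8K. Each run is identified by its timestamp and stored as a separate split. The dataset card states that the "train" split always points to the latest results, although the configuration listing above names that split "latest". An additional configuration, "results", stores the aggregated metrics of the run, which the Leaderboard uses to compute and display the model's overall scores.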
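Features
The dataset's defining features are its structured task coverage and its update mechanism. Its 63 configurations span commonsense reasoning, mathematical problem solving, knowledge question answering, multi-domain academic testing (the HendrycksTest series covers 57 subjects, from abstract algebra to virology), and truthfulness judgments. The data for each task configuration is stored in Parquet format for efficient processing. The "latest" split tracks the most recent evaluation results, while earlier runs are kept in timestamped splits, which allows results to be traced back and compared across runs. This layout keeps the published results current and preserves a record of how the model's measured performance changes between runs.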
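The Parquet file names in the configuration listing appear to follow a `details_<suite>|<task>|<count>_<timestamp>.parquet` pattern, where the count is presumably the number of few-shot examples. The following is a minimal sketch, assuming that pattern holds for every file (it is inferred from the listed paths, not documented in the card), of splitting such a path into its parts. `DetailsFile` and `parse_details_path` are hypothetical names introduced only for illustration.

```python
import re
from typing import NamedTuple


class DetailsFile(NamedTuple):
    """Parts of a details file name (hypothetical helper type)."""
    suite: str        # e.g. "harness"
    task: str         # e.g. "hendrycksTest-anatomy"
    num_fewshot: int  # assumed meaning of the number after the second "|"
    timestamp: str    # e.g. "2024-04-18T04-08-50.327748"


# Assumed pattern: details_<suite>|<task>|<count>_<timestamp>.parquet
_DETAILS_PATTERN = re.compile(
    r"details_(?P<suite>[^|]+)\|(?P<task>[^|]+)\|(?P<shots>\d+)_(?P<ts>.+)\.parquet$"
)


def parse_details_path(path: str) -> DetailsFile:
    """Split a details Parquet path into suite, task, count, and timestamp."""
    match = _DETAILS_PATTERN.search(path)
    if match is None:
        raise ValueError(f"path does not match the assumed pattern: {path!r}")
    return DetailsFile(
        suite=match["suite"],
        task=match["task"],
        num_fewshot=int(match["shots"]),
        timestamp=match["ts"],
    )


if __name__ == "__main__":
    example = "/details_harness|hendrycksTest-anatomy|5_2024-04-18T04-08-50.327748.parquet"
    print(parse_details_path(example))
```

Raising on a non-matching path, rather than returning partial fields, makes it obvious when a file name departs from the assumed pattern.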
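Usage
The dataset can be loaded with the Hugging Face datasets library. For example, the card's own snippet, `load_dataset("open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1", "harness_winogrande_5", split="train")`, loads the latest evaluation details for the Winogrande task. To access the results of a specific run, replace the split name with the corresponding timestamp string (for example `"2024_04_18T04_08_50.327748"`). Loading the "results" configuration returns the aggregated metrics for all tasks, such as accuracy (acc) and its standard error (acc_stderr), so the model's performance across benchmarks can be reviewed and compared in one place.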
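The split names in the configuration listing look like the run timestamp with `-` and `:` replaced by `_` (run `2024-04-18T04:08:50.327748`, split `2024_04_18T04_08_50.327748`). The sketch below assumes that rule, which is inferred from this single run rather than documented. `split_name_for_run` and `load_run_details` are hypothetical helpers; `load_dataset` is the real `datasets` function used in the card. The name of the latest split is left as a parameter because the card text says "train" while the configuration listing says "latest".

```python
from typing import Optional

REPO_ID = "open-llm-leaderboard/details_mistralai__Mixtral-8x22B-v0.1"


def split_name_for_run(run_timestamp: str) -> str:
    """Convert a run timestamp into the assumed split name.

    Assumption: the split name is the timestamp with "-" and ":" replaced
    by "_", as seen for the one run present in this dataset.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def load_run_details(
    config: str,
    run_timestamp: Optional[str] = None,
    latest_split: str = "latest",  # the card text uses "train" instead
):
    """Load one task configuration, for a given run or for the latest run."""
    # Imported here so the pure helper above works without `datasets` installed.
    from datasets import load_dataset

    split = latest_split if run_timestamp is None else split_name_for_run(run_timestamp)
    return load_dataset(REPO_ID, config, split=split)


if __name__ == "__main__":
    print(split_name_for_run("2024-04-18T04:08:50.327748"))
    # Requires network access and the `datasets` package:
    # details = load_run_details("harness_winogrande_5", "2024-04-18T04:08:50.327748")
```

If loading with one split name fails, trying the other name (`"train"` or `"latest"`) is a reasonable first check, given the mismatch between the card text and the listing.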
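Background and challenges
Background overview
As the capabilities of large language models (LLMs) have advanced rapidly, evaluating their performance systematically has become a central question in natural language processing. The Open LLM Leaderboard was launched by the Hugging Face team in 2023 to give the community a standardized and transparent platform for model evaluation. This dataset records the evaluation results of Mixtral-8x22B-v0.1, released by Mistral AI in April 2024, a mixture-of-experts (MoE) model with 141 billion parameters. The evaluation covers 63 task configurations, including ARC Challenge, HellaSwag, MMLU (57 subjects), and GSM8K, and examines the model's reasoning, commonsense, mathematical, and domain-knowledge abilities. The dataset gives researchers a reproducible performance baseline and has helped raise the profile of MoE architectures in the open-source community.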
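Current challenges
The main challenges for this dataset stem from the complexity of LLM evaluation. At the domain level, designing comprehensive and unbiased tasks that cover a model's many capabilities is difficult. MMLU, for example, is broad in subject coverage, yet on some tasks, such as high school mathematics, accuracy is only 45.56%, which exposes the model's weakness in complex reasoning. On the construction side, automated generation depends on a unified evaluation framework (such as LM Evaluation Harness), but format differences between tasks, the choice of metric (for example acc_norm versus mc1), and the reproducibility of results all pose problems. In addition, models iterate quickly, so the results of a single run can soon become outdated and need continual updating. Balancing the breadth and depth of evaluation against its computational cost remains an open problem.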
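Common scenarios
Typical use cases
Mixtral-8x22B-v0.1 is a model built on a mixture-of-experts (MoE) architecture, and the evaluation dataset published for it serves as a reference point for that class of model. The dataset is generated automatically by the Open LLM Leaderboard and contains fine-grained configurations for 63 evaluation tasks, each corresponding to a specific benchmark, such as ARC-Challenge, HellaSwag, GSM8K, Winogrande, and the 57-subject MMLU benchmark. Its typical use is fine-grained analysis of model capabilities: by loading the results for a specific task (for example harness_winogrande_5), researchers can examine the model's performance on commonsense reasoning, mathematical problem solving, and knowledge understanding, which supports both comparison across models and analysis across iterations.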
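Academic problems addressed
The dataset's main academic contribution is to address the fragmentation and poor reproducibility of LLM evaluation results. Model evaluation has traditionally relied on scattered benchmarks without a common data format or standardized procedure, which makes results from different studies hard to compare. The Open LLM Leaderboard stores the complete details and aggregated metrics (such as acc and acc_norm) of each evaluation run in a structured form, creating a traceable and reproducible evaluation record. This supports researchers in two ways. First, it reduces the noise introduced by differing evaluation setups, so performance comparisons between models are more reliable. Second, by publishing fine-grained results, including a standard error for each subtask, it enables closer study of the limits of model capability and of failure modes, which can guide later model improvements.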
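The results in this dataset report a standard error next to each score (for example `acc` and `acc_stderr`). As a rough illustration of how those two numbers can be combined, the sketch below computes an approximate 95% interval using a normal approximation. That approximation, the clamping to [0, 1], and the helper name `approx_confidence_interval` are assumptions made for this example and are not something the leaderboard itself defines. The input values are the ARC Challenge `acc` and `acc_stderr` from the card's latest results.

```python
from typing import Tuple


def approx_confidence_interval(
    value: float, stderr: float, z: float = 1.96
) -> Tuple[float, float]:
    """Return an approximate interval value +/- z * stderr, clamped to [0, 1].

    Assumes the score is approximately normally distributed; z = 1.96
    corresponds to a two-sided 95% interval under that assumption.
    """
    half_width = z * stderr
    return max(0.0, value - half_width), min(1.0, value + half_width)


if __name__ == "__main__":
    # ARC Challenge (25-shot) values taken from the card's "Latest results".
    acc = 0.6672354948805461
    acc_stderr = 0.0137698630461923
    low, high = approx_confidence_interval(acc, acc_stderr)
    print(f"acc = {acc:.4f}, approx. 95% interval = [{low:.4f}, {high:.4f}]")
```

When two models' intervals on the same task overlap heavily, the difference between their scores may not be meaningful; a proper comparison would use a paired test on the per-example details rather than this shortcut.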
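Derived and related work
Several academic and engineering efforts have built on this dataset. On the methodological side, researchers have used its fine-grained results to propose sparsity-analysis frameworks for MoE architectures, examining how expert-routing strategies contribute to performance on specific tasks (for example, HellaSwag, where acc_norm reaches 0.887). On the tooling side, the dataset has encouraged wide adoption of the Open LLM Leaderboard's automated evaluation pipeline and has indirectly promoted the standardization of open-source tools such as the Language Model Evaluation Harness. In addition, several follow-up efforts, such as optimized versions of Mixtral 8x22B and research on compressing and distilling mixture-of-experts models, have used this dataset as a benchmark for validating results, forming a research ecosystem around MoE model evaluation.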
The content above was collected and summarized by 遇见数据集.