
open-llm-leaderboard-old/details_cloudyu__Mixtral_7Bx6_MoE_35B

Hugging Face · Updated 2024-01-14 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_cloudyu__Mixtral_7Bx6_MoE_35B
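The dataset card below stores one configuration per evaluated task and one split per run. Judging only from the names listed in the card, the naming works like this:

- A configuration name is the harness task key with `|`, `:` and `-` replaced by `_`. For example, `harness|arc:challenge|25` becomes `harness_arc_challenge_25`.
- A split name is the run timestamp with `-` and `:` replaced by `_`.
- Each parquet file name keeps the raw task key and writes the timestamp with `:` replaced by `-`.

The minimal sketch below encodes that inferred convention. The helper names are hypothetical, and the rules are read off this card rather than taken from any documented specification.

```python
from __future__ import annotations

from datetime import datetime

# Hypothetical helpers. The naming rules below are inferred from the
# config, split and file names listed in this dataset card; they are
# not taken from a documented specification.


def config_name(task_key: str) -> str:
    """Map a harness task key to a config name,
    e.g. 'harness|arc:challenge|25' -> 'harness_arc_challenge_25'."""
    for ch in "|:-":
        task_key = task_key.replace(ch, "_")
    return task_key


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to a split name,
    e.g. '2024-01-14T16:00:09.048254' -> '2024_01_14T16_00_09.048254'."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_pattern(task_key: str, run_timestamp: str) -> str:
    """Rebuild the glob pattern used in the card's data_files entries."""
    return f"**/details_{task_key}_{run_timestamp.replace(':', '-')}.parquet"


def latest_run(run_timestamps: list[str]) -> str:
    """Return the most recent run; its files back the 'latest' split."""
    return max(run_timestamps, key=datetime.fromisoformat)


if __name__ == "__main__":
    runs = ["2024-01-12T00:20:46.590520", "2024-01-14T16:00:09.048254"]
    task = "harness|winogrande|5"
    newest = latest_run(runs)
    print(config_name(task))             # harness_winogrande_5
    print(split_name(newest))            # 2024_01_14T16_00_09.048254
    print(parquet_pattern(task, newest))
```

Under these assumptions, `config_name(task)` is the value for the second (`name`) argument of `datasets.load_dataset`. Either `split_name(...)` or `"latest"` goes in `split`.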
Resource description:
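A note on reading the results in the card that follows. The `acc_stderr` values appear to be the standard error of the mean of per-example 0/1 scores, computed with the sample (n - 1) variance. For accuracy `p` over `n` examples, that is `sqrt(p * (1 - p) / (n - 1))`.

The card does not state `n`. The sketch assumes commonly cited evaluation-set sizes: 100 questions for MMLU abstract algebra, 1172 for ARC-Challenge, 1267 for Winogrande and 1319 for GSM8K. For each task it checks two things: that the assumed size turns the reported accuracy into a whole number of correct answers, and that the formula then reproduces the reported error. Treat a match as a consistency check, not as documentation of how the leaderboard computed these figures.

```python
import math

# Reported (acc, acc_stderr) values copied from the card's "Latest results".
# The example counts n (third field) are ASSUMED standard evaluation-set
# sizes; the card itself does not state them.
REPORTED = {
    "harness|hendrycksTest-abstract_algebra|5": (0.31, 0.04648231987117316, 100),
    "harness|arc:challenge|25": (0.674061433447099, 0.013697432466693246, 1172),
    "harness|winogrande|5": (0.8113654301499605, 0.010995172318019813, 1267),
    "harness|gsm8k|5": (0.7126611068991661, 0.012464677060107081, 1319),
}


def mean_stderr(correct: int, n: int) -> float:
    """Standard error of the mean of n 0/1 scores using the sample
    (n - 1) variance: sqrt(p * (1 - p) / (n - 1))."""
    p = correct / n
    return math.sqrt(p * (1 - p) / (n - 1))


def check(task: str) -> bool:
    """True if the assumed n reproduces both the accuracy and the stderr."""
    acc, reported_stderr, n = REPORTED[task]
    correct = round(acc * n)  # recover the whole-number count of hits
    if not math.isclose(correct / n, acc, rel_tol=1e-12):
        return False  # the assumed n does not fit the reported accuracy
    return math.isclose(mean_stderr(correct, n), reported_stderr, rel_tol=1e-9)


if __name__ == "__main__":
    for task in REPORTED:
        print(task, check(task))
```

The error scales with `1/sqrt(n - 1)`. Subjects with about 100 questions therefore carry a standard error near 0.05, so small gaps between MMLU subjects in the table should be read with that uncertainty in mind.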
--- pretty_name: Evaluation run of cloudyu/Mixtral_7Bx6_MoE_35B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [cloudyu/Mixtral_7Bx6_MoE_35B](https://huggingface.co/cloudyu/Mixtral_7Bx6_MoE_35B)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the most recent results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_cloudyu__Mixtral_7Bx6_MoE_35B\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-14T16:00:09.048254](https://huggingface.co/datasets/open-llm-leaderboard/details_cloudyu__Mixtral_7Bx6_MoE_35B/blob/main/results_2024-01-14T16-00-09.048254.json) (note\ \ that there might be results for other tasks in the repository if successive evals didn't\ \ cover the same tasks. 
You can find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6538307305259115,\n\ \ \"acc_stderr\": 0.03206532838135927,\n \"acc_norm\": 0.6536540314559122,\n\ \ \"acc_norm_stderr\": 0.03272839976259325,\n \"mc1\": 0.5055079559363526,\n\ \ \"mc1_stderr\": 0.01750243899045107,\n \"mc2\": 0.6576763693172452,\n\ \ \"mc2_stderr\": 0.01500859930650817\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.674061433447099,\n \"acc_stderr\": 0.013697432466693246,\n\ \ \"acc_norm\": 0.6996587030716723,\n \"acc_norm_stderr\": 0.013395909309957005\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6878111929894444,\n\ \ \"acc_stderr\": 0.0046243936909669,\n \"acc_norm\": 0.8681537542322246,\n\ \ \"acc_norm_stderr\": 0.0033763209559167064\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6444444444444445,\n\ \ \"acc_stderr\": 0.04135176749720385,\n \"acc_norm\": 0.6444444444444445,\n\ \ \"acc_norm_stderr\": 0.04135176749720385\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7105263157894737,\n \"acc_stderr\": 0.03690677986137283,\n\ \ \"acc_norm\": 0.7105263157894737,\n \"acc_norm_stderr\": 0.03690677986137283\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.65,\n\ \ \"acc_stderr\": 0.0479372485441102,\n \"acc_norm\": 0.65,\n \ \ \"acc_norm_stderr\": 0.0479372485441102\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7132075471698113,\n \"acc_stderr\": 0.02783491252754406,\n\ \ \"acc_norm\": 0.7132075471698113,\n \"acc_norm_stderr\": 0.02783491252754406\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7638888888888888,\n\ \ \"acc_stderr\": 0.03551446610810826,\n \"acc_norm\": 0.7638888888888888,\n\ \ \"acc_norm_stderr\": 0.03551446610810826\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.51,\n \"acc_stderr\": 0.05024183937956912,\n \ \ \"acc_norm\": 0.51,\n \"acc_norm_stderr\": 0.05024183937956912\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.57,\n \"acc_stderr\": 0.04975698519562428,\n \"acc_norm\": 0.57,\n\ \ \"acc_norm_stderr\": 0.04975698519562428\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6647398843930635,\n\ \ \"acc_stderr\": 0.03599586301247077,\n \"acc_norm\": 0.6647398843930635,\n\ \ \"acc_norm_stderr\": 0.03599586301247077\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.43137254901960786,\n \"acc_stderr\": 0.04928099597287533,\n\ \ \"acc_norm\": 0.43137254901960786,\n \"acc_norm_stderr\": 0.04928099597287533\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.76,\n \"acc_stderr\": 0.04292346959909282,\n \"acc_norm\": 0.76,\n\ \ \"acc_norm_stderr\": 0.04292346959909282\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5914893617021276,\n \"acc_stderr\": 0.032134180267015755,\n\ \ \"acc_norm\": 0.5914893617021276,\n \"acc_norm_stderr\": 0.032134180267015755\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5,\n\ \ \"acc_stderr\": 0.047036043419179864,\n \"acc_norm\": 0.5,\n \ \ \"acc_norm_stderr\": 0.047036043419179864\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5724137931034483,\n \"acc_stderr\": 0.04122737111370333,\n\ \ \"acc_norm\": 0.5724137931034483,\n \"acc_norm_stderr\": 0.04122737111370333\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.4470899470899471,\n \"acc_stderr\": 0.025606723995777025,\n \"\ acc_norm\": 0.4470899470899471,\n \"acc_norm_stderr\": 0.025606723995777025\n\ \ 
},\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4523809523809524,\n\ \ \"acc_stderr\": 0.044518079590553275,\n \"acc_norm\": 0.4523809523809524,\n\ \ \"acc_norm_stderr\": 0.044518079590553275\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.36,\n \"acc_stderr\": 0.048241815132442176,\n \ \ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.048241815132442176\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\ : 0.7774193548387097,\n \"acc_stderr\": 0.023664216671642518,\n \"\ acc_norm\": 0.7774193548387097,\n \"acc_norm_stderr\": 0.023664216671642518\n\ \ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\ : 0.4876847290640394,\n \"acc_stderr\": 0.035169204442208966,\n \"\ acc_norm\": 0.4876847290640394,\n \"acc_norm_stderr\": 0.035169204442208966\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \"acc_norm\"\ : 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7818181818181819,\n \"acc_stderr\": 0.03225078108306289,\n\ \ \"acc_norm\": 0.7818181818181819,\n \"acc_norm_stderr\": 0.03225078108306289\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7929292929292929,\n \"acc_stderr\": 0.028869778460267045,\n \"\ acc_norm\": 0.7929292929292929,\n \"acc_norm_stderr\": 0.028869778460267045\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.9067357512953368,\n \"acc_stderr\": 0.02098685459328973,\n\ \ \"acc_norm\": 0.9067357512953368,\n \"acc_norm_stderr\": 0.02098685459328973\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6615384615384615,\n \"acc_stderr\": 0.023991500500313036,\n\ \ \"acc_norm\": 0.6615384615384615,\n \"acc_norm_stderr\": 0.023991500500313036\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 
0.3296296296296296,\n \"acc_stderr\": 0.028661201116524565,\n \ \ \"acc_norm\": 0.3296296296296296,\n \"acc_norm_stderr\": 0.028661201116524565\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6848739495798319,\n \"acc_stderr\": 0.030176808288974337,\n\ \ \"acc_norm\": 0.6848739495798319,\n \"acc_norm_stderr\": 0.030176808288974337\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.33112582781456956,\n \"acc_stderr\": 0.038425817186598696,\n \"\ acc_norm\": 0.33112582781456956,\n \"acc_norm_stderr\": 0.038425817186598696\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8513761467889909,\n \"acc_stderr\": 0.015251253773660831,\n \"\ acc_norm\": 0.8513761467889909,\n \"acc_norm_stderr\": 0.015251253773660831\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5277777777777778,\n \"acc_stderr\": 0.0340470532865388,\n \"acc_norm\"\ : 0.5277777777777778,\n \"acc_norm_stderr\": 0.0340470532865388\n },\n\ \ \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\": 0.803921568627451,\n\ \ \"acc_stderr\": 0.027865942286639318,\n \"acc_norm\": 0.803921568627451,\n\ \ \"acc_norm_stderr\": 0.027865942286639318\n },\n \"harness|hendrycksTest-high_school_world_history|5\"\ : {\n \"acc\": 0.8016877637130801,\n \"acc_stderr\": 0.02595502084162113,\n\ \ \"acc_norm\": 0.8016877637130801,\n \"acc_norm_stderr\": 0.02595502084162113\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6905829596412556,\n\ \ \"acc_stderr\": 0.03102441174057221,\n \"acc_norm\": 0.6905829596412556,\n\ \ \"acc_norm_stderr\": 0.03102441174057221\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7786259541984732,\n \"acc_stderr\": 0.03641297081313729,\n\ \ \"acc_norm\": 0.7786259541984732,\n \"acc_norm_stderr\": 0.03641297081313729\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7851239669421488,\n \"acc_stderr\": 
0.037494924487096966,\n \"\ acc_norm\": 0.7851239669421488,\n \"acc_norm_stderr\": 0.037494924487096966\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7685185185185185,\n\ \ \"acc_stderr\": 0.04077494709252626,\n \"acc_norm\": 0.7685185185185185,\n\ \ \"acc_norm_stderr\": 0.04077494709252626\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7730061349693251,\n \"acc_stderr\": 0.03291099578615769,\n\ \ \"acc_norm\": 0.7730061349693251,\n \"acc_norm_stderr\": 0.03291099578615769\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.4642857142857143,\n\ \ \"acc_stderr\": 0.04733667890053756,\n \"acc_norm\": 0.4642857142857143,\n\ \ \"acc_norm_stderr\": 0.04733667890053756\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7766990291262136,\n \"acc_stderr\": 0.04123553189891431,\n\ \ \"acc_norm\": 0.7766990291262136,\n \"acc_norm_stderr\": 0.04123553189891431\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8974358974358975,\n\ \ \"acc_stderr\": 0.01987565502786744,\n \"acc_norm\": 0.8974358974358975,\n\ \ \"acc_norm_stderr\": 0.01987565502786744\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.71,\n \"acc_stderr\": 0.045604802157206845,\n \ \ \"acc_norm\": 0.71,\n \"acc_norm_stderr\": 0.045604802157206845\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8301404853128991,\n\ \ \"acc_stderr\": 0.013428186370608306,\n \"acc_norm\": 0.8301404853128991,\n\ \ \"acc_norm_stderr\": 0.013428186370608306\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7485549132947977,\n \"acc_stderr\": 0.02335736578587403,\n\ \ \"acc_norm\": 0.7485549132947977,\n \"acc_norm_stderr\": 0.02335736578587403\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.43687150837988825,\n\ \ \"acc_stderr\": 0.016588680864530626,\n \"acc_norm\": 0.43687150837988825,\n\ \ \"acc_norm_stderr\": 0.016588680864530626\n },\n 
\"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7352941176470589,\n \"acc_stderr\": 0.02526169121972948,\n\ \ \"acc_norm\": 0.7352941176470589,\n \"acc_norm_stderr\": 0.02526169121972948\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7009646302250804,\n\ \ \"acc_stderr\": 0.02600330111788514,\n \"acc_norm\": 0.7009646302250804,\n\ \ \"acc_norm_stderr\": 0.02600330111788514\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7407407407407407,\n \"acc_stderr\": 0.02438366553103545,\n\ \ \"acc_norm\": 0.7407407407407407,\n \"acc_norm_stderr\": 0.02438366553103545\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.46099290780141844,\n \"acc_stderr\": 0.029736592526424438,\n \ \ \"acc_norm\": 0.46099290780141844,\n \"acc_norm_stderr\": 0.029736592526424438\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4589308996088657,\n\ \ \"acc_stderr\": 0.012727084826799798,\n \"acc_norm\": 0.4589308996088657,\n\ \ \"acc_norm_stderr\": 0.012727084826799798\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6838235294117647,\n \"acc_stderr\": 0.028245687391462923,\n\ \ \"acc_norm\": 0.6838235294117647,\n \"acc_norm_stderr\": 0.028245687391462923\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6715686274509803,\n \"acc_stderr\": 0.018999707383162673,\n \ \ \"acc_norm\": 0.6715686274509803,\n \"acc_norm_stderr\": 0.018999707383162673\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6818181818181818,\n\ \ \"acc_stderr\": 0.04461272175910509,\n \"acc_norm\": 0.6818181818181818,\n\ \ \"acc_norm_stderr\": 0.04461272175910509\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7673469387755102,\n \"acc_stderr\": 0.02704925791589618,\n\ \ \"acc_norm\": 0.7673469387755102,\n \"acc_norm_stderr\": 0.02704925791589618\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8656716417910447,\n\ \ 
\"acc_stderr\": 0.02411267824090083,\n \"acc_norm\": 0.8656716417910447,\n\ \ \"acc_norm_stderr\": 0.02411267824090083\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.82,\n \"acc_stderr\": 0.038612291966536934,\n \ \ \"acc_norm\": 0.82,\n \"acc_norm_stderr\": 0.038612291966536934\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.536144578313253,\n\ \ \"acc_stderr\": 0.03882310850890594,\n \"acc_norm\": 0.536144578313253,\n\ \ \"acc_norm_stderr\": 0.03882310850890594\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8245614035087719,\n \"acc_stderr\": 0.02917088550072767,\n\ \ \"acc_norm\": 0.8245614035087719,\n \"acc_norm_stderr\": 0.02917088550072767\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.5055079559363526,\n\ \ \"mc1_stderr\": 0.01750243899045107,\n \"mc2\": 0.6576763693172452,\n\ \ \"mc2_stderr\": 0.01500859930650817\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.8113654301499605,\n \"acc_stderr\": 0.010995172318019813\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.7126611068991661,\n \ \ \"acc_stderr\": 0.012464677060107081\n }\n}\n```" repo_url: https://huggingface.co/cloudyu/Mixtral_7Bx6_MoE_35B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|arc:challenge|25_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|arc:challenge|25_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-14T16-00-09.048254.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|gsm8k|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|gsm8k|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - 
'**/details_harness|gsm8k|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hellaswag|10_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hellaswag|10_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-12T00-20-46.590520.parquet' - 
'**/details_harness|hendrycksTest-formal_logic|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-12T00-20-46.590520.parquet' - 
'**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-12T00-20-46.590520.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - 
'**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T16-00-09.048254.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T16-00-09.048254.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T16-00-09.048254.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T16-00-09.048254.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T16-00-09.048254.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-14T16-00-09.048254.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T16-00-09.048254.parquet' 
- config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - 
'**/details_harness|hendrycksTest-college_computer_science|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - 
'**/details_harness|hendrycksTest-computer_security|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_12T00_20_46.590520 
path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - 
'**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-12T00-20-46.590520.parquet' 
- split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - 
'**/details_harness|hendrycksTest-human_aging|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T16-00-09.048254.parquet' - config_name: 
harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-management|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-management|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - 
'**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T16-00-09.048254.parquet' - config_name: 
harness_hendrycksTest_virology_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T16-00-09.048254.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|truthfulqa:mc|0_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|truthfulqa:mc|0_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-14T16-00-09.048254.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_12T00_20_46.590520 path: - '**/details_harness|winogrande|5_2024-01-12T00-20-46.590520.parquet' - split: 2024_01_14T16_00_09.048254 path: - '**/details_harness|winogrande|5_2024-01-14T16-00-09.048254.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-14T16-00-09.048254.parquet' - config_name: results data_files: - split: 2024_01_12T00_20_46.590520 path: - results_2024-01-12T00-20-46.590520.parquet - split: 2024_01_14T16_00_09.048254 path: - results_2024-01-14T16-00-09.048254.parquet - split: latest path: - results_2024-01-14T16-00-09.048254.parquet --- # Dataset Card for Evaluation run of cloudyu/Mixtral_7Bx6_MoE_35B <!-- Provide a quick summary of the dataset. 
-->

Dataset automatically created during the evaluation run of model [cloudyu/Mixtral_7Bx6_MoE_35B](https://huggingface.co/cloudyu/Mixtral_7Bx6_MoE_35B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, and each split is named after the timestamp of its run. The "latest" split always points to the most recent results.

An additional configuration, "results", stores all the aggregated results of the run. It is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

To load the details from a run, you can for instance do the following:
```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_cloudyu__Mixtral_7Bx6_MoE_35B",
	"harness_winogrande_5",
	split="latest")
```

## Latest results

These are the [latest results from run 2024-01-14T16:00:09.048254](https://huggingface.co/datasets/open-llm-leaderboard/details_cloudyu__Mixtral_7Bx6_MoE_35B/blob/main/results_2024-01-14T16-00-09.048254.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each of them in the "results" configuration and in the "latest" split of each eval):

```python
{ "all": { "acc": 0.6538307305259115, "acc_stderr": 0.03206532838135927, "acc_norm": 0.6536540314559122, "acc_norm_stderr": 0.03272839976259325, "mc1": 0.5055079559363526, "mc1_stderr": 0.01750243899045107, "mc2": 0.6576763693172452, "mc2_stderr": 0.01500859930650817 }, "harness|arc:challenge|25": { "acc": 0.674061433447099, "acc_stderr": 0.013697432466693246, "acc_norm": 0.6996587030716723, "acc_norm_stderr": 0.013395909309957005 }, "harness|hellaswag|10": { "acc": 0.6878111929894444, "acc_stderr": 0.0046243936909669, "acc_norm": 0.8681537542322246, "acc_norm_stderr": 0.0033763209559167064 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6444444444444445, "acc_stderr": 0.04135176749720385, "acc_norm": 0.6444444444444445, "acc_norm_stderr": 0.04135176749720385 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.7105263157894737, "acc_stderr": 0.03690677986137283, "acc_norm": 0.7105263157894737, "acc_norm_stderr": 0.03690677986137283 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.65, "acc_stderr": 0.0479372485441102, "acc_norm": 0.65, "acc_norm_stderr": 0.0479372485441102 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7132075471698113, "acc_stderr": 0.02783491252754406, "acc_norm": 0.7132075471698113, "acc_norm_stderr": 0.02783491252754406 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7638888888888888, "acc_stderr": 0.03551446610810826, "acc_norm": 0.7638888888888888, "acc_norm_stderr": 0.03551446610810826 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.51, "acc_stderr": 0.05024183937956912, "acc_norm": 0.51, "acc_norm_stderr": 0.05024183937956912 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.57, "acc_stderr": 0.04975698519562428, "acc_norm": 0.57, "acc_norm_stderr":
        0.04975698519562428
    },
    "harness|hendrycksTest-college_mathematics|5": {
        "acc": 0.3,
        "acc_stderr": 0.046056618647183814,
        "acc_norm": 0.3,
        "acc_norm_stderr": 0.046056618647183814
    },
    "harness|hendrycksTest-college_medicine|5": {
        "acc": 0.6647398843930635,
        "acc_stderr": 0.03599586301247077,
        "acc_norm": 0.6647398843930635,
        "acc_norm_stderr": 0.03599586301247077
    },
    "harness|hendrycksTest-college_physics|5": {
        "acc": 0.43137254901960786,
        "acc_stderr": 0.04928099597287533,
        "acc_norm": 0.43137254901960786,
        "acc_norm_stderr": 0.04928099597287533
    },
    "harness|hendrycksTest-computer_security|5": {
        "acc": 0.76,
        "acc_stderr": 0.04292346959909282,
        "acc_norm": 0.76,
        "acc_norm_stderr": 0.04292346959909282
    },
    "harness|hendrycksTest-conceptual_physics|5": {
        "acc": 0.5914893617021276,
        "acc_stderr": 0.032134180267015755,
        "acc_norm": 0.5914893617021276,
        "acc_norm_stderr": 0.032134180267015755
    },
    "harness|hendrycksTest-econometrics|5": {
        "acc": 0.5,
        "acc_stderr": 0.047036043419179864,
        "acc_norm": 0.5,
        "acc_norm_stderr": 0.047036043419179864
    },
    "harness|hendrycksTest-electrical_engineering|5": {
        "acc": 0.5724137931034483,
        "acc_stderr": 0.04122737111370333,
        "acc_norm": 0.5724137931034483,
        "acc_norm_stderr": 0.04122737111370333
    },
    "harness|hendrycksTest-elementary_mathematics|5": {
        "acc": 0.4470899470899471,
        "acc_stderr": 0.025606723995777025,
        "acc_norm": 0.4470899470899471,
        "acc_norm_stderr": 0.025606723995777025
    },
    "harness|hendrycksTest-formal_logic|5": {
        "acc": 0.4523809523809524,
        "acc_stderr": 0.044518079590553275,
        "acc_norm": 0.4523809523809524,
        "acc_norm_stderr": 0.044518079590553275
    },
    "harness|hendrycksTest-global_facts|5": {
        "acc": 0.36,
        "acc_stderr": 0.048241815132442176,
        "acc_norm": 0.36,
        "acc_norm_stderr": 0.048241815132442176
    },
    "harness|hendrycksTest-high_school_biology|5": {
        "acc": 0.7774193548387097,
        "acc_stderr": 0.023664216671642518,
        "acc_norm": 0.7774193548387097,
        "acc_norm_stderr": 0.023664216671642518
    },
    "harness|hendrycksTest-high_school_chemistry|5": {
        "acc": 0.4876847290640394,
        "acc_stderr": 0.035169204442208966,
        "acc_norm": 0.4876847290640394,
        "acc_norm_stderr": 0.035169204442208966
    },
    "harness|hendrycksTest-high_school_computer_science|5": {
        "acc": 0.7,
        "acc_stderr": 0.046056618647183814,
        "acc_norm": 0.7,
        "acc_norm_stderr": 0.046056618647183814
    },
    "harness|hendrycksTest-high_school_european_history|5": {
        "acc": 0.7818181818181819,
        "acc_stderr": 0.03225078108306289,
        "acc_norm": 0.7818181818181819,
        "acc_norm_stderr": 0.03225078108306289
    },
    "harness|hendrycksTest-high_school_geography|5": {
        "acc": 0.7929292929292929,
        "acc_stderr": 0.028869778460267045,
        "acc_norm": 0.7929292929292929,
        "acc_norm_stderr": 0.028869778460267045
    },
    "harness|hendrycksTest-high_school_government_and_politics|5": {
        "acc": 0.9067357512953368,
        "acc_stderr": 0.02098685459328973,
        "acc_norm": 0.9067357512953368,
        "acc_norm_stderr": 0.02098685459328973
    },
    "harness|hendrycksTest-high_school_macroeconomics|5": {
        "acc": 0.6615384615384615,
        "acc_stderr": 0.023991500500313036,
        "acc_norm": 0.6615384615384615,
        "acc_norm_stderr": 0.023991500500313036
    },
    "harness|hendrycksTest-high_school_mathematics|5": {
        "acc": 0.3296296296296296,
        "acc_stderr": 0.028661201116524565,
        "acc_norm": 0.3296296296296296,
        "acc_norm_stderr": 0.028661201116524565
    },
    "harness|hendrycksTest-high_school_microeconomics|5": {
        "acc": 0.6848739495798319,
        "acc_stderr": 0.030176808288974337,
        "acc_norm": 0.6848739495798319,
        "acc_norm_stderr": 0.030176808288974337
    },
    "harness|hendrycksTest-high_school_physics|5": {
        "acc": 0.33112582781456956,
        "acc_stderr": 0.038425817186598696,
        "acc_norm": 0.33112582781456956,
        "acc_norm_stderr": 0.038425817186598696
    },
    "harness|hendrycksTest-high_school_psychology|5": {
        "acc": 0.8513761467889909,
        "acc_stderr": 0.015251253773660831,
        "acc_norm": 0.8513761467889909,
        "acc_norm_stderr": 0.015251253773660831
    },
    "harness|hendrycksTest-high_school_statistics|5": {
        "acc": 0.5277777777777778,
        "acc_stderr": 0.0340470532865388,
        "acc_norm": 0.5277777777777778,
        "acc_norm_stderr": 0.0340470532865388
    },
    "harness|hendrycksTest-high_school_us_history|5": {
        "acc": 0.803921568627451,
        "acc_stderr": 0.027865942286639318,
        "acc_norm": 0.803921568627451,
        "acc_norm_stderr": 0.027865942286639318
    },
    "harness|hendrycksTest-high_school_world_history|5": {
        "acc": 0.8016877637130801,
        "acc_stderr": 0.02595502084162113,
        "acc_norm": 0.8016877637130801,
        "acc_norm_stderr": 0.02595502084162113
    },
    "harness|hendrycksTest-human_aging|5": {
        "acc": 0.6905829596412556,
        "acc_stderr": 0.03102441174057221,
        "acc_norm": 0.6905829596412556,
        "acc_norm_stderr": 0.03102441174057221
    },
    "harness|hendrycksTest-human_sexuality|5": {
        "acc": 0.7786259541984732,
        "acc_stderr": 0.03641297081313729,
        "acc_norm": 0.7786259541984732,
        "acc_norm_stderr": 0.03641297081313729
    },
    "harness|hendrycksTest-international_law|5": {
        "acc": 0.7851239669421488,
        "acc_stderr": 0.037494924487096966,
        "acc_norm": 0.7851239669421488,
        "acc_norm_stderr": 0.037494924487096966
    },
    "harness|hendrycksTest-jurisprudence|5": {
        "acc": 0.7685185185185185,
        "acc_stderr": 0.04077494709252626,
        "acc_norm": 0.7685185185185185,
        "acc_norm_stderr": 0.04077494709252626
    },
    "harness|hendrycksTest-logical_fallacies|5": {
        "acc": 0.7730061349693251,
        "acc_stderr": 0.03291099578615769,
        "acc_norm": 0.7730061349693251,
        "acc_norm_stderr": 0.03291099578615769
    },
    "harness|hendrycksTest-machine_learning|5": {
        "acc": 0.4642857142857143,
        "acc_stderr": 0.04733667890053756,
        "acc_norm": 0.4642857142857143,
        "acc_norm_stderr": 0.04733667890053756
    },
    "harness|hendrycksTest-management|5": {
        "acc": 0.7766990291262136,
        "acc_stderr": 0.04123553189891431,
        "acc_norm": 0.7766990291262136,
        "acc_norm_stderr": 0.04123553189891431
    },
    "harness|hendrycksTest-marketing|5": {
        "acc": 0.8974358974358975,
        "acc_stderr": 0.01987565502786744,
        "acc_norm": 0.8974358974358975,
        "acc_norm_stderr": 0.01987565502786744
    },
    "harness|hendrycksTest-medical_genetics|5": {
        "acc": 0.71,
        "acc_stderr": 0.045604802157206845,
        "acc_norm": 0.71,
        "acc_norm_stderr": 0.045604802157206845
    },
    "harness|hendrycksTest-miscellaneous|5": {
        "acc": 0.8301404853128991,
        "acc_stderr": 0.013428186370608306,
        "acc_norm": 0.8301404853128991,
        "acc_norm_stderr": 0.013428186370608306
    },
    "harness|hendrycksTest-moral_disputes|5": {
        "acc": 0.7485549132947977,
        "acc_stderr": 0.02335736578587403,
        "acc_norm": 0.7485549132947977,
        "acc_norm_stderr": 0.02335736578587403
    },
    "harness|hendrycksTest-moral_scenarios|5": {
        "acc": 0.43687150837988825,
        "acc_stderr": 0.016588680864530626,
        "acc_norm": 0.43687150837988825,
        "acc_norm_stderr": 0.016588680864530626
    },
    "harness|hendrycksTest-nutrition|5": {
        "acc": 0.7352941176470589,
        "acc_stderr": 0.02526169121972948,
        "acc_norm": 0.7352941176470589,
        "acc_norm_stderr": 0.02526169121972948
    },
    "harness|hendrycksTest-philosophy|5": {
        "acc": 0.7009646302250804,
        "acc_stderr": 0.02600330111788514,
        "acc_norm": 0.7009646302250804,
        "acc_norm_stderr": 0.02600330111788514
    },
    "harness|hendrycksTest-prehistory|5": {
        "acc": 0.7407407407407407,
        "acc_stderr": 0.02438366553103545,
        "acc_norm": 0.7407407407407407,
        "acc_norm_stderr": 0.02438366553103545
    },
    "harness|hendrycksTest-professional_accounting|5": {
        "acc": 0.46099290780141844,
        "acc_stderr": 0.029736592526424438,
        "acc_norm": 0.46099290780141844,
        "acc_norm_stderr": 0.029736592526424438
    },
    "harness|hendrycksTest-professional_law|5": {
        "acc": 0.4589308996088657,
        "acc_stderr": 0.012727084826799798,
        "acc_norm": 0.4589308996088657,
        "acc_norm_stderr": 0.012727084826799798
    },
    "harness|hendrycksTest-professional_medicine|5": {
        "acc": 0.6838235294117647,
        "acc_stderr": 0.028245687391462923,
        "acc_norm": 0.6838235294117647,
        "acc_norm_stderr": 0.028245687391462923
    },
    "harness|hendrycksTest-professional_psychology|5": {
        "acc": 0.6715686274509803,
        "acc_stderr": 0.018999707383162673,
        "acc_norm": 0.6715686274509803,
        "acc_norm_stderr": 0.018999707383162673
    },
    "harness|hendrycksTest-public_relations|5": {
        "acc": 0.6818181818181818,
        "acc_stderr": 0.04461272175910509,
        "acc_norm": 0.6818181818181818,
        "acc_norm_stderr": 0.04461272175910509
    },
    "harness|hendrycksTest-security_studies|5": {
        "acc": 0.7673469387755102,
        "acc_stderr": 0.02704925791589618,
        "acc_norm": 0.7673469387755102,
        "acc_norm_stderr": 0.02704925791589618
    },
    "harness|hendrycksTest-sociology|5": {
        "acc": 0.8656716417910447,
        "acc_stderr": 0.02411267824090083,
        "acc_norm": 0.8656716417910447,
        "acc_norm_stderr": 0.02411267824090083
    },
    "harness|hendrycksTest-us_foreign_policy|5": {
        "acc": 0.82,
        "acc_stderr": 0.038612291966536934,
        "acc_norm": 0.82,
        "acc_norm_stderr": 0.038612291966536934
    },
    "harness|hendrycksTest-virology|5": {
        "acc": 0.536144578313253,
        "acc_stderr": 0.03882310850890594,
        "acc_norm": 0.536144578313253,
        "acc_norm_stderr": 0.03882310850890594
    },
    "harness|hendrycksTest-world_religions|5": {
        "acc": 0.8245614035087719,
        "acc_stderr": 0.02917088550072767,
        "acc_norm": 0.8245614035087719,
        "acc_norm_stderr": 0.02917088550072767
    },
    "harness|truthfulqa:mc|0": {
        "mc1": 0.5055079559363526,
        "mc1_stderr": 0.01750243899045107,
        "mc2": 0.6576763693172452,
        "mc2_stderr": 0.01500859930650817
    },
    "harness|winogrande|5": {
        "acc": 0.8113654301499605,
        "acc_stderr": 0.010995172318019813
    },
    "harness|gsm8k|5": {
        "acc": 0.7126611068991661,
        "acc_stderr": 0.012464677060107081
    }
}
```

## Dataset Details

### Dataset Description

<!-- Provide a longer summary of what this dataset is. -->

- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]

### Dataset Sources [optional]

<!-- Provide the basic links for the dataset. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the dataset is intended to be used. -->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard-old
Summary of Original Information

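Dataset Overview

This dataset was created automatically during the evaluation run of the model cloudyu/Mixtral_7Bx6_MoE_35B on the Open LLM Leaderboard. It consists of 63 configurations, each corresponding to one evaluated task.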

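Dataset Structure

The dataset was created from 2 runs. Each run appears as a specific split in every configuration, and the split is named after the run's timestamp. The "train" split always points to the latest results.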
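The timestamp-based split naming can be sketched in a few lines of Python. The mapping from a run timestamp to a split name is inferred from the configuration listing on this page (for example, the run 2024-01-12T00:20:46.590520 appears as split `2024_01_12T00_20_46.590520`); it is an assumption, not a documented rule, and `split_name` and `latest_split` are hypothetical helpers.

```python
from datetime import datetime


def split_name(run_timestamp: str) -> str:
    """Turn an ISO-8601 run timestamp into a split name.

    ASSUMPTION: inferred from the listing, '-' and ':' both become '_'.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def latest_split(split_names: list[str]) -> str:
    """Pick the most recent split by parsing each name back into a datetime."""
    return max(split_names, key=lambda s: datetime.strptime(s, "%Y_%m_%dT%H_%M_%S.%f"))


# The two run timestamps that appear in this card.
runs = ["2024-01-12T00:20:46.590520", "2024-01-14T16:00:09.048254"]
splits = [split_name(r) for r in runs]
print(splits)
print(latest_split(splits))
```

Parsing the names as datetimes, rather than comparing them as strings, keeps the choice correct even if a future naming scheme stops being zero-padded.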

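Additional Configuration

An additional configuration, "results", stores the aggregated results of all runs. It is used to compute and display the aggregated metrics on the Open LLM Leaderboard.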
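The sketch below shows one way to aggregate a results dictionary shaped like the excerpt under "Latest Results" into a single score. The choice of metric per benchmark is an assumption based on the v1 Open LLM Leaderboard convention (acc_norm for ARC and HellaSwag, mean acc over the hendrycksTest subtasks for MMLU, mc2 for TruthfulQA, acc for Winogrande and GSM8K). This card does not state these choices, and `leaderboard_average` is a hypothetical helper.

```python
from statistics import mean

MMLU_PREFIX = "harness|hendrycksTest-"


def leaderboard_average(results: dict) -> float:
    """Average six benchmark scores from a results dict.

    ASSUMPTION: metric choices follow the v1 Open LLM Leaderboard convention;
    they are not specified in this dataset card.
    """
    # MMLU is itself the mean accuracy over all hendrycksTest subtasks present.
    mmlu = mean(v["acc"] for k, v in results.items() if k.startswith(MMLU_PREFIX))
    scores = [
        results["harness|arc:challenge|25"]["acc_norm"],
        results["harness|hellaswag|10"]["acc_norm"],
        mmlu,
        results["harness|truthfulqa:mc|0"]["mc2"],
        results["harness|winogrande|5"]["acc"],
        results["harness|gsm8k|5"]["acc"],
    ]
    return mean(scores)


# Trimmed example: values are copied from this card, but only two MMLU subtasks
# are included, so the output is NOT the model's real leaderboard score.
sample = {
    "harness|arc:challenge|25": {"acc_norm": 0.6996587030716723},
    "harness|hellaswag|10": {"acc_norm": 0.8681537542322246},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.31},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.6444444444444445},
    "harness|truthfulqa:mc|0": {"mc2": 0.6576763693172452},
    "harness|winogrande|5": {"acc": 0.8113654301499605},
    "harness|gsm8k|5": {"acc": 0.7126611068991661},
}
print(round(leaderboard_average(sample), 4))
```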

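Data Loading Example

```python
from datasets import load_dataset

# Load the evaluation details for the Winogrande (5-shot) task.
# The configuration listing on this page names the most recent split "latest";
# use that name if split="train" is not available.
data = load_dataset(
    "open-llm-leaderboard/details_cloudyu__Mixtral_7Bx6_MoE_35B",
    "harness_winogrande_5",
    split="train",
)
```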

Latest Results

Below is an excerpt of the latest results (from the 2024-01-14T16:00:09.048254 run):

```python
{
    "all": {
        "acc": 0.6538307305259115,
        "acc_stderr": 0.03206532838135927,
        "acc_norm": 0.6536540314559122,
        "acc_norm_stderr": 0.03272839976259325,
        "mc1": 0.5055079559363526,
        "mc1_stderr": 0.01750243899045107,
        "mc2": 0.6576763693172452,
        "mc2_stderr": 0.01500859930650817
    },
    "harness|arc:challenge|25": {
        "acc": 0.674061433447099,
        "acc_stderr": 0.013697432466693246,
        "acc_norm": 0.6996587030716723,
        "acc_norm_stderr": 0.013395909309957005
    },
    "harness|hellaswag|10": {
        "acc": 0.6878111929894444,
        "acc_stderr": 0.0046243936909669,
        "acc_norm": 0.8681537542322246,
        "acc_norm_stderr": 0.0033763209559167064
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.31,
        "acc_stderr": 0.04648231987117316,
        "acc_norm": 0.31,
        "acc_norm_stderr": 0.04648231987117316
    },
    ...
}
```
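Each metric in the results comes with a `*_stderr` value, which can be turned into a rough confidence interval. The sketch uses a normal approximation (acc ± 1.96 · stderr), which is an assumption and not something the card prescribes. It also checks that the reported stderr for abstract_algebra is consistent with the sample standard error sqrt(p(1-p)/(n-1)). That check assumes the subtask has n = 100 test questions, a figure this card does not state. `normal_ci` and `sample_stderr` are hypothetical helpers.

```python
import math


def normal_ci(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval acc +/- z * stderr (normal approximation, an assumption)."""
    return acc - z * stderr, acc + z * stderr


def sample_stderr(acc: float, n: int) -> float:
    """Standard error of a mean of n 0/1 outcomes, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1 - acc) / (n - 1))


# Values copied from the abstract_algebra entry in the results excerpt.
acc, stderr = 0.31, 0.04648231987117316
low, high = normal_ci(acc, stderr)
print(f"abstract_algebra acc={acc:.2f}, approx. 95% CI [{low:.3f}, {high:.3f}]")

# ASSUMPTION: abstract_algebra has n = 100 test questions; with that n the
# (n - 1) formula reproduces the reported stderr.
print(math.isclose(sample_stderr(acc, 100), stderr, rel_tol=1e-6))
```

Intervals this wide, roughly ±0.09 for a 100-question subtask, are a reminder that small per-subject differences in the results are often within noise.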

Configuration Details

  • config_name: harness_arc_challenge_25

    • data_files:
      • split: 2024_01_12T00_20_46.590520
        • path: **/details_harness|arc:challenge|25_2024-01-12T00-20-46.590520.parquet
      • split: 2024_01_14T16_00_09.048254
        • path: **/details_harness|arc:challenge|25_2024-01-14T16-00-09.048254.parquet
      • split: latest
        • path: **/details_harness|arc:challenge|25_2024-01-14T16-00-09.048254.parquet
  • config_name: harness_gsm8k_5

    • data_files:
      • split: 2024_01_12T00_20_46.590520
        • path: **/details_harness|gsm8k|5_2024-01-12T00-20-46.590520.parquet
      • split: 2024_01_14T16_00_09.048254
        • path: **/details_harness|gsm8k|5_2024-01-14T16-00-09.048254.parquet
      • split: latest
        • path: **/details_harness|gsm8k|5_2024-01-14T16-00-09.048254.parquet
  • config_name: harness_hellaswag_10

    • data_files:
      • split: 2024_01_12T00_20_46.590520
        • path: **/details_harness|hellaswag|10_2024-01-12T00-20-46.590520.parquet
      • split: 2024_01_14T16_00_09.048254
        • path: **/details_harness|hellaswag|10_2024-01-14T16-00-09.048254.parquet
      • split: latest
        • path: **/details_harness|hellaswag|10_2024-01-14T16-00-09.048254.parquet
  • config_name: harness_hendrycksTest_5

    • data_files:
      • split: 2024_01_12T00_20_46.590520
        • path:
          • **/details_harness|hendrycksTest-abstract_algebra|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-anatomy|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-astronomy|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-business_ethics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-college_biology|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-college_chemistry|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-college_computer_science|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-college_mathematics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-college_medicine|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-college_physics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-computer_security|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-conceptual_physics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-econometrics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-electrical_engineering|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-formal_logic|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-global_facts|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_biology|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_european_history|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_geography|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_physics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_psychology|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_statistics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_us_history|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-high_school_world_history|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-human_aging|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-human_sexuality|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-international_law|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-jurisprudence|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-logical_fallacies|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-machine_learning|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-management|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-marketing|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-medical_genetics|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-miscellaneous|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest-moral_disputes|5_2024-01-12T00-20-46.590520.parquet
          • **/details_harness|hendrycksTest
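Each `path` entry in the configuration listing follows a visible pattern: the task name, then the run timestamp with colons replaced by dashes, then `.parquet`. The sketch below rebuilds such a pattern. The layout is inferred from the listing and is not documented, and `file_suffix` and `data_files_pattern` are hypothetical helpers. `fnmatchcase` is used only to show which files a pattern selects. Its `*` matches any characters including `/`, which is close to recursive `**` globbing but not identical.

```python
from fnmatch import fnmatchcase


def file_suffix(run_timestamp: str) -> str:
    """Turn '2024-01-14T16:00:09.048254' into '2024-01-14T16-00-09.048254'."""
    return run_timestamp.replace(":", "-")


def data_files_pattern(task: str, run_timestamp: str) -> str:
    """Rebuild a data_files pattern such as
    '**/details_harness|gsm8k|5_2024-01-14T16-00-09.048254.parquet'.

    ASSUMPTION: the layout is inferred from the listing, not from documentation.
    """
    return f"**/details_harness|{task}_{file_suffix(run_timestamp)}.parquet"


pattern = data_files_pattern("gsm8k|5", "2024-01-14T16:00:09.048254")
print(pattern)

# Hypothetical file paths, used only to show which files the pattern selects.
candidates = [
    "run_a/details_harness|gsm8k|5_2024-01-14T16-00-09.048254.parquet",
    "run_b/details_harness|gsm8k|5_2024-01-12T00-20-46.590520.parquet",
]
for path in candidates:
    print(path, fnmatchcase(path, pattern))
```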
Collected Summary
Dataset Introduction
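Construction
This dataset is a by-product of the Open LLM Leaderboard, and it was built through an automated, standardized process. It was generated automatically during the evaluation runs of the model cloudyu/Mixtral_7Bx6_MoE_35B and covers 63 evaluation-task configurations. Each evaluation run is recorded as a separate split named after its timestamp, and the "train" split always points to the latest results. A separate configuration named "results" aggregates the metrics of all runs and supplies the data behind the leaderboard's summary scores.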
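Features
The dataset's core feature is a detailed, structured record of each evaluation. It preserves the model's detailed outputs and performance metrics across a range of benchmarks, covering commonsense reasoning, knowledge question answering, and professional subject areas. The data are organized by configuration and by split. Users can therefore trace the full history of evaluation runs while still retrieving the latest results easily. This design supports traceability and version management, and it gives well-structured material for in-depth analysis and comparison of model capabilities.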
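Usage
To analyze model performance with this dataset, use the Hugging Face datasets library. Specify the dataset name, a task configuration (such as "harness_winogrande_5"), and the desired split (such as "train" or a specific timestamp) to load the corresponding evaluation details. The loaded data contain the model's predictions on that task together with metrics such as accuracy, in a structured form ready for statistical analysis or visualization. Researchers who want to reproduce results or compare models and tasks can work directly from this interface.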
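Background and Challenges
Background Overview
As large language models (LLMs) develop rapidly, evaluating their overall capabilities has become central to measuring progress. The Open LLM Leaderboard, launched on the Hugging Face platform, addresses this need by evaluating and ranking open-source models on standardized benchmarks. The dataset 'open-llm-leaderboard-old/details_cloudyu__Mixtral_7Bx6_MoE_35B' is part of that leaderboard. It records the details of the two January 2024 evaluation runs (2024-01-12 and 2024-01-14) of the Mixtral 7Bx6 MoE 35B model, developed by the researcher cloudyu. The dataset is maintained by the Hugging Face team. Its central question is how to measure the performance of a mixture-of-experts (MoE) model objectively and comprehensively on diverse tasks such as commonsense reasoning, professional subject knowledge, and mathematical problem solving. It provides data for model comparison and optimization, and it contributes to the transparency and collaborative development of the open-source LLM ecosystem.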
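Current Challenges
The domain challenge the dataset addresses is breadth. LLM evaluation must cover a very wide range of knowledge areas and cognitive abilities, from basic common sense to specialized subjects and from language understanding to mathematical reasoning, and no single benchmark reflects a model's true capability. The construction challenges are technical. Large volumes of detailed data from heterogeneous evaluation tasks such as ARC, HellaSwag, MMLU (HendrycksTest), and GSM8K must be integrated automatically. The results of every run must be recorded accurately, versioned, and made easy for the community to access and reproduce. In addition, as models iterate and evaluation standards evolve, keeping the dataset structure flexible enough to accommodate new tasks or changed evaluation protocols is an ongoing engineering challenge.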
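Common Use Cases
Typical Use Case
This dataset is the Open LLM Leaderboard's record of evaluation results, and its typical use is to give researchers detailed performance data for the Mixtral 7Bx6 MoE 35B model across diverse benchmarks. It covers established tasks such as the ARC Challenge, HellaSwag, MMLU, and TruthfulQA, among others. Researchers can therefore analyze the model's commonsense reasoning, language understanding, professional knowledge, and truthfulness. The data also serve as a basis for comparing the model with other models and for tracking it across evaluation runs.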
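Per-subject analysis often starts by ranking the MMLU (hendrycksTest) subtasks in a results dictionary shaped like the "Latest Results" excerpt. The sketch below does that. `rank_subtasks` is a hypothetical helper, and the sample holds only a handful of entries copied from this card, not the full set.

```python
PREFIX = "harness|hendrycksTest-"


def rank_subtasks(results: dict, prefix: str = PREFIX) -> list[tuple[str, float]]:
    """Return (subject, acc) pairs for MMLU subtasks, strongest first."""
    pairs = [
        # 'harness|hendrycksTest-marketing|5' -> 'marketing'
        (key[len(prefix):].split("|")[0], metrics["acc"])
        for key, metrics in results.items()
        if key.startswith(prefix)
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# A few entries copied from this card; the GSM8K entry is included only to
# show that non-MMLU keys are filtered out.
results = {
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.31},
    "harness|hendrycksTest-college_mathematics|5": {"acc": 0.3},
    "harness|hendrycksTest-high_school_mathematics|5": {"acc": 0.3296296296296296},
    "harness|hendrycksTest-high_school_government_and_politics|5": {"acc": 0.9067357512953368},
    "harness|hendrycksTest-marketing|5": {"acc": 0.8974358974358975},
    "harness|hendrycksTest-sociology|5": {"acc": 0.8656716417910447},
    "harness|gsm8k|5": {"acc": 0.7126611068991661},
}
for subject, acc in rank_subtasks(results):
    print(f"{subject:40s} {acc:.3f}")
```

Run on the full results, the same ranking shows which subject areas drive the MMLU average up or down.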
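Derived Work
Work derived from this dataset focuses mainly on analyzing model capabilities and on new evaluation methods. Researchers have used its fine-grained results to study the scaling behavior of mixture-of-experts models, knowledge generalization, and multi-task learning efficiency. The data have also motivated new evaluation approaches, for example combining several sub-task results into an overall robustness score. They have likewise encouraged supplementary benchmarks covering ethics and safety, broadening the LLM evaluation ecosystem.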
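Recent Research on the Dataset
Latest Research Directions
The evaluation data for the Mixtral 7Bx6 MoE 35B model point to open research directions in sparse mixture-of-experts architectures. Through the Open LLM Leaderboard's multi-dimensional benchmarks, the dataset shows the model's performance on complex tasks such as commonsense reasoning, professional subject knowledge, and mathematics. Current research focuses on using fine-grained results of this kind to improve expert routing in MoE models, especially where performance is uneven on advanced mathematical reasoning and specialized domain knowledge. For example, the MMLU mathematics subtasks score between 0.30 and 0.45, while GSM8K reaches about 0.71. As the community asks for more transparency and interpretability, detailed evaluation data of this kind are driving research on dynamic load-balancing algorithms and cross-domain knowledge transfer, and they provide empirical grounding for building more efficient and reliable sparse LLMs.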
The above content was collected and summarized by 遇见数据集.