
open-llm-leaderboard-old/details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s

Hugging Face | Updated: 2024-02-17 | Indexed: 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s
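
The dataset card further down names its per-task results with keys such as `harness|arc:challenge|25`, while the matching configurations are called `harness_arc_challenge_25` and the run's split is called `2024_02_17T00_25_52.922442`. The sketch below shows one way to derive those names. It is a minimal illustration, not the leaderboard's own code: the replacement rule is inferred from the names visible in the card, and `config_name`, `split_name`, and `load_details` are hypothetical helper names. The `repo_id` to pass is also an assumption, since the listing uses `open-llm-leaderboard-old/...` and the card uses `open-llm-leaderboard/...`.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name."""
    # Assumption inferred from the card: each run of non-alphanumeric
    # characters ('|', ':', '-', '_') becomes a single underscore.
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key)


def split_name(timestamp: str) -> str:
    """Map a run timestamp to the split name listed in the card's configs."""
    # Only '-' and ':' are replaced; the fractional-second '.' is kept.
    return timestamp.replace("-", "_").replace(":", "_")


def load_details(task_key: str, repo_id: str, split: str = "latest"):
    """Hypothetical helper: load one task's details (needs network access)."""
    # Imported lazily so the pure helpers above work without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name(task_key), split=split)


if __name__ == "__main__":
    print(config_name("harness|arc:challenge|25"))   # harness_arc_challenge_25
    print(split_name("2024-02-17T00:25:52.922442"))  # 2024_02_17T00_25_52.922442
```

Calling `load_details("harness|gsm8k|5", repo_id=...)` would then request the `harness_gsm8k_5` configuration; which repository id currently resolves should be checked against the download link above.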
Resource description:
--- pretty_name: Evaluation run of fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s](https://huggingface.co/fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-02-17T00:25:52.922442](https://huggingface.co/datasets/open-llm-leaderboard/details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s/blob/main/results_2024-02-17T00-25-52.922442.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6135127589001832,\n\ \ \"acc_stderr\": 0.032796760940438034,\n \"acc_norm\": 0.6157670560754505,\n\ \ \"acc_norm_stderr\": 0.033451817306662635,\n \"mc1\": 0.37454100367197063,\n\ \ \"mc1_stderr\": 0.016943535128405327,\n \"mc2\": 0.5477195184186756,\n\ \ \"mc2_stderr\": 0.015358664393160576\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.5930034129692833,\n \"acc_stderr\": 0.01435639941800912,\n\ \ \"acc_norm\": 0.6407849829351536,\n \"acc_norm_stderr\": 0.014020224155839162\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6493726349332802,\n\ \ \"acc_stderr\": 0.00476191251170751,\n \"acc_norm\": 0.841167098187612,\n\ \ \"acc_norm_stderr\": 0.003647731723938848\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6074074074074074,\n\ \ \"acc_stderr\": 0.0421850621536888,\n \"acc_norm\": 0.6074074074074074,\n\ \ \"acc_norm_stderr\": 0.0421850621536888\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.6447368421052632,\n \"acc_stderr\": 0.03894734487013316,\n\ \ \"acc_norm\": 0.6447368421052632,\n \"acc_norm_stderr\": 0.03894734487013316\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.6,\n\ \ \"acc_stderr\": 0.04923659639173309,\n \"acc_norm\": 0.6,\n \ \ \"acc_norm_stderr\": 0.04923659639173309\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.690566037735849,\n \"acc_stderr\": 0.028450154794118637,\n\ \ \"acc_norm\": 0.690566037735849,\n \"acc_norm_stderr\": 0.028450154794118637\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7569444444444444,\n\ \ \"acc_stderr\": 0.03586879280080341,\n \"acc_norm\": 0.7569444444444444,\n\ \ \"acc_norm_stderr\": 0.03586879280080341\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.45,\n \"acc_stderr\": 0.05,\n \"acc_norm\"\ : 0.45,\n \"acc_norm_stderr\": 0.05\n },\n \"harness|hendrycksTest-college_computer_science|5\"\ : {\n \"acc\": 0.48,\n \"acc_stderr\": 0.050211673156867795,\n \ \ \"acc_norm\": 0.48,\n \"acc_norm_stderr\": 0.050211673156867795\n \ \ },\n \"harness|hendrycksTest-college_mathematics|5\": {\n \"acc\"\ : 0.24,\n \"acc_stderr\": 0.04292346959909282,\n \"acc_norm\": 0.24,\n\ \ \"acc_norm_stderr\": 0.04292346959909282\n },\n \"harness|hendrycksTest-college_medicine|5\"\ : {\n \"acc\": 0.630057803468208,\n \"acc_stderr\": 0.0368122963339432,\n\ \ \"acc_norm\": 0.630057803468208,\n \"acc_norm_stderr\": 0.0368122963339432\n\ \ },\n \"harness|hendrycksTest-college_physics|5\": {\n \"acc\": 0.3333333333333333,\n\ \ \"acc_stderr\": 0.04690650298201942,\n \"acc_norm\": 0.3333333333333333,\n\ \ \"acc_norm_stderr\": 0.04690650298201942\n },\n \"harness|hendrycksTest-computer_security|5\"\ : {\n \"acc\": 0.73,\n \"acc_stderr\": 0.044619604333847394,\n \ \ \"acc_norm\": 0.73,\n \"acc_norm_stderr\": 0.044619604333847394\n \ \ },\n \"harness|hendrycksTest-conceptual_physics|5\": {\n \"acc\":\ \ 0.5361702127659574,\n \"acc_stderr\": 0.03260038511835771,\n \"\ acc_norm\": 0.5361702127659574,\n \"acc_norm_stderr\": 0.03260038511835771\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.4298245614035088,\n\ \ \"acc_stderr\": 0.046570472605949625,\n \"acc_norm\": 0.4298245614035088,\n\ \ \"acc_norm_stderr\": 0.046570472605949625\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.503448275862069,\n \"acc_stderr\": 0.04166567577101579,\n\ \ \"acc_norm\": 0.503448275862069,\n \"acc_norm_stderr\": 0.04166567577101579\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.3968253968253968,\n \"acc_stderr\": 0.025197101074246483,\n \"\ acc_norm\": 0.3968253968253968,\n \"acc_norm_stderr\": 0.025197101074246483\n\ \ },\n 
\"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4365079365079365,\n\ \ \"acc_stderr\": 0.04435932892851466,\n \"acc_norm\": 0.4365079365079365,\n\ \ \"acc_norm_stderr\": 0.04435932892851466\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7354838709677419,\n\ \ \"acc_stderr\": 0.02509189237885928,\n \"acc_norm\": 0.7354838709677419,\n\ \ \"acc_norm_stderr\": 0.02509189237885928\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.4630541871921182,\n \"acc_stderr\": 0.035083705204426656,\n\ \ \"acc_norm\": 0.4630541871921182,\n \"acc_norm_stderr\": 0.035083705204426656\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.6,\n \"acc_stderr\": 0.049236596391733084,\n \"acc_norm\"\ : 0.6,\n \"acc_norm_stderr\": 0.049236596391733084\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7454545454545455,\n \"acc_stderr\": 0.03401506715249039,\n\ \ \"acc_norm\": 0.7454545454545455,\n \"acc_norm_stderr\": 0.03401506715249039\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7171717171717171,\n \"acc_stderr\": 0.03208779558786751,\n \"\ acc_norm\": 0.7171717171717171,\n \"acc_norm_stderr\": 0.03208779558786751\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8549222797927462,\n \"acc_stderr\": 0.025416343096306422,\n\ \ \"acc_norm\": 0.8549222797927462,\n \"acc_norm_stderr\": 0.025416343096306422\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6076923076923076,\n \"acc_stderr\": 0.024756000382130952,\n\ \ \"acc_norm\": 0.6076923076923076,\n \"acc_norm_stderr\": 0.024756000382130952\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 
0.34074074074074073,\n \"acc_stderr\": 0.028897748741131147,\n \ \ \"acc_norm\": 0.34074074074074073,\n \"acc_norm_stderr\": 0.028897748741131147\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6428571428571429,\n \"acc_stderr\": 0.031124619309328177,\n\ \ \"acc_norm\": 0.6428571428571429,\n \"acc_norm_stderr\": 0.031124619309328177\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.304635761589404,\n \"acc_stderr\": 0.03757949922943343,\n \"acc_norm\"\ : 0.304635761589404,\n \"acc_norm_stderr\": 0.03757949922943343\n },\n\ \ \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\": 0.8165137614678899,\n\ \ \"acc_stderr\": 0.0165952597103993,\n \"acc_norm\": 0.8165137614678899,\n\ \ \"acc_norm_stderr\": 0.0165952597103993\n },\n \"harness|hendrycksTest-high_school_statistics|5\"\ : {\n \"acc\": 0.5046296296296297,\n \"acc_stderr\": 0.03409825519163572,\n\ \ \"acc_norm\": 0.5046296296296297,\n \"acc_norm_stderr\": 0.03409825519163572\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.7647058823529411,\n \"acc_stderr\": 0.029771775228145635,\n \"\ acc_norm\": 0.7647058823529411,\n \"acc_norm_stderr\": 0.029771775228145635\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7805907172995781,\n \"acc_stderr\": 0.026939106581553945,\n \ \ \"acc_norm\": 0.7805907172995781,\n \"acc_norm_stderr\": 0.026939106581553945\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6367713004484304,\n\ \ \"acc_stderr\": 0.032277904428505,\n \"acc_norm\": 0.6367713004484304,\n\ \ \"acc_norm_stderr\": 0.032277904428505\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7404580152671756,\n \"acc_stderr\": 0.03844876139785271,\n\ \ \"acc_norm\": 0.7404580152671756,\n \"acc_norm_stderr\": 0.03844876139785271\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7272727272727273,\n \"acc_stderr\": 
0.04065578140908705,\n \"\ acc_norm\": 0.7272727272727273,\n \"acc_norm_stderr\": 0.04065578140908705\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7685185185185185,\n\ \ \"acc_stderr\": 0.04077494709252627,\n \"acc_norm\": 0.7685185185185185,\n\ \ \"acc_norm_stderr\": 0.04077494709252627\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7055214723926381,\n \"acc_stderr\": 0.03581165790474082,\n\ \ \"acc_norm\": 0.7055214723926381,\n \"acc_norm_stderr\": 0.03581165790474082\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5535714285714286,\n\ \ \"acc_stderr\": 0.047184714852195865,\n \"acc_norm\": 0.5535714285714286,\n\ \ \"acc_norm_stderr\": 0.047184714852195865\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.8252427184466019,\n \"acc_stderr\": 0.03760178006026622,\n\ \ \"acc_norm\": 0.8252427184466019,\n \"acc_norm_stderr\": 0.03760178006026622\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8504273504273504,\n\ \ \"acc_stderr\": 0.023365051491753715,\n \"acc_norm\": 0.8504273504273504,\n\ \ \"acc_norm_stderr\": 0.023365051491753715\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.67,\n \"acc_stderr\": 0.047258156262526094,\n \ \ \"acc_norm\": 0.67,\n \"acc_norm_stderr\": 0.047258156262526094\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8109833971902938,\n\ \ \"acc_stderr\": 0.014000791294407003,\n \"acc_norm\": 0.8109833971902938,\n\ \ \"acc_norm_stderr\": 0.014000791294407003\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.6994219653179191,\n \"acc_stderr\": 0.024685316867257796,\n\ \ \"acc_norm\": 0.6994219653179191,\n \"acc_norm_stderr\": 0.024685316867257796\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.3463687150837989,\n\ \ \"acc_stderr\": 0.015913546784020117,\n \"acc_norm\": 0.3463687150837989,\n\ \ \"acc_norm_stderr\": 0.015913546784020117\n },\n 
\"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.696078431372549,\n \"acc_stderr\": 0.026336613469046626,\n\ \ \"acc_norm\": 0.696078431372549,\n \"acc_norm_stderr\": 0.026336613469046626\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7106109324758842,\n\ \ \"acc_stderr\": 0.025755865922632945,\n \"acc_norm\": 0.7106109324758842,\n\ \ \"acc_norm_stderr\": 0.025755865922632945\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7222222222222222,\n \"acc_stderr\": 0.024922001168886335,\n\ \ \"acc_norm\": 0.7222222222222222,\n \"acc_norm_stderr\": 0.024922001168886335\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.425531914893617,\n \"acc_stderr\": 0.02949482760014437,\n \ \ \"acc_norm\": 0.425531914893617,\n \"acc_norm_stderr\": 0.02949482760014437\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4276401564537158,\n\ \ \"acc_stderr\": 0.012635799922765844,\n \"acc_norm\": 0.4276401564537158,\n\ \ \"acc_norm_stderr\": 0.012635799922765844\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6213235294117647,\n \"acc_stderr\": 0.02946513363977613,\n\ \ \"acc_norm\": 0.6213235294117647,\n \"acc_norm_stderr\": 0.02946513363977613\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6454248366013072,\n \"acc_stderr\": 0.019353360547553707,\n \ \ \"acc_norm\": 0.6454248366013072,\n \"acc_norm_stderr\": 0.019353360547553707\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6363636363636364,\n\ \ \"acc_stderr\": 0.046075820907199756,\n \"acc_norm\": 0.6363636363636364,\n\ \ \"acc_norm_stderr\": 0.046075820907199756\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7020408163265306,\n \"acc_stderr\": 0.02927956741106568,\n\ \ \"acc_norm\": 0.7020408163265306,\n \"acc_norm_stderr\": 0.02927956741106568\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8109452736318408,\n\ \ 
\"acc_stderr\": 0.02768691358801302,\n \"acc_norm\": 0.8109452736318408,\n\ \ \"acc_norm_stderr\": 0.02768691358801302\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.83,\n \"acc_stderr\": 0.0377525168068637,\n \ \ \"acc_norm\": 0.83,\n \"acc_norm_stderr\": 0.0377525168068637\n },\n\ \ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5301204819277109,\n\ \ \"acc_stderr\": 0.03885425420866767,\n \"acc_norm\": 0.5301204819277109,\n\ \ \"acc_norm_stderr\": 0.03885425420866767\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8011695906432749,\n \"acc_stderr\": 0.030611116557432528,\n\ \ \"acc_norm\": 0.8011695906432749,\n \"acc_norm_stderr\": 0.030611116557432528\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.37454100367197063,\n\ \ \"mc1_stderr\": 0.016943535128405327,\n \"mc2\": 0.5477195184186756,\n\ \ \"mc2_stderr\": 0.015358664393160576\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7695343330702447,\n \"acc_stderr\": 0.011835872164836676\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.5640636846095527,\n \ \ \"acc_stderr\": 0.013658968058849159\n }\n}\n```" repo_url: https://huggingface.co/fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|arc:challenge|25_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-02-17T00-25-52.922442.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|gsm8k|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_02_17T00_25_52.922442 path: - 
'**/details_harness|hellaswag|10_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-17T00-25-52.922442.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-management|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-02-17T00-25-52.922442.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-17T00-25-52.922442.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-17T00-25-52.922442.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-management|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-02-17T00-25-52.922442.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-02-17T00-25-52.922442.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-17T00-25-52.922442.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-international_law|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-management|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-marketing|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-sociology|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-virology|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-02-17T00-25-52.922442.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|truthfulqa:mc|0_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-02-17T00-25-52.922442.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_02_17T00_25_52.922442 path: - '**/details_harness|winogrande|5_2024-02-17T00-25-52.922442.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-02-17T00-25-52.922442.parquet' - config_name: results data_files: - split: 
2024_02_17T00_25_52.922442 path: - results_2024-02-17T00-25-52.922442.parquet - split: latest path: - results_2024-02-17T00-25-52.922442.parquet
---

# Dataset Card for Evaluation run of fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s](https://huggingface.co/fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-02-17T00:25:52.922442](https://huggingface.co/datasets/open-llm-leaderboard/details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s/blob/main/results_2024-02-17T00-25-52.922442.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6135127589001832, "acc_stderr": 0.032796760940438034, "acc_norm": 0.6157670560754505, "acc_norm_stderr": 0.033451817306662635, "mc1": 0.37454100367197063, "mc1_stderr": 0.016943535128405327, "mc2": 0.5477195184186756, "mc2_stderr": 0.015358664393160576 }, "harness|arc:challenge|25": { "acc": 0.5930034129692833, "acc_stderr": 0.01435639941800912, "acc_norm": 0.6407849829351536, "acc_norm_stderr": 0.014020224155839162 }, "harness|hellaswag|10": { "acc": 0.6493726349332802, "acc_stderr": 0.00476191251170751, "acc_norm": 0.841167098187612, "acc_norm_stderr": 0.003647731723938848 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6074074074074074, "acc_stderr": 0.0421850621536888, "acc_norm": 0.6074074074074074, "acc_norm_stderr": 0.0421850621536888 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.6447368421052632, "acc_stderr": 0.03894734487013316, "acc_norm": 0.6447368421052632, "acc_norm_stderr": 0.03894734487013316 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.6, "acc_stderr": 0.04923659639173309, "acc_norm": 0.6, "acc_norm_stderr": 0.04923659639173309 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.690566037735849, "acc_stderr": 0.028450154794118637, "acc_norm": 0.690566037735849, "acc_norm_stderr": 0.028450154794118637 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7569444444444444, "acc_stderr": 0.03586879280080341, "acc_norm": 0.7569444444444444, "acc_norm_stderr": 0.03586879280080341 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.45, "acc_stderr": 0.05, "acc_norm": 0.45, "acc_norm_stderr": 0.05 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.48, "acc_stderr": 0.050211673156867795, "acc_norm": 0.48, "acc_norm_stderr": 0.050211673156867795 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.24, "acc_stderr": 0.04292346959909282, "acc_norm": 0.24, "acc_norm_stderr": 0.04292346959909282 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.630057803468208, "acc_stderr": 0.0368122963339432, "acc_norm": 0.630057803468208, "acc_norm_stderr": 0.0368122963339432 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.3333333333333333, "acc_stderr": 0.04690650298201942, "acc_norm": 0.3333333333333333, "acc_norm_stderr": 0.04690650298201942 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.73, "acc_stderr": 0.044619604333847394, "acc_norm": 0.73, "acc_norm_stderr": 0.044619604333847394 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5361702127659574, "acc_stderr": 0.03260038511835771, "acc_norm": 0.5361702127659574, "acc_norm_stderr": 0.03260038511835771 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.4298245614035088, "acc_stderr": 0.046570472605949625, "acc_norm": 0.4298245614035088, "acc_norm_stderr": 0.046570472605949625 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.503448275862069, "acc_stderr": 0.04166567577101579, "acc_norm": 0.503448275862069, "acc_norm_stderr": 0.04166567577101579 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.3968253968253968, "acc_stderr": 0.025197101074246483, "acc_norm": 0.3968253968253968, "acc_norm_stderr": 0.025197101074246483 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4365079365079365, "acc_stderr": 0.04435932892851466, "acc_norm": 0.4365079365079365, "acc_norm_stderr": 0.04435932892851466 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7354838709677419, "acc_stderr": 0.02509189237885928, "acc_norm": 0.7354838709677419, "acc_norm_stderr": 0.02509189237885928 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.4630541871921182, "acc_stderr": 0.035083705204426656, "acc_norm": 0.4630541871921182, "acc_norm_stderr": 0.035083705204426656 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.6, "acc_stderr": 0.049236596391733084, "acc_norm": 0.6, "acc_norm_stderr": 0.049236596391733084 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7454545454545455, "acc_stderr": 0.03401506715249039, "acc_norm": 0.7454545454545455, "acc_norm_stderr": 0.03401506715249039 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7171717171717171, "acc_stderr": 0.03208779558786751, "acc_norm": 0.7171717171717171, "acc_norm_stderr": 0.03208779558786751 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8549222797927462, "acc_stderr": 0.025416343096306422, "acc_norm": 0.8549222797927462, "acc_norm_stderr": 0.025416343096306422 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6076923076923076, "acc_stderr": 0.024756000382130952, "acc_norm": 0.6076923076923076, "acc_norm_stderr": 0.024756000382130952 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.34074074074074073, "acc_stderr": 0.028897748741131147, "acc_norm": 0.34074074074074073, "acc_norm_stderr": 0.028897748741131147 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6428571428571429, "acc_stderr": 0.031124619309328177, "acc_norm": 0.6428571428571429, "acc_norm_stderr": 0.031124619309328177 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.304635761589404, "acc_stderr": 0.03757949922943343, "acc_norm": 0.304635761589404, "acc_norm_stderr": 0.03757949922943343 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8165137614678899, "acc_stderr": 0.0165952597103993, "acc_norm": 0.8165137614678899, "acc_norm_stderr": 0.0165952597103993 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5046296296296297, "acc_stderr": 0.03409825519163572, "acc_norm": 0.5046296296296297, "acc_norm_stderr": 
0.03409825519163572 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.7647058823529411, "acc_stderr": 0.029771775228145635, "acc_norm": 0.7647058823529411, "acc_norm_stderr": 0.029771775228145635 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7805907172995781, "acc_stderr": 0.026939106581553945, "acc_norm": 0.7805907172995781, "acc_norm_stderr": 0.026939106581553945 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6367713004484304, "acc_stderr": 0.032277904428505, "acc_norm": 0.6367713004484304, "acc_norm_stderr": 0.032277904428505 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7404580152671756, "acc_stderr": 0.03844876139785271, "acc_norm": 0.7404580152671756, "acc_norm_stderr": 0.03844876139785271 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7272727272727273, "acc_stderr": 0.04065578140908705, "acc_norm": 0.7272727272727273, "acc_norm_stderr": 0.04065578140908705 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7685185185185185, "acc_stderr": 0.04077494709252627, "acc_norm": 0.7685185185185185, "acc_norm_stderr": 0.04077494709252627 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7055214723926381, "acc_stderr": 0.03581165790474082, "acc_norm": 0.7055214723926381, "acc_norm_stderr": 0.03581165790474082 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.5535714285714286, "acc_stderr": 0.047184714852195865, "acc_norm": 0.5535714285714286, "acc_norm_stderr": 0.047184714852195865 }, "harness|hendrycksTest-management|5": { "acc": 0.8252427184466019, "acc_stderr": 0.03760178006026622, "acc_norm": 0.8252427184466019, "acc_norm_stderr": 0.03760178006026622 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8504273504273504, "acc_stderr": 0.023365051491753715, "acc_norm": 0.8504273504273504, "acc_norm_stderr": 0.023365051491753715 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.67, "acc_stderr": 0.047258156262526094, "acc_norm": 0.67, "acc_norm_stderr": 0.047258156262526094 }, 
"harness|hendrycksTest-miscellaneous|5": { "acc": 0.8109833971902938, "acc_stderr": 0.014000791294407003, "acc_norm": 0.8109833971902938, "acc_norm_stderr": 0.014000791294407003 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.6994219653179191, "acc_stderr": 0.024685316867257796, "acc_norm": 0.6994219653179191, "acc_norm_stderr": 0.024685316867257796 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.3463687150837989, "acc_stderr": 0.015913546784020117, "acc_norm": 0.3463687150837989, "acc_norm_stderr": 0.015913546784020117 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.696078431372549, "acc_stderr": 0.026336613469046626, "acc_norm": 0.696078431372549, "acc_norm_stderr": 0.026336613469046626 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7106109324758842, "acc_stderr": 0.025755865922632945, "acc_norm": 0.7106109324758842, "acc_norm_stderr": 0.025755865922632945 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7222222222222222, "acc_stderr": 0.024922001168886335, "acc_norm": 0.7222222222222222, "acc_norm_stderr": 0.024922001168886335 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.425531914893617, "acc_stderr": 0.02949482760014437, "acc_norm": 0.425531914893617, "acc_norm_stderr": 0.02949482760014437 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4276401564537158, "acc_stderr": 0.012635799922765844, "acc_norm": 0.4276401564537158, "acc_norm_stderr": 0.012635799922765844 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6213235294117647, "acc_stderr": 0.02946513363977613, "acc_norm": 0.6213235294117647, "acc_norm_stderr": 0.02946513363977613 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6454248366013072, "acc_stderr": 0.019353360547553707, "acc_norm": 0.6454248366013072, "acc_norm_stderr": 0.019353360547553707 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6363636363636364, "acc_stderr": 0.046075820907199756, "acc_norm": 0.6363636363636364, "acc_norm_stderr": 
0.046075820907199756 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7020408163265306, "acc_stderr": 0.02927956741106568, "acc_norm": 0.7020408163265306, "acc_norm_stderr": 0.02927956741106568 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8109452736318408, "acc_stderr": 0.02768691358801302, "acc_norm": 0.8109452736318408, "acc_norm_stderr": 0.02768691358801302 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.83, "acc_stderr": 0.0377525168068637, "acc_norm": 0.83, "acc_norm_stderr": 0.0377525168068637 }, "harness|hendrycksTest-virology|5": { "acc": 0.5301204819277109, "acc_stderr": 0.03885425420866767, "acc_norm": 0.5301204819277109, "acc_norm_stderr": 0.03885425420866767 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8011695906432749, "acc_stderr": 0.030611116557432528, "acc_norm": 0.8011695906432749, "acc_norm_stderr": 0.030611116557432528 }, "harness|truthfulqa:mc|0": { "mc1": 0.37454100367197063, "mc1_stderr": 0.016943535128405327, "mc2": 0.5477195184186756, "mc2_stderr": 0.015358664393160576 }, "harness|winogrande|5": { "acc": 0.7695343330702447, "acc_stderr": 0.011835872164836676 }, "harness|gsm8k|5": { "acc": 0.5640636846095527, "acc_stderr": 0.013658968058849159 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard-old
Summary of the original information

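Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation of the model fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s for the Open LLM Leaderboard.

Dataset structure

  • Number of configurations: 63 configurations, each corresponding to one evaluated task.
  • Data source: the dataset was created from 1 run; each run's results are stored as a dedicated split in every configuration, and the split is named after the run's timestamp.
  • Latest results: the "latest" split always points to the most recent results.
  • Aggregated results: an additional "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregate metrics on the Open LLM Leaderboard.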
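
The naming scheme described in the list above can be sketched in a few lines. This is only an illustration inferred from the pairs visible in the card metadata (for example, task key `harness|hendrycksTest-college_physics|5` next to configuration `harness_hendrycksTest_college_physics_5`, and run `2024-02-17T00:25:52.922442` next to split `2024_02_17T00_25_52.922442`). The helper names `config_name` and `split_name` are hypothetical and not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Map a task key such as 'harness|truthfulqa:mc|0' to a configuration name."""
    # Assumption: every character outside [A-Za-z0-9_] is replaced by '_',
    # which reproduces the key/config pairs visible in the card metadata.
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp such as '2024-02-17T00:25:52.922442' to a split name."""
    # Assumption: '-' and ':' become '_' while the fractional-second dot is kept.
    return run_timestamp.replace("-", "_").replace(":", "_")


print(config_name("harness|hendrycksTest-college_physics|5"))
print(split_name("2024-02-17T00:25:52.922442"))
```

If the hub ever changes its naming rules, these helpers would need to be adjusted; the configuration list in the card metadata remains the authoritative source.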

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s",
    "harness_winogrande_5",
    split="latest",
)
```

Latest results

The latest results come from run 2024-02-17T00:25:52.922442. The full per-task metrics are listed in the "Latest results" section of the dataset card above.

Collected summary
Dataset introduction
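Construction
This dataset is an automated by-product of the Open LLM Leaderboard evaluation pipeline. It was generated by evaluating the model fzzhang/Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s on a standardized benchmark suite. The dataset comprises 63 configurations, each corresponding to one evaluated task, drawn from benchmarks that include the ARC Challenge, HellaSwag, MMLU, and TruthfulQA. The results of each evaluation run are captured automatically and stored as a split named after the run's timestamp. The "train" split always points to the latest results, so the data stays current while earlier runs remain traceable.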
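The timestamp-based naming can be sketched as follows. The run timestamp and the `results_….json` file name pattern are taken from the dataset card; the split-name rule (replacing `-` and `:` with `_`) is an assumption about how the leaderboard names its splits and should be checked against the actual repository. The function names are hypothetical.

```python
import re

# Run timestamp quoted in the dataset card.
RUN_TIMESTAMP = "2024-02-17T00:25:52.922442"


def results_filename(timestamp: str) -> str:
    """Name of the aggregated results file, following the pattern linked in the card."""
    return f"results_{timestamp.replace(':', '-')}.json"


def split_name(timestamp: str) -> str:
    """ASSUMPTION: the split name replaces '-' and ':' in the timestamp with '_'."""
    return re.sub(r"[-:]", "_", timestamp)


print(results_filename(RUN_TIMESTAMP))
print(split_name(RUN_TIMESTAMP))
```

If the assumed rule holds, the second value could be passed as `split=` in place of `"train"` to pin an analysis to this specific run.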
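Characteristics
The dataset's core characteristic is that it serves as a fine-grained, comprehensive snapshot of model performance. It reports aggregate metrics across benchmarks, such as accuracy and its standard error, and also records the evaluation details of every individual task. Separate configurations show how the model's performance differs across commonsense reasoning, domain knowledge, mathematical ability, and truthfulness. This layered, structured organization gives an empirical basis for analyzing the model's capabilities and limits, and for comparing or diagnosing models.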
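As a worked example of the accuracy and standard-error pairs, the sketch below reproduces two `acc_stderr` values from the results listing. The formula, the sample standard error of a mean of 0/1 scores, is inferred from the numbers rather than stated in the source, and the question count of 100 for `college_chemistry` and `abstract_algebra` is an assumption. The function names are hypothetical.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Sample standard error of an accuracy computed over n 0/1 scores (n - 1 denominator)."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> tuple:
    """Normal-approximation interval; z = 1.96 corresponds to roughly 95% coverage."""
    return acc - z * stderr, acc + z * stderr


N_QUESTIONS = 100  # ASSUMPTION: number of test questions in each of these two subtasks

# college_chemistry reports acc = 0.45 and acc_stderr = 0.05
print(accuracy_stderr(0.45, N_QUESTIONS))
# abstract_algebra reports acc = 0.31 and acc_stderr = 0.04648231987117316
print(accuracy_stderr(0.31, N_QUESTIONS))

low, high = confidence_interval(0.45, 0.05)
print(f"approximate 95% interval for college_chemistry: [{low:.3f}, {high:.3f}]")
```

Under these assumptions, an interval of about ±0.1 around a subtask accuracy means that differences of a few points between models on a single 100-question subtask are within noise.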
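Usage
Researchers can load the dataset with the Hugging Face `datasets` library. The loading call accepts a task configuration (for example `harness_winogrande_5`) and a split (`train` or a specific timestamped split), and returns the detailed evaluation records for that combination. The dataset also contains an aggregate configuration named "results", which stores the summary metrics that are computed and displayed on the Open LLM Leaderboard. This design supports both high-level performance comparison and task-level error analysis.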
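A minimal loading sketch, assuming the `datasets` package is installed and the repository is reachable. The repository ID and the `load_dataset` call mirror the snippet in the dataset card; note that this page lists the dataset under `open-llm-leaderboard-old`, so the ID may need adjusting. The helper that derives a configuration name from a results key (replacing every run of non-alphanumeric characters with `_`) is an assumption, and both function names are hypothetical.

```python
import re

# Repository ID as written in the dataset card's own example.
REPO_ID = (
    "open-llm-leaderboard/"
    "details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s"
)


def config_name(result_key: str) -> str:
    """ASSUMPTION: configuration names are the results keys with separators replaced by '_'."""
    return re.sub(r"[^0-9A-Za-z]+", "_", result_key)


def load_details(result_key: str, split: str = "train"):
    """Load the per-example details for one task; needs network access."""
    from datasets import load_dataset  # imported lazily so the helper above works offline

    return load_dataset(REPO_ID, config_name(result_key), split=split)


print(config_name("harness|winogrande|5"))
print(config_name("harness|hendrycksTest-anatomy|5"))

# Example call (not executed here because it downloads data):
# details = load_details("harness|winogrande|5")
```

Passing `"results"` directly as the configuration name to `load_dataset` would, per the card, return the aggregated metrics instead of per-task details.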
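Background and challenges
Background overview
As large language models (LLMs) develop rapidly, evaluating their overall capabilities has become a key part of technical progress. The Open LLM Leaderboard, hosted on Hugging Face, is an open, standardized evaluation benchmark that measures how different LLMs perform across a range of tasks. The dataset 'open-llm-leaderboard-old/details_fzzhang__Marcoroni-neural-chat-7B-v2_gsm8k_quantized_mergedfloat_s' is a product of that framework. It was created automatically in February 2024 during an evaluation run of a model published by fzzhang, and it records that model's detailed results on ARC Challenge, HellaSwag, MMLU (HendrycksTest), GSM8K, and other benchmarks. The dataset provides transparent, reproducible quantitative evidence of the model's performance, and gives the research community data for analyzing its strengths and weaknesses in commonsense reasoning, knowledge question answering, and mathematical problem solving.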
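Current challenges
The central challenge associated with this dataset is how to evaluate the many capabilities of a large language model comprehensively and fairly. The evaluation tasks must span commonsense reasoning (ARC, HellaSwag), domain knowledge (the dozens of subjects covered by MMLU), and mathematical problem solving (GSM8K), so the benchmark design needs enough breadth and depth to avoid evaluation bias. Building the dataset poses its own challenges. The automated pipeline must collect data consistently and completely across all 63 task configurations. The timestamped split produced by each evaluation run must be archived and linked accurately so that historical results can be traced and compared. Finally, heterogeneous results from multiple runs must be aggregated efficiently into statistically meaningful summary metrics, such as accuracy and its standard error, for the leaderboard to remain credible.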
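Common scenarios
Classic use cases
As the output of an Open LLM Leaderboard evaluation run, the dataset is typically used for fine-grained analysis of model performance. Because it covers benchmarks such as ARC Challenge, HellaSwag, MMLU, and GSM8K, it allows a systematic assessment of the model's commonsense reasoning, language understanding, domain knowledge, and mathematical problem solving, and so provides empirical evidence for comparing and improving models.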
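One way to carry out this kind of per-dimension analysis is to group the per-task results by benchmark family and average them. The sketch below uses a small excerpt of `acc` values copied from the results listing; the grouping rule (taking the text before the first `-` or `:` in the task name) and the unweighted mean are illustrative choices, not the leaderboard's own aggregation, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Excerpt of per-task accuracies copied from the results listing.
RESULTS_EXCERPT = {
    "harness|arc:challenge|25": {"acc": 0.5930034129692833},
    "harness|hellaswag|10": {"acc": 0.6493726349332802},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.31},
    "harness|hendrycksTest-business_ethics|5": {"acc": 0.6},
    "harness|hendrycksTest-college_chemistry|5": {"acc": 0.45},
}


def benchmark_family(result_key: str) -> str:
    """Return the benchmark family from a key shaped like 'harness|<task>|<n_shot>'."""
    task = result_key.split("|")[1]
    return re.split(r"[-:]", task)[0]


def mean_accuracy_by_family(results: dict) -> dict:
    """Unweighted mean of 'acc' over the tasks in each benchmark family."""
    grouped = defaultdict(list)
    for key, metrics in results.items():
        grouped[benchmark_family(key)].append(metrics["acc"])
    return {family: mean(values) for family, values in grouped.items()}


family_means = mean_accuracy_by_family(RESULTS_EXCERPT)
for family, value in sorted(family_means.items()):
    print(f"{family}: {value:.4f}")
```

An unweighted mean treats every subtask equally regardless of its question count; weighting by the number of questions would be an alternative if those counts are available.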
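Derived and related work
Work derived from this dataset centers on extending evaluation frameworks and on model improvement strategies. For example, its multi-task results have informed finer-grained benchmark suites, such as evaluations dedicated to specialized domains. The model weaknesses it exposes have also motivated targeted training techniques, such as knowledge augmentation and reasoning optimization, aimed at improving LLM performance on complex tasks.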
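Recent research on the dataset
Latest research directions
As part of the Open LLM Leaderboard, this dataset relates to current work on quantization and model merging. By combining the GSM8K mathematical reasoning task with multi-domain knowledge evaluation, it supports research on how well a model retains its performance after quantization and how well it generalizes across tasks. Current interest centers on efficient inference and lightweight models, with the goal of balancing compute cost against accuracy for edge deployment and real-time applications. This direction supports transparent, standardized evaluation in the open-source model ecosystem and gives later work on model optimization and architecture a reference baseline.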
The content above was collected and summarized by 遇见数据集.