
open-llm-leaderboard-old/details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT

Hugging Face, updated 2024-03-22, indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT
Resource description:
--- pretty_name: Evaluation run of Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT](https://huggingface.co/Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-03-22T01:16:42.442021](https://huggingface.co/datasets/open-llm-leaderboard/details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT/blob/main/results_2024-03-22T01-16-42.442021.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.2598129955043114,\n\ \ \"acc_stderr\": 0.030998231843312046,\n \"acc_norm\": 0.2605270311704007,\n\ \ \"acc_norm_stderr\": 0.03172990550289709,\n \"mc1\": 0.2141982864137087,\n\ \ \"mc1_stderr\": 0.014362148155690454,\n \"mc2\": 0.37146794643035425,\n\ \ \"mc2_stderr\": 0.015253539853221339\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.2440273037542662,\n \"acc_stderr\": 0.012551447627856257,\n\ \ \"acc_norm\": 0.26535836177474403,\n \"acc_norm_stderr\": 0.012902554762313967\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.3353913563035252,\n\ \ \"acc_stderr\": 0.004711622011148475,\n \"acc_norm\": 0.39693288189603665,\n\ \ \"acc_norm_stderr\": 0.004882619484166603\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.26,\n \"acc_stderr\": 0.04408440022768081,\n \ \ \"acc_norm\": 0.26,\n \"acc_norm_stderr\": 0.04408440022768081\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.34074074074074073,\n\ \ \"acc_stderr\": 0.04094376269996793,\n \"acc_norm\": 0.34074074074074073,\n\ \ \"acc_norm_stderr\": 0.04094376269996793\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.27631578947368424,\n \"acc_stderr\": 0.03639057569952924,\n\ \ \"acc_norm\": 0.27631578947368424,\n \"acc_norm_stderr\": 0.03639057569952924\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.23,\n\ \ \"acc_stderr\": 0.04229525846816506,\n \"acc_norm\": 0.23,\n \ \ \"acc_norm_stderr\": 0.04229525846816506\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.2679245283018868,\n \"acc_stderr\": 0.027257260322494845,\n\ \ \"acc_norm\": 0.2679245283018868,\n \"acc_norm_stderr\": 0.027257260322494845\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.2569444444444444,\n\ \ \"acc_stderr\": 0.03653946969442099,\n \"acc_norm\": 0.2569444444444444,\n\ \ \"acc_norm_stderr\": 
0.03653946969442099\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.17,\n \"acc_stderr\": 0.03775251680686371,\n \ \ \"acc_norm\": 0.17,\n \"acc_norm_stderr\": 0.03775251680686371\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.26,\n \"acc_stderr\": 0.0440844002276808,\n \"acc_norm\": 0.26,\n\ \ \"acc_norm_stderr\": 0.0440844002276808\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.2,\n \"acc_stderr\": 0.04020151261036845,\n \ \ \"acc_norm\": 0.2,\n \"acc_norm_stderr\": 0.04020151261036845\n },\n\ \ \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.24277456647398843,\n\ \ \"acc_stderr\": 0.0326926380614177,\n \"acc_norm\": 0.24277456647398843,\n\ \ \"acc_norm_stderr\": 0.0326926380614177\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.22549019607843138,\n \"acc_stderr\": 0.041583075330832865,\n\ \ \"acc_norm\": 0.22549019607843138,\n \"acc_norm_stderr\": 0.041583075330832865\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \"acc_norm\": 0.3,\n\ \ \"acc_norm_stderr\": 0.046056618647183814\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.2425531914893617,\n \"acc_stderr\": 0.02802022627120022,\n\ \ \"acc_norm\": 0.2425531914893617,\n \"acc_norm_stderr\": 0.02802022627120022\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.2807017543859649,\n\ \ \"acc_stderr\": 0.04227054451232199,\n \"acc_norm\": 0.2807017543859649,\n\ \ \"acc_norm_stderr\": 0.04227054451232199\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.296551724137931,\n \"acc_stderr\": 0.03806142687309993,\n\ \ \"acc_norm\": 0.296551724137931,\n \"acc_norm_stderr\": 0.03806142687309993\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.2724867724867725,\n \"acc_stderr\": 0.022930973071633345,\n \"\ acc_norm\": 0.2724867724867725,\n 
\"acc_norm_stderr\": 0.022930973071633345\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.15079365079365079,\n\ \ \"acc_stderr\": 0.03200686497287392,\n \"acc_norm\": 0.15079365079365079,\n\ \ \"acc_norm_stderr\": 0.03200686497287392\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.37,\n \"acc_stderr\": 0.048523658709391,\n \ \ \"acc_norm\": 0.37,\n \"acc_norm_stderr\": 0.048523658709391\n },\n\ \ \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.24838709677419354,\n\ \ \"acc_stderr\": 0.02458002892148101,\n \"acc_norm\": 0.24838709677419354,\n\ \ \"acc_norm_stderr\": 0.02458002892148101\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.2315270935960591,\n \"acc_stderr\": 0.02967833314144445,\n\ \ \"acc_norm\": 0.2315270935960591,\n \"acc_norm_stderr\": 0.02967833314144445\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.33,\n \"acc_stderr\": 0.047258156262526045,\n \"acc_norm\"\ : 0.33,\n \"acc_norm_stderr\": 0.047258156262526045\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.2909090909090909,\n \"acc_stderr\": 0.03546563019624335,\n\ \ \"acc_norm\": 0.2909090909090909,\n \"acc_norm_stderr\": 0.03546563019624335\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.21717171717171718,\n \"acc_stderr\": 0.02937661648494564,\n \"\ acc_norm\": 0.21717171717171718,\n \"acc_norm_stderr\": 0.02937661648494564\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.21761658031088082,\n \"acc_stderr\": 0.02977866303775295,\n\ \ \"acc_norm\": 0.21761658031088082,\n \"acc_norm_stderr\": 0.02977866303775295\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.24871794871794872,\n \"acc_stderr\": 0.021916957709213796,\n\ \ \"acc_norm\": 0.24871794871794872,\n \"acc_norm_stderr\": 0.021916957709213796\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.2740740740740741,\n \"acc_stderr\": 0.027195934804085622,\n \ \ \"acc_norm\": 0.2740740740740741,\n \"acc_norm_stderr\": 0.027195934804085622\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.21008403361344538,\n \"acc_stderr\": 0.026461398717471874,\n\ \ \"acc_norm\": 0.21008403361344538,\n \"acc_norm_stderr\": 0.026461398717471874\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.271523178807947,\n \"acc_stderr\": 0.03631329803969653,\n \"acc_norm\"\ : 0.271523178807947,\n \"acc_norm_stderr\": 0.03631329803969653\n },\n\ \ \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\": 0.23669724770642203,\n\ \ \"acc_stderr\": 0.01822407811729908,\n \"acc_norm\": 0.23669724770642203,\n\ \ \"acc_norm_stderr\": 0.01822407811729908\n },\n \"harness|hendrycksTest-high_school_statistics|5\"\ : {\n \"acc\": 0.24537037037037038,\n \"acc_stderr\": 0.02934666509437294,\n\ \ \"acc_norm\": 0.24537037037037038,\n \"acc_norm_stderr\": 0.02934666509437294\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.2647058823529412,\n \"acc_stderr\": 0.03096451792692341,\n \"\ acc_norm\": 0.2647058823529412,\n \"acc_norm_stderr\": 0.03096451792692341\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.26582278481012656,\n \"acc_stderr\": 0.028756799629658335,\n \ \ \"acc_norm\": 0.26582278481012656,\n \"acc_norm_stderr\": 0.028756799629658335\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.23318385650224216,\n\ \ \"acc_stderr\": 0.028380391147094716,\n \"acc_norm\": 0.23318385650224216,\n\ \ \"acc_norm_stderr\": 0.028380391147094716\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.2366412213740458,\n \"acc_stderr\": 0.03727673575596919,\n\ \ \"acc_norm\": 0.2366412213740458,\n \"acc_norm_stderr\": 0.03727673575596919\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.38016528925619836,\n \"acc_stderr\": 0.04431324501968432,\n \"\ acc_norm\": 0.38016528925619836,\n \"acc_norm_stderr\": 0.04431324501968432\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.24074074074074073,\n\ \ \"acc_stderr\": 0.041331194402438376,\n \"acc_norm\": 0.24074074074074073,\n\ \ \"acc_norm_stderr\": 0.041331194402438376\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.3006134969325153,\n \"acc_stderr\": 0.03602511318806771,\n\ \ \"acc_norm\": 0.3006134969325153,\n \"acc_norm_stderr\": 0.03602511318806771\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.21428571428571427,\n\ \ \"acc_stderr\": 0.038946411200447915,\n \"acc_norm\": 0.21428571428571427,\n\ \ \"acc_norm_stderr\": 0.038946411200447915\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.23300970873786409,\n \"acc_stderr\": 0.04185832598928315,\n\ \ \"acc_norm\": 0.23300970873786409,\n \"acc_norm_stderr\": 0.04185832598928315\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.2564102564102564,\n\ \ \"acc_stderr\": 0.028605953702004253,\n \"acc_norm\": 0.2564102564102564,\n\ \ \"acc_norm_stderr\": 0.028605953702004253\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.22,\n \"acc_stderr\": 0.041633319989322695,\n \ \ \"acc_norm\": 0.22,\n \"acc_norm_stderr\": 0.041633319989322695\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.2567049808429119,\n\ \ \"acc_stderr\": 0.01562048026306455,\n \"acc_norm\": 0.2567049808429119,\n\ \ \"acc_norm_stderr\": 0.01562048026306455\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.30057803468208094,\n \"acc_stderr\": 0.0246853168672578,\n\ \ \"acc_norm\": 0.30057803468208094,\n \"acc_norm_stderr\": 0.0246853168672578\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.24804469273743016,\n\ \ \"acc_stderr\": 0.014444157808261431,\n 
\"acc_norm\": 0.24804469273743016,\n\ \ \"acc_norm_stderr\": 0.014444157808261431\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.2581699346405229,\n \"acc_stderr\": 0.025058503316958154,\n\ \ \"acc_norm\": 0.2581699346405229,\n \"acc_norm_stderr\": 0.025058503316958154\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.2958199356913183,\n\ \ \"acc_stderr\": 0.02592237178881879,\n \"acc_norm\": 0.2958199356913183,\n\ \ \"acc_norm_stderr\": 0.02592237178881879\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.2623456790123457,\n \"acc_stderr\": 0.024477222856135114,\n\ \ \"acc_norm\": 0.2623456790123457,\n \"acc_norm_stderr\": 0.024477222856135114\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.2730496453900709,\n \"acc_stderr\": 0.026577860943307857,\n \ \ \"acc_norm\": 0.2730496453900709,\n \"acc_norm_stderr\": 0.026577860943307857\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.2737940026075619,\n\ \ \"acc_stderr\": 0.011388612167979388,\n \"acc_norm\": 0.2737940026075619,\n\ \ \"acc_norm_stderr\": 0.011388612167979388\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.3014705882352941,\n \"acc_stderr\": 0.027875982114273168,\n\ \ \"acc_norm\": 0.3014705882352941,\n \"acc_norm_stderr\": 0.027875982114273168\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.2875816993464052,\n \"acc_stderr\": 0.018311653053648222,\n \ \ \"acc_norm\": 0.2875816993464052,\n \"acc_norm_stderr\": 0.018311653053648222\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.20909090909090908,\n\ \ \"acc_stderr\": 0.038950910157241364,\n \"acc_norm\": 0.20909090909090908,\n\ \ \"acc_norm_stderr\": 0.038950910157241364\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.19183673469387755,\n \"acc_stderr\": 0.025206963154225395,\n\ \ \"acc_norm\": 0.19183673469387755,\n \"acc_norm_stderr\": 
0.025206963154225395\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.2537313432835821,\n\ \ \"acc_stderr\": 0.03076944496729601,\n \"acc_norm\": 0.2537313432835821,\n\ \ \"acc_norm_stderr\": 0.03076944496729601\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.18,\n \"acc_stderr\": 0.03861229196653696,\n \ \ \"acc_norm\": 0.18,\n \"acc_norm_stderr\": 0.03861229196653696\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.3132530120481928,\n\ \ \"acc_stderr\": 0.03610805018031023,\n \"acc_norm\": 0.3132530120481928,\n\ \ \"acc_norm_stderr\": 0.03610805018031023\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.29239766081871343,\n \"acc_stderr\": 0.034886477134579215,\n\ \ \"acc_norm\": 0.29239766081871343,\n \"acc_norm_stderr\": 0.034886477134579215\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.2141982864137087,\n\ \ \"mc1_stderr\": 0.014362148155690454,\n \"mc2\": 0.37146794643035425,\n\ \ \"mc2_stderr\": 0.015253539853221339\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.5217048145224941,\n \"acc_stderr\": 0.014039239216484626\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.03866565579984837,\n \ \ \"acc_stderr\": 0.005310583162098055\n }\n}\n```" repo_url: https://huggingface.co/Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|arc:challenge|25_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-03-22T01-16-42.442021.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|gsm8k|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hellaswag_10 
data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hellaswag|10_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-22T01-16-42.442021.parquet' - 
'**/details_harness|hendrycksTest-high_school_biology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-22T01-16-42.442021.parquet' - 
'**/details_harness|hendrycksTest-management|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-22T01-16-42.442021.parquet' - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-22T01-16-42.442021.parquet' - 
'**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-22T01-16-42.442021.parquet' - 
'**/details_harness|hendrycksTest-philosophy|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-22T01-16-42.442021.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-22T01-16-42.442021.parquet' - config_name: 
harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 
2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - 
'**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-22T01-16-42.442021.parquet' - config_name: 
harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 
data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-management|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-marketing|5_2024-03-22T01-16-42.442021.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - 
split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-virology|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-22T01-16-42.442021.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|truthfulqa:mc|0_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-03-22T01-16-42.442021.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_03_22T01_16_42.442021 path: - '**/details_harness|winogrande|5_2024-03-22T01-16-42.442021.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-03-22T01-16-42.442021.parquet' - config_name: results data_files: - split: 
2024_03_22T01_16_42.442021 path: - results_2024-03-22T01-16-42.442021.parquet - split: latest path: - results_2024-03-22T01-16-42.442021.parquet
---

# Dataset Card for Evaluation run of Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT](https://huggingface.co/Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. According to the split definitions in the configuration above, the "latest" split always points to the latest results.

An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# The configurations on this card define two splits per task:
# one named after the run timestamp, and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-03-22T01:16:42.442021](https://huggingface.co/datasets/open-llm-leaderboard/details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT/blob/main/results_2024-03-22T01-16-42.442021.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.2598129955043114, "acc_stderr": 0.030998231843312046, "acc_norm": 0.2605270311704007, "acc_norm_stderr": 0.03172990550289709, "mc1": 0.2141982864137087, "mc1_stderr": 0.014362148155690454, "mc2": 0.37146794643035425, "mc2_stderr": 0.015253539853221339 }, "harness|arc:challenge|25": { "acc": 0.2440273037542662, "acc_stderr": 0.012551447627856257, "acc_norm": 0.26535836177474403, "acc_norm_stderr": 0.012902554762313967 }, "harness|hellaswag|10": { "acc": 0.3353913563035252, "acc_stderr": 0.004711622011148475, "acc_norm": 0.39693288189603665, "acc_norm_stderr": 0.004882619484166603 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.26, "acc_stderr": 0.04408440022768081, "acc_norm": 0.26, "acc_norm_stderr": 0.04408440022768081 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.34074074074074073, "acc_stderr": 0.04094376269996793, "acc_norm": 0.34074074074074073, "acc_norm_stderr": 0.04094376269996793 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.27631578947368424, "acc_stderr": 0.03639057569952924, "acc_norm": 0.27631578947368424, "acc_norm_stderr": 0.03639057569952924 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.23, "acc_stderr": 0.04229525846816506, "acc_norm": 0.23, "acc_norm_stderr": 0.04229525846816506 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.2679245283018868, "acc_stderr": 0.027257260322494845, "acc_norm": 0.2679245283018868, "acc_norm_stderr": 0.027257260322494845 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.2569444444444444, "acc_stderr": 0.03653946969442099, "acc_norm": 0.2569444444444444, "acc_norm_stderr": 0.03653946969442099 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.17, "acc_stderr": 0.03775251680686371, "acc_norm": 0.17, "acc_norm_stderr": 0.03775251680686371 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.26, "acc_stderr": 0.0440844002276808, "acc_norm": 0.26, 
"acc_norm_stderr": 0.0440844002276808 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.2, "acc_stderr": 0.04020151261036845, "acc_norm": 0.2, "acc_norm_stderr": 0.04020151261036845 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.24277456647398843, "acc_stderr": 0.0326926380614177, "acc_norm": 0.24277456647398843, "acc_norm_stderr": 0.0326926380614177 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.22549019607843138, "acc_stderr": 0.041583075330832865, "acc_norm": 0.22549019607843138, "acc_norm_stderr": 0.041583075330832865 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.2425531914893617, "acc_stderr": 0.02802022627120022, "acc_norm": 0.2425531914893617, "acc_norm_stderr": 0.02802022627120022 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.2807017543859649, "acc_stderr": 0.04227054451232199, "acc_norm": 0.2807017543859649, "acc_norm_stderr": 0.04227054451232199 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.296551724137931, "acc_stderr": 0.03806142687309993, "acc_norm": 0.296551724137931, "acc_norm_stderr": 0.03806142687309993 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.2724867724867725, "acc_stderr": 0.022930973071633345, "acc_norm": 0.2724867724867725, "acc_norm_stderr": 0.022930973071633345 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.15079365079365079, "acc_stderr": 0.03200686497287392, "acc_norm": 0.15079365079365079, "acc_norm_stderr": 0.03200686497287392 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.37, "acc_stderr": 0.048523658709391, "acc_norm": 0.37, "acc_norm_stderr": 0.048523658709391 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.24838709677419354, "acc_stderr": 0.02458002892148101, "acc_norm": 0.24838709677419354, "acc_norm_stderr": 0.02458002892148101 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.2315270935960591, "acc_stderr": 0.02967833314144445, "acc_norm": 0.2315270935960591, "acc_norm_stderr": 0.02967833314144445 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.33, "acc_stderr": 0.047258156262526045, "acc_norm": 0.33, "acc_norm_stderr": 0.047258156262526045 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.2909090909090909, "acc_stderr": 0.03546563019624335, "acc_norm": 0.2909090909090909, "acc_norm_stderr": 0.03546563019624335 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.21717171717171718, "acc_stderr": 0.02937661648494564, "acc_norm": 0.21717171717171718, "acc_norm_stderr": 0.02937661648494564 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.21761658031088082, "acc_stderr": 0.02977866303775295, "acc_norm": 0.21761658031088082, "acc_norm_stderr": 0.02977866303775295 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.24871794871794872, "acc_stderr": 0.021916957709213796, "acc_norm": 0.24871794871794872, "acc_norm_stderr": 0.021916957709213796 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.2740740740740741, "acc_stderr": 0.027195934804085622, "acc_norm": 0.2740740740740741, "acc_norm_stderr": 0.027195934804085622 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.21008403361344538, "acc_stderr": 0.026461398717471874, "acc_norm": 0.21008403361344538, "acc_norm_stderr": 0.026461398717471874 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.271523178807947, "acc_stderr": 0.03631329803969653, "acc_norm": 0.271523178807947, "acc_norm_stderr": 0.03631329803969653 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.23669724770642203, "acc_stderr": 0.01822407811729908, "acc_norm": 0.23669724770642203, "acc_norm_stderr": 0.01822407811729908 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.24537037037037038, "acc_stderr": 
0.02934666509437294, "acc_norm": 0.24537037037037038, "acc_norm_stderr": 0.02934666509437294 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.2647058823529412, "acc_stderr": 0.03096451792692341, "acc_norm": 0.2647058823529412, "acc_norm_stderr": 0.03096451792692341 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.26582278481012656, "acc_stderr": 0.028756799629658335, "acc_norm": 0.26582278481012656, "acc_norm_stderr": 0.028756799629658335 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.23318385650224216, "acc_stderr": 0.028380391147094716, "acc_norm": 0.23318385650224216, "acc_norm_stderr": 0.028380391147094716 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.2366412213740458, "acc_stderr": 0.03727673575596919, "acc_norm": 0.2366412213740458, "acc_norm_stderr": 0.03727673575596919 }, "harness|hendrycksTest-international_law|5": { "acc": 0.38016528925619836, "acc_stderr": 0.04431324501968432, "acc_norm": 0.38016528925619836, "acc_norm_stderr": 0.04431324501968432 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.24074074074074073, "acc_stderr": 0.041331194402438376, "acc_norm": 0.24074074074074073, "acc_norm_stderr": 0.041331194402438376 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.3006134969325153, "acc_stderr": 0.03602511318806771, "acc_norm": 0.3006134969325153, "acc_norm_stderr": 0.03602511318806771 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.21428571428571427, "acc_stderr": 0.038946411200447915, "acc_norm": 0.21428571428571427, "acc_norm_stderr": 0.038946411200447915 }, "harness|hendrycksTest-management|5": { "acc": 0.23300970873786409, "acc_stderr": 0.04185832598928315, "acc_norm": 0.23300970873786409, "acc_norm_stderr": 0.04185832598928315 }, "harness|hendrycksTest-marketing|5": { "acc": 0.2564102564102564, "acc_stderr": 0.028605953702004253, "acc_norm": 0.2564102564102564, "acc_norm_stderr": 0.028605953702004253 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.22, 
"acc_stderr": 0.041633319989322695, "acc_norm": 0.22, "acc_norm_stderr": 0.041633319989322695 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.2567049808429119, "acc_stderr": 0.01562048026306455, "acc_norm": 0.2567049808429119, "acc_norm_stderr": 0.01562048026306455 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.30057803468208094, "acc_stderr": 0.0246853168672578, "acc_norm": 0.30057803468208094, "acc_norm_stderr": 0.0246853168672578 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.24804469273743016, "acc_stderr": 0.014444157808261431, "acc_norm": 0.24804469273743016, "acc_norm_stderr": 0.014444157808261431 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.2581699346405229, "acc_stderr": 0.025058503316958154, "acc_norm": 0.2581699346405229, "acc_norm_stderr": 0.025058503316958154 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.2958199356913183, "acc_stderr": 0.02592237178881879, "acc_norm": 0.2958199356913183, "acc_norm_stderr": 0.02592237178881879 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.2623456790123457, "acc_stderr": 0.024477222856135114, "acc_norm": 0.2623456790123457, "acc_norm_stderr": 0.024477222856135114 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.2730496453900709, "acc_stderr": 0.026577860943307857, "acc_norm": 0.2730496453900709, "acc_norm_stderr": 0.026577860943307857 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.2737940026075619, "acc_stderr": 0.011388612167979388, "acc_norm": 0.2737940026075619, "acc_norm_stderr": 0.011388612167979388 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.3014705882352941, "acc_stderr": 0.027875982114273168, "acc_norm": 0.3014705882352941, "acc_norm_stderr": 0.027875982114273168 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.2875816993464052, "acc_stderr": 0.018311653053648222, "acc_norm": 0.2875816993464052, "acc_norm_stderr": 0.018311653053648222 }, "harness|hendrycksTest-public_relations|5": { "acc": 
0.20909090909090908, "acc_stderr": 0.038950910157241364, "acc_norm": 0.20909090909090908, "acc_norm_stderr": 0.038950910157241364 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.19183673469387755, "acc_stderr": 0.025206963154225395, "acc_norm": 0.19183673469387755, "acc_norm_stderr": 0.025206963154225395 }, "harness|hendrycksTest-sociology|5": { "acc": 0.2537313432835821, "acc_stderr": 0.03076944496729601, "acc_norm": 0.2537313432835821, "acc_norm_stderr": 0.03076944496729601 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.18, "acc_stderr": 0.03861229196653696, "acc_norm": 0.18, "acc_norm_stderr": 0.03861229196653696 }, "harness|hendrycksTest-virology|5": { "acc": 0.3132530120481928, "acc_stderr": 0.03610805018031023, "acc_norm": 0.3132530120481928, "acc_norm_stderr": 0.03610805018031023 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.29239766081871343, "acc_stderr": 0.034886477134579215, "acc_norm": 0.29239766081871343, "acc_norm_stderr": 0.034886477134579215 }, "harness|truthfulqa:mc|0": { "mc1": 0.2141982864137087, "mc1_stderr": 0.014362148155690454, "mc2": 0.37146794643035425, "mc2_stderr": 0.015253539853221339 }, "harness|winogrande|5": { "acc": 0.5217048145224941, "acc_stderr": 0.014039239216484626 }, "harness|gsm8k|5": { "acc": 0.03866565579984837, "acc_stderr": 0.005310583162098055 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. 
-->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
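Provided by:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT on the Open LLM Leaderboard.

Dataset composition

The dataset consists of 63 configurations, each corresponding to one evaluated task. It was created from 1 run. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp. Per the split definitions in the configurations, the "latest" split always points to the most recent results.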
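The timestamp-based split naming can be sketched as follows. This is a minimal illustration, not the leaderboard's own code: the helper names `split_name` and `run_time_from_split` are hypothetical, and the format string is inferred from the split name `2024_03_22T01_16_42.442021` that appears in the configurations on this page.

```python
from datetime import datetime

# Assumed format, inferred from the split names listed on this page.
SPLIT_FORMAT = "%Y_%m_%dT%H_%M_%S.%f"


def split_name(run_time: datetime) -> str:
    """Build the split name for a run from its start time."""
    return run_time.strftime(SPLIT_FORMAT)


def run_time_from_split(split: str) -> datetime:
    """Recover the run time from a timestamped split name."""
    return datetime.strptime(split, SPLIT_FORMAT)


run_time = datetime.fromisoformat("2024-03-22T01:16:42.442021")
print(split_name(run_time))  # 2024_03_22T01_16_42.442021
print(run_time_from_split("2024_03_22T01_16_42.442021").isoformat())
```

Parsing split names back into `datetime` values is one way to sort the splits of a repository chronologically when it holds more than one run.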

Additional configuration

An additional configuration, "results", stores all the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.

Data loading example

```python
from datasets import load_dataset

# The configurations listed under "Configuration details" define a
# timestamped split and a "latest" split for each task.
data = load_dataset(
    "open-llm-leaderboard/details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT",
    "harness_winogrande_5",
    split="latest",
)
```

Latest results

Below is a summary of the latest results, from run 2024-03-22T01:16:42.442021:

```python
{
    "all": {
        "acc": 0.2598129955043114,
        "acc_stderr": 0.030998231843312046,
        "acc_norm": 0.2605270311704007,
        "acc_norm_stderr": 0.03172990550289709,
        "mc1": 0.2141982864137087,
        "mc1_stderr": 0.014362148155690454,
        "mc2": 0.37146794643035425,
        "mc2_stderr": 0.015253539853221339
    },
    "harness|arc:challenge|25": {
        "acc": 0.2440273037542662,
        "acc_stderr": 0.012551447627856257,
        "acc_norm": 0.26535836177474403,
        "acc_norm_stderr": 0.012902554762313967
    },
    "harness|hellaswag|10": {
        "acc": 0.3353913563035252,
        "acc_stderr": 0.004711622011148475,
        "acc_norm": 0.39693288189603665,
        "acc_norm_stderr": 0.004882619484166603
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.26,
        "acc_stderr": 0.04408440022768081,
        "acc_norm": 0.26,
        "acc_norm_stderr": 0.04408440022768081
    },
    # Results for the other tasks...
}
```
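Per-task entries like these can be averaged by task family. The sketch below is only an illustration: the `results` dictionary is a hand-copied excerpt of the values on this page (not the full results file), the helper `mean_metric` is a hypothetical name, and a plain unweighted mean is an assumption rather than the leaderboard's documented aggregation.

```python
from statistics import mean

# Hand-copied excerpt of the results shown on this page.
results = {
    "harness|arc:challenge|25": {"acc": 0.2440273037542662},
    "harness|hellaswag|10": {"acc": 0.3353913563035252},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.26},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.34074074074074073},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.27631578947368424},
}


def mean_metric(results: dict, prefix: str, metric: str = "acc") -> float:
    """Unweighted mean of `metric` over all tasks whose key starts with `prefix`."""
    values = [
        scores[metric]
        for task, scores in results.items()
        if task.startswith(prefix) and metric in scores
    ]
    if not values:
        raise ValueError(f"no task with prefix {prefix!r} reports {metric!r}")
    return mean(values)


# Mean over the three MMLU (hendrycksTest) subjects in the excerpt only.
mmlu_excerpt = mean_metric(results, "harness|hendrycksTest-")
print(f"{mmlu_excerpt:.4f}")  # 0.2924
```

Because the excerpt holds only three of the 57 subjects, the printed value is not the model's MMLU score. Running the same helper over the full results file would be needed for that.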

Configuration details

  • config_name: harness_arc_challenge_25

    • data_files:
      • split: 2024_03_22T01_16_42.442021
        • **/details_harness|arc:challenge|25_2024-03-22T01-16-42.442021.parquet
      • split: latest
        • **/details_harness|arc:challenge|25_2024-03-22T01-16-42.442021.parquet
  • config_name: harness_gsm8k_5

    • data_files:
      • split: 2024_03_22T01_16_42.442021
        • **/details_harness|gsm8k|5_2024-03-22T01-16-42.442021.parquet
      • split: latest
        • **/details_harness|gsm8k|5_2024-03-22T01-16-42.442021.parquet
  • config_name: harness_hellaswag_10

    • data_files:
      • split: 2024_03_22T01_16_42.442021
        • **/details_harness|hellaswag|10_2024-03-22T01-16-42.442021.parquet
      • split: latest
        • **/details_harness|hellaswag|10_2024-03-22T01-16-42.442021.parquet
  • config_name: harness_hendrycksTest_5

    • data_files:
      • split: 2024_03_22T01_16_42.442021
        • **/details_harness|hendrycksTest-abstract_algebra|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-astronomy|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-business_ethics|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-college_biology|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-college_chemistry|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-college_computer_science|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-college_mathematics|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-college_medicine|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-college_physics|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-computer_security|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-conceptual_physics|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-econometrics|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-electrical_engineering|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-22T01-16-42.442021.parquet
        • **/details_harness|hendrycksTest-formal_logic|5_2024-03-22T01-16-42.442021.parquet
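The names in the list above look mechanically derived from the task key and the run timestamp. The sketch below reproduces that apparent convention for the entries shown; the function names are hypothetical, and the replacement rules are inferred from this listing rather than taken from the leaderboard's own code. Note that `harness_hendrycksTest_5` in the list is an aggregate configuration that groups many per-subject files, so it does not come from a single task key.

```python
RUN_TIMESTAMP = "2024-03-22T01:16:42.442021"  # run timestamp as shown on this page


def config_name(task_key):
    """'harness|arc:challenge|25' -> 'harness_arc_challenge_25'.

    Assumption: '|', ':' and '-' all map to '_' (inferred from the list above).
    """
    return task_key.replace("|", "_").replace(":", "_").replace("-", "_")


def split_name(run_timestamp):
    """'2024-03-22T01:16:42.442021' -> '2024_03_22T01_16_42.442021'."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def details_glob(task_key, run_timestamp):
    """Build the parquet glob pattern; file names use '-' in place of ':'."""
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


for key in ("harness|arc:challenge|25", "harness|gsm8k|5", "harness|hellaswag|10"):
    print(config_name(key), split_name(RUN_TIMESTAMP), details_glob(key, RUN_TIMESTAMP))
```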
Collected summary
Dataset introduction
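Construction
This dataset was generated automatically during an evaluation run of the Josephgflowers/GPT2-774M-CINDER-SHOW-MULTI-CHAT model on the Open LLM Leaderboard. It consists of 63 configurations, each corresponding to one evaluated task. The data comes from a single run. Each run is stored in every configuration as a split named after the run's timestamp, and the "train" split always points to the latest results. An additional configuration named "results" stores all aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.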
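Features
The dataset's structure reflects the variety and hierarchy of the evaluation tasks: ARC Challenge, HellaSwag, GSM8K, Winogrande, TruthfulQA, and the MMLU benchmark, which covers 57 subjects. Each configuration contains detailed performance metrics, such as accuracy and its standard error, giving a fine-grained quantitative basis for assessing the model. Timestamped splits make each run traceable, which supports reproducing results and comparing them with earlier runs.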
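Because every accuracy is reported with a standard error, an approximate confidence interval can be read off directly. The sketch below assumes a normal approximation (z = 1.96 for roughly 95% coverage), which is an assumption of this example and not something the dataset card states; the values are copied from the results shown on this page, and `confidence_interval` is a hypothetical helper.

```python
def confidence_interval(value, stderr, z=1.96):
    """Return (low, high) under a normal approximation. Hypothetical helper."""
    half_width = z * stderr
    return value - half_width, value + half_width


# (metric value, standard error) pairs copied from the results on this page.
metrics = {
    "arc:challenge (acc_norm)": (0.26535836177474403, 0.012902554762313967),
    "hellaswag (acc_norm)": (0.39693288189603665, 0.004882619484166603),
    "abstract_algebra (acc)": (0.26, 0.04408440022768081),
}

for name, (value, stderr) in metrics.items():
    low, high = confidence_interval(value, stderr)
    print(f"{name:26s} {value:.4f}  approx. 95% CI [{low:.4f}, {high:.4f}]")
```

The interval for a single MMLU subject is much wider than for HellaSwag, since its standard error is roughly nine times larger; per-subject differences of a few points should therefore be read with caution.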
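Usage
The data can be loaded with the Hugging Face datasets library: call load_dataset with a configuration name (for example "harness_winogrande_5") and a split (for example "train") to get the evaluation details for that task. The data is stored in Parquet format, which supports efficient reads. The "results" configuration can also be loaded directly to get the aggregated overall results for combined analysis of model performance and comparison across models.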
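A minimal loading sketch follows. It assumes the `datasets` package is installed and that the repository is reachable under the `open-llm-leaderboard-old` organization named in this page's title (the original card text uses `open-llm-leaderboard` instead, so which ID resolves is an assumption). `load_kwargs` and `load_details` are hypothetical helper names. The configuration list on this page names the newest split "latest", so that may be the split to try if "train" is not found.

```python
REPO_ID = (
    "open-llm-leaderboard-old/"
    "details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT"
)  # assumption: repository ID taken from this page's title


def load_kwargs(config_name="harness_winogrande_5", split="train"):
    """Build keyword arguments for datasets.load_dataset. Hypothetical helper."""
    return {"path": REPO_ID, "name": config_name, "split": split}


def load_details(config_name="harness_winogrande_5", split="train"):
    """Download one task's details; needs network access and `datasets`."""
    from datasets import load_dataset  # imported lazily so the sketch imports offline

    return load_dataset(**load_kwargs(config_name, split))


if __name__ == "__main__":
    # Aggregated results live in the separate "results" configuration.
    print(load_kwargs("results"))
```

Calling `load_details("harness_gsm8k_5", split="latest")` would request the GSM8K details through the split name used in the configuration list.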
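Background and challenges
Background overview
As large language models (LLMs) have developed rapidly, evaluating them systematically and fairly has become a research focus. The Open LLM Leaderboard, launched by Hugging Face in 2023, has become a widely recognized model evaluation platform in the community. This dataset was generated automatically from a single evaluation run of the GPT2-774M-CINDER-SHOW-MULTI-CHAT model submitted by Josephgflowers. It records the model's detailed results across 63 configurations, covering established benchmarks such as ARC Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. The structured results help researchers understand where this particular model's abilities end in commonsense reasoning, knowledge question answering, mathematical problem solving, and factual consistency. They also give later model iterations and comparative analyses a reproducible quantitative basis, which supports more standardized and transparent LLM evaluation.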
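Current challenges
The main challenge this dataset reflects is that GPT2-774M-CINDER-SHOW-MULTI-CHAT performs poorly on most tasks: accuracy is only 24.4% on ARC Challenge, 33.5% on HellaSwag, and as low as 3.9% on GSM8K. This points to the inherent limits of small generative models on complex reasoning and exact mathematical tasks. On the construction side, generating the dataset involves several technical difficulties. First, heterogeneous outputs from evaluation frameworks (such as the EleutherAI Harness) must be unified into a standardized Parquet format, with 63 separate configurations and their timestamped splits kept consistent. Second, the evaluation spans 57 fine-grained subjects, from abstract algebra to medical genetics, so the data is high-dimensional and sample sizes differ across tasks, which complicates aggregate analysis. Third, the automated pipeline must keep each run reproducible while updating the "latest" split to reflect the newest results after a model iteration, which places strict demands on data versioning and storage design.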
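One way to put the low scores in context is to compare them with a random-guess baseline. The sketch below assumes a 25% chance level for ARC Challenge and HellaSwag, on the premise that their questions have four answer options; that baseline is an assumption of this example and is not stated in the dataset. The accuracies and standard errors are the ones reported on this page, and `z_vs_chance` is a hypothetical helper.

```python
CHANCE = 0.25  # assumption: four answer options per question

# (acc, acc_stderr) pairs copied from the results on this page.
scores = {
    "arc:challenge": (0.2440273037542662, 0.012551447627856257),
    "hellaswag": (0.3353913563035252, 0.004711622011148475),
}


def z_vs_chance(acc, stderr, chance=CHANCE):
    """Distance from the chance level, measured in standard errors."""
    return (acc - chance) / stderr


for task, (acc, stderr) in scores.items():
    z = z_vs_chance(acc, stderr)
    # |z| > 1.96 is used here as a rough two-sided 5% threshold.
    verdict = "above noise around chance" if abs(z) > 1.96 else "within noise of chance"
    print(f"{task:14s} acc={acc:.4f}  z={z:+.2f}  {verdict}")
```

Under these assumptions the ARC Challenge accuracy sits within noise of guessing, while the HellaSwag accuracy is clearly above it. GSM8K is left out because it is not a multiple-choice task and has no comparable chance level.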
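Common scenarios
Classic use cases
At the intersection of natural language processing and large language model evaluation, the open-llm-leaderboard-old/details_Josephgflowers__GPT2-774M-CINDER-SHOW-MULTI-CHAT dataset is an automated evaluation artifact of the Open LLM Leaderboard that records, in fine detail, the performance of the GPT2-774M-CINDER-SHOW-MULTI-CHAT model across 63 configurations. Its classic use is as a standardized evaluation reference: by loading the detailed results under each task configuration, researchers can compare model capability on core benchmarks such as ARC Challenge, HellaSwag commonsense reasoning, and the MMLU multi-subject knowledge test. The dataset stores each run's fine-grained metrics in Parquet format and supports looking back at earlier results through timestamped splits, so changes in capability can be tracked across model iterations.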
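Practical applications
In engineering practice, the dataset gives empirical support for model selection and deployment decisions. Development teams can use the recorded accuracies and standard errors to screen for model versions that perform robustly on specific tasks, such as multi-subject question answering or mathematical reasoning. For example, when building an educational tutoring system, the fine-grained MMLU subtask results can guide an assessment of the model's knowledge coverage in fields such as biology and physics. When optimizing a dialogue system, the TruthfulQA mc1 and mc2 metrics directly reflect the model's tendency to avoid generating misleading information. Quantitative evaluation of this kind lowers trial-and-error costs before a model goes live and shortens the path from research prototype to product.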
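Derived and related work
This dataset has given rise to a body of research on evaluation methodology for large language models. Its structured evaluation framework informed later iterations such as Open LLM Leaderboard v2, moving evaluation from a static ranking toward a dynamic, extensible ecosystem. Building on the dataset, researchers have developed tools for optimizing automated evaluation pipelines, analyzing the model's failure cases on tasks such as HellaSwag to propose targeted training strategies. The fine-grained multi-task results have also been used to build model capability maps, leading to newer research directions such as probing a model's knowledge blind spots and characterizing its generalization boundaries, which deepen the understanding of how pretrained language models behave.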
The content above was collected and summarized by 遇见数据集.