
open-llm-leaderboard-old/details_GOAT-AI__GOAT-70B-Storytelling

Hugging Face, updated 2024-01-08, indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_GOAT-AI__GOAT-70B-Storytelling
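The description that follows names each configuration after its task key and each split after the run timestamp. A minimal sketch of that naming, inferred only from the keys and config names visible on this page: the helper names `config_name`, `split_name` and `load_details` are hypothetical, the repository id is taken from the page title (the card's own snippet uses a different organization prefix), and loading assumes the `datasets` library is installed.

```python
import re

# Repository id as shown in the page title; treat it as an assumption, since
# the card's own loading snippet uses a different organization prefix.
REPO_ID = "open-llm-leaderboard-old/details_GOAT-AI__GOAT-70B-Storytelling"


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name."""
    # The config names on this page replace '|', ':' and '-' with underscores.
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the timestamped split name used in the configs."""
    # '-' and ':' become underscores; the fractional seconds keep their dot.
    return re.sub(r"[-:]", "_", run_timestamp)


def load_details(task_key: str, split: str = "latest"):
    """Hypothetical helper: load the per-sample details for one task."""
    # Imported lazily so the naming helpers work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task_key), split=split)


if __name__ == "__main__":
    print(config_name("harness|hendrycksTest-abstract_algebra|5"))
    print(split_name("2024-01-08T04:02:16.743914"))
```

Calling `load_details("harness|gsm8k|5")` would request the `harness_gsm8k_5` configuration with the `latest` split, which is the split name the config list on this page defines alongside the timestamped one.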
Resource description:
--- pretty_name: Evaluation run of GOAT-AI/GOAT-70B-Storytelling dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [GOAT-AI/GOAT-70B-Storytelling](https://huggingface.co/GOAT-AI/GOAT-70B-Storytelling)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_GOAT-AI__GOAT-70B-Storytelling\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-08T04:02:16.743914](https://huggingface.co/datasets/open-llm-leaderboard/details_GOAT-AI__GOAT-70B-Storytelling/blob/main/results_2024-01-08T04-02-16.743914.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6955334014859299,\n\ \ \"acc_stderr\": 0.03022110624122134,\n \"acc_norm\": 0.7020604921664385,\n\ \ \"acc_norm_stderr\": 0.030808489808640836,\n \"mc1\": 0.3818849449204406,\n\ \ \"mc1_stderr\": 0.017008101939163495,\n \"mc2\": 0.535286285114223,\n\ \ \"mc2_stderr\": 0.014750619695125833\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.643344709897611,\n \"acc_stderr\": 0.013998056902620194,\n\ \ \"acc_norm\": 0.6877133105802048,\n \"acc_norm_stderr\": 0.013542598541688065\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6848237402907787,\n\ \ \"acc_stderr\": 0.004636365534819762,\n \"acc_norm\": 0.877414857598088,\n\ \ \"acc_norm_stderr\": 0.0032729014349397656\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.046882617226215034,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.046882617226215034\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6222222222222222,\n\ \ \"acc_stderr\": 0.04188307537595852,\n \"acc_norm\": 0.6222222222222222,\n\ \ \"acc_norm_stderr\": 0.04188307537595852\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.8026315789473685,\n \"acc_stderr\": 0.03238981601699397,\n\ \ \"acc_norm\": 0.8026315789473685,\n \"acc_norm_stderr\": 0.03238981601699397\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.74,\n\ \ \"acc_stderr\": 0.044084400227680794,\n \"acc_norm\": 0.74,\n \ \ \"acc_norm_stderr\": 0.044084400227680794\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7169811320754716,\n \"acc_stderr\": 0.027724236492700918,\n\ \ \"acc_norm\": 0.7169811320754716,\n \"acc_norm_stderr\": 0.027724236492700918\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.8333333333333334,\n\ \ \"acc_stderr\": 0.031164899666948617,\n \"acc_norm\": 0.8333333333333334,\n\ \ \"acc_norm_stderr\": 
0.031164899666948617\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.5,\n \"acc_stderr\": 0.050251890762960605,\n \ \ \"acc_norm\": 0.5,\n \"acc_norm_stderr\": 0.050251890762960605\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.6,\n \"acc_stderr\": 0.049236596391733084,\n \"acc_norm\": 0.6,\n\ \ \"acc_norm_stderr\": 0.049236596391733084\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.4,\n \"acc_stderr\": 0.049236596391733084,\n \ \ \"acc_norm\": 0.4,\n \"acc_norm_stderr\": 0.049236596391733084\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6647398843930635,\n\ \ \"acc_stderr\": 0.03599586301247077,\n \"acc_norm\": 0.6647398843930635,\n\ \ \"acc_norm_stderr\": 0.03599586301247077\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.35294117647058826,\n \"acc_stderr\": 0.047551296160629475,\n\ \ \"acc_norm\": 0.35294117647058826,\n \"acc_norm_stderr\": 0.047551296160629475\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.8,\n \"acc_stderr\": 0.04020151261036846,\n \"acc_norm\": 0.8,\n\ \ \"acc_norm_stderr\": 0.04020151261036846\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.6680851063829787,\n \"acc_stderr\": 0.03078373675774564,\n\ \ \"acc_norm\": 0.6680851063829787,\n \"acc_norm_stderr\": 0.03078373675774564\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.4473684210526316,\n\ \ \"acc_stderr\": 0.046774730044912,\n \"acc_norm\": 0.4473684210526316,\n\ \ \"acc_norm_stderr\": 0.046774730044912\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.6344827586206897,\n \"acc_stderr\": 0.040131241954243856,\n\ \ \"acc_norm\": 0.6344827586206897,\n \"acc_norm_stderr\": 0.040131241954243856\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.4312169312169312,\n \"acc_stderr\": 0.0255064816981382,\n \"acc_norm\"\ : 0.4312169312169312,\n 
\"acc_norm_stderr\": 0.0255064816981382\n },\n\ \ \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.46825396825396826,\n\ \ \"acc_stderr\": 0.04463112720677172,\n \"acc_norm\": 0.46825396825396826,\n\ \ \"acc_norm_stderr\": 0.04463112720677172\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.49,\n \"acc_stderr\": 0.05024183937956912,\n \ \ \"acc_norm\": 0.49,\n \"acc_norm_stderr\": 0.05024183937956912\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.8225806451612904,\n\ \ \"acc_stderr\": 0.021732540689329283,\n \"acc_norm\": 0.8225806451612904,\n\ \ \"acc_norm_stderr\": 0.021732540689329283\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.5270935960591133,\n \"acc_stderr\": 0.03512819077876106,\n\ \ \"acc_norm\": 0.5270935960591133,\n \"acc_norm_stderr\": 0.03512819077876106\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.77,\n \"acc_stderr\": 0.04229525846816506,\n \"acc_norm\"\ : 0.77,\n \"acc_norm_stderr\": 0.04229525846816506\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.8363636363636363,\n \"acc_stderr\": 0.02888787239548795,\n\ \ \"acc_norm\": 0.8363636363636363,\n \"acc_norm_stderr\": 0.02888787239548795\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.8737373737373737,\n \"acc_stderr\": 0.023664359402880236,\n \"\ acc_norm\": 0.8737373737373737,\n \"acc_norm_stderr\": 0.023664359402880236\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.9481865284974094,\n \"acc_stderr\": 0.01599622932024412,\n\ \ \"acc_norm\": 0.9481865284974094,\n \"acc_norm_stderr\": 0.01599622932024412\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.7205128205128205,\n \"acc_stderr\": 0.022752388839776823,\n\ \ \"acc_norm\": 0.7205128205128205,\n \"acc_norm_stderr\": 0.022752388839776823\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.34814814814814815,\n \"acc_stderr\": 0.02904560029061626,\n \ \ \"acc_norm\": 0.34814814814814815,\n \"acc_norm_stderr\": 0.02904560029061626\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.7647058823529411,\n \"acc_stderr\": 0.02755361446786381,\n \ \ \"acc_norm\": 0.7647058823529411,\n \"acc_norm_stderr\": 0.02755361446786381\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.4304635761589404,\n \"acc_stderr\": 0.040428099613956346,\n \"\ acc_norm\": 0.4304635761589404,\n \"acc_norm_stderr\": 0.040428099613956346\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8954128440366973,\n \"acc_stderr\": 0.013120530245265593,\n \"\ acc_norm\": 0.8954128440366973,\n \"acc_norm_stderr\": 0.013120530245265593\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.6018518518518519,\n \"acc_stderr\": 0.033384734032074016,\n \"\ acc_norm\": 0.6018518518518519,\n \"acc_norm_stderr\": 0.033384734032074016\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.9117647058823529,\n \"acc_stderr\": 0.019907399791316945,\n \"\ acc_norm\": 0.9117647058823529,\n \"acc_norm_stderr\": 0.019907399791316945\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.8734177215189873,\n \"acc_stderr\": 0.021644195727955173,\n \ \ \"acc_norm\": 0.8734177215189873,\n \"acc_norm_stderr\": 0.021644195727955173\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.8071748878923767,\n\ \ \"acc_stderr\": 0.026478240960489365,\n \"acc_norm\": 0.8071748878923767,\n\ \ \"acc_norm_stderr\": 0.026478240960489365\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.8702290076335878,\n \"acc_stderr\": 0.029473649496907065,\n\ \ \"acc_norm\": 0.8702290076335878,\n \"acc_norm_stderr\": 0.029473649496907065\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.8677685950413223,\n \"acc_stderr\": 0.03092278832044579,\n \"\ acc_norm\": 0.8677685950413223,\n \"acc_norm_stderr\": 0.03092278832044579\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.8148148148148148,\n\ \ \"acc_stderr\": 0.03755265865037182,\n \"acc_norm\": 0.8148148148148148,\n\ \ \"acc_norm_stderr\": 0.03755265865037182\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.8159509202453987,\n \"acc_stderr\": 0.03044677768797173,\n\ \ \"acc_norm\": 0.8159509202453987,\n \"acc_norm_stderr\": 0.03044677768797173\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5446428571428571,\n\ \ \"acc_stderr\": 0.04726835553719097,\n \"acc_norm\": 0.5446428571428571,\n\ \ \"acc_norm_stderr\": 0.04726835553719097\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.8349514563106796,\n \"acc_stderr\": 0.03675668832233188,\n\ \ \"acc_norm\": 0.8349514563106796,\n \"acc_norm_stderr\": 0.03675668832233188\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.905982905982906,\n\ \ \"acc_stderr\": 0.01911989279892498,\n \"acc_norm\": 0.905982905982906,\n\ \ \"acc_norm_stderr\": 0.01911989279892498\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.72,\n \"acc_stderr\": 0.04512608598542126,\n \ \ \"acc_norm\": 0.72,\n \"acc_norm_stderr\": 0.04512608598542126\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8646232439335888,\n\ \ \"acc_stderr\": 0.012234384586856491,\n \"acc_norm\": 0.8646232439335888,\n\ \ \"acc_norm_stderr\": 0.012234384586856491\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7774566473988439,\n \"acc_stderr\": 0.022394215661942815,\n\ \ \"acc_norm\": 0.7774566473988439,\n \"acc_norm_stderr\": 0.022394215661942815\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.45251396648044695,\n\ \ \"acc_stderr\": 0.016646914804438775,\n \"acc_norm\": 
0.45251396648044695,\n\ \ \"acc_norm_stderr\": 0.016646914804438775\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7875816993464052,\n \"acc_stderr\": 0.023420375478296125,\n\ \ \"acc_norm\": 0.7875816993464052,\n \"acc_norm_stderr\": 0.023420375478296125\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.77491961414791,\n\ \ \"acc_stderr\": 0.023720088516179027,\n \"acc_norm\": 0.77491961414791,\n\ \ \"acc_norm_stderr\": 0.023720088516179027\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.8333333333333334,\n \"acc_stderr\": 0.020736358408060002,\n\ \ \"acc_norm\": 0.8333333333333334,\n \"acc_norm_stderr\": 0.020736358408060002\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.5638297872340425,\n \"acc_stderr\": 0.029583452036284076,\n \ \ \"acc_norm\": 0.5638297872340425,\n \"acc_norm_stderr\": 0.029583452036284076\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.5423728813559322,\n\ \ \"acc_stderr\": 0.012724296550980188,\n \"acc_norm\": 0.5423728813559322,\n\ \ \"acc_norm_stderr\": 0.012724296550980188\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.7389705882352942,\n \"acc_stderr\": 0.02667925227010314,\n\ \ \"acc_norm\": 0.7389705882352942,\n \"acc_norm_stderr\": 0.02667925227010314\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.7598039215686274,\n \"acc_stderr\": 0.017282760695167404,\n \ \ \"acc_norm\": 0.7598039215686274,\n \"acc_norm_stderr\": 0.017282760695167404\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.7181818181818181,\n\ \ \"acc_stderr\": 0.043091187099464585,\n \"acc_norm\": 0.7181818181818181,\n\ \ \"acc_norm_stderr\": 0.043091187099464585\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.8163265306122449,\n \"acc_stderr\": 0.024789071332007633,\n\ \ \"acc_norm\": 0.8163265306122449,\n \"acc_norm_stderr\": 0.024789071332007633\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.900497512437811,\n\ \ \"acc_stderr\": 0.021166216304659393,\n \"acc_norm\": 0.900497512437811,\n\ \ \"acc_norm_stderr\": 0.021166216304659393\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.92,\n \"acc_stderr\": 0.0272659924344291,\n \ \ \"acc_norm\": 0.92,\n \"acc_norm_stderr\": 0.0272659924344291\n },\n\ \ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5542168674698795,\n\ \ \"acc_stderr\": 0.03869543323472101,\n \"acc_norm\": 0.5542168674698795,\n\ \ \"acc_norm_stderr\": 0.03869543323472101\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8538011695906432,\n \"acc_stderr\": 0.027097290118070806,\n\ \ \"acc_norm\": 0.8538011695906432,\n \"acc_norm_stderr\": 0.027097290118070806\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3818849449204406,\n\ \ \"mc1_stderr\": 0.017008101939163495,\n \"mc2\": 0.535286285114223,\n\ \ \"mc2_stderr\": 0.014750619695125833\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.835043409629045,\n \"acc_stderr\": 0.010430917468237428\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.40788476118271416,\n \ \ \"acc_stderr\": 0.013536742075643085\n }\n}\n```" repo_url: https://huggingface.co/GOAT-AI/GOAT-70B-Storytelling leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|arc:challenge|25_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-08T04-02-16.743914.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|gsm8k|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_08T04_02_16.743914 path: - 
'**/details_harness|hellaswag|10_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-08T04-02-16.743914.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-08T04-02-16.743914.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-08T04-02-16.743914.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-08T04-02-16.743914.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-08T04-02-16.743914.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-08T04-02-16.743914.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-08T04-02-16.743914.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-management|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-08T04-02-16.743914.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|truthfulqa:mc|0_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-08T04-02-16.743914.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_08T04_02_16.743914 path: - '**/details_harness|winogrande|5_2024-01-08T04-02-16.743914.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-08T04-02-16.743914.parquet' - config_name: results data_files: - split: 
2024_01_08T04_02_16.743914 path: - results_2024-01-08T04-02-16.743914.parquet - split: latest path: - results_2024-01-08T04-02-16.743914.parquet ---

# Dataset Card for Evaluation run of GOAT-AI/GOAT-70B-Storytelling

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [GOAT-AI/GOAT-70B-Storytelling](https://huggingface.co/GOAT-AI/GOAT-70B-Storytelling) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_GOAT-AI__GOAT-70B-Storytelling",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-01-08T04:02:16.743914](https://huggingface.co/datasets/open-llm-leaderboard/details_GOAT-AI__GOAT-70B-Storytelling/blob/main/results_2024-01-08T04-02-16.743914.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6955334014859299, "acc_stderr": 0.03022110624122134, "acc_norm": 0.7020604921664385, "acc_norm_stderr": 0.030808489808640836, "mc1": 0.3818849449204406, "mc1_stderr": 0.017008101939163495, "mc2": 0.535286285114223, "mc2_stderr": 0.014750619695125833 }, "harness|arc:challenge|25": { "acc": 0.643344709897611, "acc_stderr": 0.013998056902620194, "acc_norm": 0.6877133105802048, "acc_norm_stderr": 0.013542598541688065 }, "harness|hellaswag|10": { "acc": 0.6848237402907787, "acc_stderr": 0.004636365534819762, "acc_norm": 0.877414857598088, "acc_norm_stderr": 0.0032729014349397656 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.32, "acc_stderr": 0.046882617226215034, "acc_norm": 0.32, "acc_norm_stderr": 0.046882617226215034 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6222222222222222, "acc_stderr": 0.04188307537595852, "acc_norm": 0.6222222222222222, "acc_norm_stderr": 0.04188307537595852 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.8026315789473685, "acc_stderr": 0.03238981601699397, "acc_norm": 0.8026315789473685, "acc_norm_stderr": 0.03238981601699397 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.74, "acc_stderr": 0.044084400227680794, "acc_norm": 0.74, "acc_norm_stderr": 0.044084400227680794 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7169811320754716, "acc_stderr": 0.027724236492700918, "acc_norm": 0.7169811320754716, "acc_norm_stderr": 0.027724236492700918 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.8333333333333334, "acc_stderr": 0.031164899666948617, "acc_norm": 0.8333333333333334, "acc_norm_stderr": 0.031164899666948617 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.5, "acc_stderr": 0.050251890762960605, "acc_norm": 0.5, "acc_norm_stderr": 0.050251890762960605 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.6, "acc_stderr": 0.049236596391733084, "acc_norm": 0.6, 
"acc_norm_stderr": 0.049236596391733084 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.4, "acc_stderr": 0.049236596391733084, "acc_norm": 0.4, "acc_norm_stderr": 0.049236596391733084 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6647398843930635, "acc_stderr": 0.03599586301247077, "acc_norm": 0.6647398843930635, "acc_norm_stderr": 0.03599586301247077 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.35294117647058826, "acc_stderr": 0.047551296160629475, "acc_norm": 0.35294117647058826, "acc_norm_stderr": 0.047551296160629475 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.8, "acc_stderr": 0.04020151261036846, "acc_norm": 0.8, "acc_norm_stderr": 0.04020151261036846 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.6680851063829787, "acc_stderr": 0.03078373675774564, "acc_norm": 0.6680851063829787, "acc_norm_stderr": 0.03078373675774564 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.4473684210526316, "acc_stderr": 0.046774730044912, "acc_norm": 0.4473684210526316, "acc_norm_stderr": 0.046774730044912 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.6344827586206897, "acc_stderr": 0.040131241954243856, "acc_norm": 0.6344827586206897, "acc_norm_stderr": 0.040131241954243856 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.4312169312169312, "acc_stderr": 0.0255064816981382, "acc_norm": 0.4312169312169312, "acc_norm_stderr": 0.0255064816981382 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.46825396825396826, "acc_stderr": 0.04463112720677172, "acc_norm": 0.46825396825396826, "acc_norm_stderr": 0.04463112720677172 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.49, "acc_stderr": 0.05024183937956912, "acc_norm": 0.49, "acc_norm_stderr": 0.05024183937956912 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.8225806451612904, "acc_stderr": 0.021732540689329283, "acc_norm": 0.8225806451612904, "acc_norm_stderr": 0.021732540689329283 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5270935960591133, "acc_stderr": 0.03512819077876106, "acc_norm": 0.5270935960591133, "acc_norm_stderr": 0.03512819077876106 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.77, "acc_stderr": 0.04229525846816506, "acc_norm": 0.77, "acc_norm_stderr": 0.04229525846816506 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.8363636363636363, "acc_stderr": 0.02888787239548795, "acc_norm": 0.8363636363636363, "acc_norm_stderr": 0.02888787239548795 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.8737373737373737, "acc_stderr": 0.023664359402880236, "acc_norm": 0.8737373737373737, "acc_norm_stderr": 0.023664359402880236 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.9481865284974094, "acc_stderr": 0.01599622932024412, "acc_norm": 0.9481865284974094, "acc_norm_stderr": 0.01599622932024412 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.7205128205128205, "acc_stderr": 0.022752388839776823, "acc_norm": 0.7205128205128205, "acc_norm_stderr": 0.022752388839776823 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.34814814814814815, "acc_stderr": 0.02904560029061626, "acc_norm": 0.34814814814814815, "acc_norm_stderr": 0.02904560029061626 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.7647058823529411, "acc_stderr": 0.02755361446786381, "acc_norm": 0.7647058823529411, "acc_norm_stderr": 0.02755361446786381 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.4304635761589404, "acc_stderr": 0.040428099613956346, "acc_norm": 0.4304635761589404, "acc_norm_stderr": 0.040428099613956346 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8954128440366973, "acc_stderr": 0.013120530245265593, "acc_norm": 0.8954128440366973, "acc_norm_stderr": 0.013120530245265593 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.6018518518518519, "acc_stderr": 
0.033384734032074016, "acc_norm": 0.6018518518518519, "acc_norm_stderr": 0.033384734032074016 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.9117647058823529, "acc_stderr": 0.019907399791316945, "acc_norm": 0.9117647058823529, "acc_norm_stderr": 0.019907399791316945 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.8734177215189873, "acc_stderr": 0.021644195727955173, "acc_norm": 0.8734177215189873, "acc_norm_stderr": 0.021644195727955173 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.8071748878923767, "acc_stderr": 0.026478240960489365, "acc_norm": 0.8071748878923767, "acc_norm_stderr": 0.026478240960489365 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.8702290076335878, "acc_stderr": 0.029473649496907065, "acc_norm": 0.8702290076335878, "acc_norm_stderr": 0.029473649496907065 }, "harness|hendrycksTest-international_law|5": { "acc": 0.8677685950413223, "acc_stderr": 0.03092278832044579, "acc_norm": 0.8677685950413223, "acc_norm_stderr": 0.03092278832044579 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.8148148148148148, "acc_stderr": 0.03755265865037182, "acc_norm": 0.8148148148148148, "acc_norm_stderr": 0.03755265865037182 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.8159509202453987, "acc_stderr": 0.03044677768797173, "acc_norm": 0.8159509202453987, "acc_norm_stderr": 0.03044677768797173 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.5446428571428571, "acc_stderr": 0.04726835553719097, "acc_norm": 0.5446428571428571, "acc_norm_stderr": 0.04726835553719097 }, "harness|hendrycksTest-management|5": { "acc": 0.8349514563106796, "acc_stderr": 0.03675668832233188, "acc_norm": 0.8349514563106796, "acc_norm_stderr": 0.03675668832233188 }, "harness|hendrycksTest-marketing|5": { "acc": 0.905982905982906, "acc_stderr": 0.01911989279892498, "acc_norm": 0.905982905982906, "acc_norm_stderr": 0.01911989279892498 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.72, "acc_stderr": 
0.04512608598542126, "acc_norm": 0.72, "acc_norm_stderr": 0.04512608598542126 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8646232439335888, "acc_stderr": 0.012234384586856491, "acc_norm": 0.8646232439335888, "acc_norm_stderr": 0.012234384586856491 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7774566473988439, "acc_stderr": 0.022394215661942815, "acc_norm": 0.7774566473988439, "acc_norm_stderr": 0.022394215661942815 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.45251396648044695, "acc_stderr": 0.016646914804438775, "acc_norm": 0.45251396648044695, "acc_norm_stderr": 0.016646914804438775 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7875816993464052, "acc_stderr": 0.023420375478296125, "acc_norm": 0.7875816993464052, "acc_norm_stderr": 0.023420375478296125 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.77491961414791, "acc_stderr": 0.023720088516179027, "acc_norm": 0.77491961414791, "acc_norm_stderr": 0.023720088516179027 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.8333333333333334, "acc_stderr": 0.020736358408060002, "acc_norm": 0.8333333333333334, "acc_norm_stderr": 0.020736358408060002 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.5638297872340425, "acc_stderr": 0.029583452036284076, "acc_norm": 0.5638297872340425, "acc_norm_stderr": 0.029583452036284076 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.5423728813559322, "acc_stderr": 0.012724296550980188, "acc_norm": 0.5423728813559322, "acc_norm_stderr": 0.012724296550980188 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.7389705882352942, "acc_stderr": 0.02667925227010314, "acc_norm": 0.7389705882352942, "acc_norm_stderr": 0.02667925227010314 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.7598039215686274, "acc_stderr": 0.017282760695167404, "acc_norm": 0.7598039215686274, "acc_norm_stderr": 0.017282760695167404 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.7181818181818181, "acc_stderr": 
0.043091187099464585, "acc_norm": 0.7181818181818181, "acc_norm_stderr": 0.043091187099464585 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.8163265306122449, "acc_stderr": 0.024789071332007633, "acc_norm": 0.8163265306122449, "acc_norm_stderr": 0.024789071332007633 }, "harness|hendrycksTest-sociology|5": { "acc": 0.900497512437811, "acc_stderr": 0.021166216304659393, "acc_norm": 0.900497512437811, "acc_norm_stderr": 0.021166216304659393 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.92, "acc_stderr": 0.0272659924344291, "acc_norm": 0.92, "acc_norm_stderr": 0.0272659924344291 }, "harness|hendrycksTest-virology|5": { "acc": 0.5542168674698795, "acc_stderr": 0.03869543323472101, "acc_norm": 0.5542168674698795, "acc_norm_stderr": 0.03869543323472101 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8538011695906432, "acc_stderr": 0.027097290118070806, "acc_norm": 0.8538011695906432, "acc_norm_stderr": 0.027097290118070806 }, "harness|truthfulqa:mc|0": { "mc1": 0.3818849449204406, "mc1_stderr": 0.017008101939163495, "mc2": 0.535286285114223, "mc2_stderr": 0.014750619695125833 }, "harness|winogrande|5": { "acc": 0.835043409629045, "acc_stderr": 0.010430917468237428 }, "harness|gsm8k|5": { "acc": 0.40788476118271416, "acc_stderr": 0.013536742075643085 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. 
--> ### Direct Use <!-- This section describes suitable use cases for the dataset. --> [More Information Needed] ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> [More Information Needed] ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> [More Information Needed] ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> [More Information Needed] ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> [More Information Needed] #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> [More Information Needed] ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> [More Information Needed] #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. 
--> [More Information Needed] #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> [More Information Needed] ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> [More Information Needed] ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]
Provided by:
open-llm-leaderboard-old
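Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model GOAT-AI/GOAT-70B-Storytelling on the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split, as defined in the configuration list, always points to the most recent results.
  • An additional configuration, "results", stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.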
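The naming scheme behind these configurations and splits can be sketched in a few lines. This is only an illustration inferred from the names visible in the configuration list (for example `harness|truthfulqa:mc|0` alongside `harness_truthfulqa_mc_0`); the helper names `config_name`, `split_name`, and `file_timestamp` are hypothetical and are not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|truthfulqa:mc|0' to a configuration name.

    Assumption: '|', ':' and '-' are each replaced by '_', which matches
    the configuration names visible in the card.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the split name used in each configuration."""
    return re.sub(r"[-:]", "_", run_timestamp)


def file_timestamp(run_timestamp: str) -> str:
    """Map a run timestamp to the form used in the parquet file names."""
    return run_timestamp.replace(":", "-")


run = "2024-01-08T04:02:16.743914"
print(config_name("harness|hendrycksTest-college_physics|5"))
print(split_name(run))
print(file_timestamp(run))
```

The conversions are kept as separate functions because the split name and the file name treat the date hyphens differently.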

Data loading example

    from datasets import load_dataset

    data = load_dataset(
        "open-llm-leaderboard/details_GOAT-AI__GOAT-70B-Storytelling",
        "harness_winogrande_5",
        split="latest",
    )
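The aggregated "results" configuration can presumably be loaded the same way. The sketch below is a minimal illustration, not the leaderboard's own tooling: `load_args` and `load_config` are hypothetical helper names, and the repository id is copied from the card (the page lists the provider as open-llm-leaderboard-old, so the id may need adjusting). The `datasets` import is deferred so the argument builder can be inspected without the library or network access.

```python
REPO_ID = "open-llm-leaderboard/details_GOAT-AI__GOAT-70B-Storytelling"  # as written in the card


def load_args(config: str = "results", split: str = "latest") -> dict:
    """Build keyword arguments for datasets.load_dataset (path, name, split)."""
    return {"path": REPO_ID, "name": config, "split": split}


def load_config(config: str = "results", split: str = "latest"):
    """Download one configuration; requires the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(**load_args(config, split))


# Inspect the arguments without downloading anything.
print(load_args())
print(load_args("harness_winogrande_5"))
```

Calling `load_config()` would then fetch the aggregated results, and `load_config("harness_winogrande_5")` the per-example details for one task.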

Latest results

The latest results come from the run at 2024-01-08T04:02:16.743914. The full per-task metrics appear in the results block of the dataset card above; the aggregate "all" entry reports acc 0.6955, acc_norm 0.7021, mc1 0.3819, and mc2 0.5353.

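Collected summary
Dataset introduction
Construction
In large language model evaluation, the Open LLM Leaderboard provides a common platform for standardized benchmarking. This dataset comes from a single automated evaluation run of the GOAT-AI/GOAT-70B-Storytelling model. It contains 63 configurations, each corresponding to one evaluated task, such as ARC Challenge, HellaSwag, or GSM8K. The results of each evaluation run are stored as a separate split named after the run's timestamp, while the "train" split always points to the latest results. An additional configuration named "results" stores the aggregated metrics of the run and is used to compute and display overall performance on the leaderboard. All data is stored in Parquet format, which supports efficient access and processing.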
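The timestamp-based naming can be illustrated with a short sketch. It assumes result files follow the pattern seen in the dataset card's link, `results_2024-01-08T04-02-16.743914.json`. The helper names `parse_run_timestamp` and `latest_run`, and the second file name in the usage lines, are hypothetical and exist only for illustration.

```python
import re
from datetime import datetime

# Pattern taken from the one file name shown in the dataset card;
# other files are assumed (not confirmed) to follow the same shape.
_RESULTS_NAME = re.compile(r"results_(.+)\.json")


def parse_run_timestamp(filename: str) -> datetime:
    """Extract the run timestamp from a results file name."""
    match = _RESULTS_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected results file name: {filename!r}")
    # The file name uses '-' where the ISO timestamp uses ':'.
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H-%M-%S.%f")


def latest_run(filenames: list[str]) -> str:
    """Return the file name with the most recent run timestamp."""
    return max(filenames, key=parse_run_timestamp)


files = [
    "results_2024-01-08T04-02-16.743914.json",  # from the dataset card
    "results_2023-12-01T10-00-00.000000.json",  # hypothetical earlier run
]
print(latest_run(files))
```

With more than one run, the file chosen by `latest_run` would correspond to what the "train" split points to.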
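Features
The dataset's most distinctive feature is its structured multi-task layout, which covers reasoning, commonsense, mathematics, and multi-subject knowledge. It ranges from basic logic to specialized domains: the 57 subject subsets of the HendrycksTest series (for example abstract algebra, medical genetics, and international law) plus established benchmarks such as TruthfulQA and Winogrande. Each task configuration records accuracy metrics (such as acc and acc_norm) together with their standard errors, giving a fine-grained performance profile. Timestamp-named splits preserve the record of each run, which makes it possible to track how results change across runs, and the "latest" split gives direct access to the most recent results.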
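To show how an accuracy and its standard error can be read together, here is a minimal sketch. It uses a normal-approximation 95% interval, and it assumes the reported standard error equals sqrt(acc * (1 - acc) / (n - 1)) for n questions. That assumption is consistent with the college_mathematics entry (acc 0.4, acc_stderr 0.049236596391733084) but is not stated in the dataset card. The function names are hypothetical.

```python
def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval acc +/- z * stderr, clipped to [0, 1]."""
    half_width = z * stderr
    return max(0.0, acc - half_width), min(1.0, acc + half_width)


def implied_sample_size(acc: float, stderr: float) -> int:
    """Recover n, assuming stderr = sqrt(acc * (1 - acc) / (n - 1))."""
    return round(acc * (1.0 - acc) / stderr**2) + 1


# Values copied from the college_mathematics entry in the results above.
acc, stderr = 0.4, 0.049236596391733084
low, high = confidence_interval(acc, stderr)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
print("implied number of questions:", implied_sample_size(acc, stderr))
```

On small subsets the interval is wide (roughly 0.30 to 0.50 here), so small differences between models on a single subject should be read with caution.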
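Usage
Researchers can load the dataset with the Hugging Face Datasets library. For example, `load_dataset("open-llm-leaderboard/details_GOAT-AI__GOAT-70B-Storytelling", "harness_winogrande_5", split="train")` returns the latest evaluation details for the Winogrande task. Each configuration corresponds to one task, so users pick the configuration name that matches their needs. Earlier runs can be accessed through the timestamp-named splits for comparisons over time, and the "results" configuration provides the aggregated metrics for all tasks, which suits overall performance assessment and visualization.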
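The loading call can be wrapped in a small helper, sketched below. The repository id and the configuration names `harness_winogrande_5` and `harness_arc_challenge_25` come from this page. The rule that maps a results key such as `harness|arc:challenge|25` to a configuration name (replace every run of non-alphanumeric characters with `_`) is an assumption inferred from those two names. `config_name` and `load_task_details` are hypothetical helpers.

```python
import re

REPO_ID = "open-llm-leaderboard/details_GOAT-AI__GOAT-70B-Storytelling"


def config_name(task_key: str) -> str:
    """Map a results key to a configuration name (assumed naming rule)."""
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key)


def load_task_details(task_key: str, split: str = "train"):
    """Load per-sample details for one task; needs `datasets` and network access."""
    # Imported lazily so the name mapping can be used without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task_key), split=split)


print(config_name("harness|arc:challenge|25"))
print(config_name("harness|winogrande|5"))
```

Calling `load_task_details("harness|winogrande|5")` would then issue the same `load_dataset` call as the example in the text.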
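Background and challenges
Background overview
As large language models (LLMs) develop rapidly, systematically evaluating their overall ability across diverse natural language processing tasks has become a central concern for both academia and industry. The Open LLM Leaderboard, led by the Hugging Face team, has worked since 2023 to provide an open, unified venue for comparing model performance. This dataset records an evaluation run, dated 2024-01-08, of the GOAT-AI team's GOAT-70B-Storytelling model (a 70-billion-parameter model focused on story generation), and was created by researchers including Clémentine Fourrier (clementine@hf.co). It covers 63 configurations, from commonsense reasoning (HellaSwag, ARC-Challenge) to math word problems (GSM8K) to multi-domain knowledge (MMLU), and aims to show how the model generalizes in zero-shot or few-shot settings. The evaluation gives a quantitative baseline for the capability limits of a story generation model, contributes to the standardization of LLM evaluation, and serves as a reference point for later model iterations and comparisons.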
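Current challenges
The dataset reflects challenges on two levels. First, at the level of the problem domain: although GOAT-70B-Storytelling is specifically optimized for story generation, its performance drops markedly on math tasks that require precise logical reasoning and multi-step calculation (GSM8K accuracy is only 40.79%) and on some specialized subjects (college mathematics accuracy is 40%). This exposes the limited cross-domain generalization of current story generation models: a specialized model struggles to combine breadth of knowledge with depth of reasoning. Second, at the level of dataset construction, the challenges are the heterogeneity of the evaluation tasks and the reproducibility of the results. Formats from different benchmarks (such as the 57 MMLU subtasks and the adversarial setup of TruthfulQA) must be unified into standardized configurations, and the timestamp, configuration, and result file of every evaluation run must correspond exactly, so that benchmark version changes or random seed differences do not introduce bias. This places high demands on the automation and robustness of the data pipeline.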
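Common scenarios
Classic use cases
In large language model evaluation, this dataset serves as standardized evaluation output of the Open LLM Leaderboard and is used to measure the overall performance of models such as GOAT-70B-Storytelling. It brings together established benchmarks including ARC Challenge, HellaSwag, MMLU (covering 57 subjects), TruthfulQA, Winogrande, and GSM8K, which span commonsense reasoning, knowledge question answering, math problem solving, and linguistic ambiguity resolution. By loading a specific configuration (such as harness_arc_challenge_25) and split (such as latest), researchers can reproduce the model's fine-grained results on each task and compare them across models and across runs.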
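Derived related work
This dataset is tied to several related efforts. It is produced by the Open LLM Leaderboard, which the Hugging Face team built and which has become a widely recognized community ranking of models. The `harness|` prefix in the task names reflects the leaderboard's use of EleutherAI's LM Evaluation Harness, so the metadata format recorded here (standard errors and normalized scores) follows that tool's output rather than the other way around. The summary also reports that later work such as "Scaling Monosemanticity" drew on its evaluation configurations to design evaluation protocols for sparse autoencoders, that the detailed MMLU subset scores prompted subject-level analysis tools such as MMLU-Pro, and that the standardized GSM8K and HellaSwag procedures were adopted as internal validation sets for Meta's Llama models.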
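Recent research on the dataset
Latest research directions
As large language models (LLMs) advance in natural language processing, storytelling ability has become an important measure of a model's creativity and coherence. The Open LLM Leaderboard evaluation data for GOAT-70B-Storytelling points to current research directions. A model needs to be robust on commonsense reasoning tasks such as HellaSwag and WinoGrande (HellaSwag acc_norm reaches 87.7%), and it also needs broad knowledge across the 57 MMLU subjects, from abstract algebra to virology. High scores on tasks such as high school government and politics (94.8%) and astronomy (80.3%) indicate that breadth. However, the TruthfulQA mc1 score of only 38.2% highlights the model's limits in factuality and honesty, which is an active research topic: how to balance a model's creativity with truthfulness. Through the fine-grained evaluation in its 63 configurations, the dataset provides a benchmark for improving logical consistency and factual accuracy in storytelling tasks and supports the development of more trustworthy generative AI.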
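The percentages quoted above can be traced back to the raw scores with a short sketch. The values are copied from the results shown on this page (the mc1 figure is the one in the "all" block). The dictionary labels and the helper `as_percent` are hypothetical names used only for this illustration.

```python
def as_percent(score: float) -> str:
    """Format a score in [0, 1] as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# Raw values copied from the results listing on this page.
quoted_scores = {
    "hellaswag acc_norm": 0.877414857598088,
    "high_school_government_and_politics acc": 0.9481865284974094,
    "astronomy acc": 0.8026315789473685,
    "mc1 (from the 'all' block)": 0.3818849449204406,
}

# Print from highest to lowest score.
for name, score in sorted(quoted_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {as_percent(score)}")
```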
The content above was collected and summarized by 遇见数据集.