
open-llm-leaderboard-old/details_freecs__ThetaWave-14B-v0.1

Hugging Face | Updated 2024-01-28 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_freecs__ThetaWave-14B-v0.1
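The dataset card for this repository names its configurations after the harness task keys (for example `harness|arc:challenge|25` appears as `harness_arc_challenge_25`) and its splits after the run timestamp. The following is a minimal sketch of that mapping, inferred only from the names visible in the card. The helper names `config_name`, `split_name`, and `load_details` are hypothetical, and the rule "replace every non-alphanumeric character with an underscore" is an assumption that happens to match the listed configs, not a documented guarantee. It does not cover the aggregate `harness_hendrycksTest_5` config, which has no single task key.

```python
import re


def config_name(task_id: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption: every character that is not a letter, digit, or underscore
    becomes an underscore, as in the config names listed in the card.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_id)


def split_name(run_timestamp: str) -> str:
    """Map '2024-01-28T13:45:58.918363' to '2024_01_28T13_45_58.918363'."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def load_details(repo_id: str, task_id: str, split: str = "latest"):
    """Load per-example details for one task.

    Needs the `datasets` package and network access. The default split
    'latest' is the split name listed under each config in the card.
    """
    # Imported lazily so the two pure helpers above work offline.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name(task_id), split=split)
```

A call might look like `load_details("open-llm-leaderboard-old/details_freecs__ThetaWave-14B-v0.1", "harness|winogrande|5")`, assuming the repository id shown in this page's title is the one currently hosted; the card's own snippet uses the `open-llm-leaderboard/` prefix instead.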
Resource description:
--- pretty_name: Evaluation run of freecs/ThetaWave-14B-v0.1 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [freecs/ThetaWave-14B-v0.1](https://huggingface.co/freecs/ThetaWave-14B-v0.1)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split is always pointing to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_freecs__ThetaWave-14B-v0.1\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-28T13:45:58.918363](https://huggingface.co/datasets/open-llm-leaderboard/details_freecs__ThetaWave-14B-v0.1/blob/main/results_2024-01-28T13-45-58.918363.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.5965032661159082,\n\ \ \"acc_stderr\": 0.032717164407508395,\n \"acc_norm\": 0.6089127016997782,\n\ \ \"acc_norm_stderr\": 0.033610152499338415,\n \"mc1\": 0.25091799265605874,\n\ \ \"mc1_stderr\": 0.015176985027707696,\n \"mc2\": 0.5040944748516392,\n\ \ \"mc2_stderr\": 0.01650838155954231\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.3703071672354949,\n \"acc_stderr\": 0.014111298751674948,\n\ \ \"acc_norm\": 0.4283276450511945,\n \"acc_norm_stderr\": 0.014460496367599013\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.3354909380601474,\n\ \ \"acc_stderr\": 0.004711968379069014,\n \"acc_norm\": 0.47092212706632147,\n\ \ \"acc_norm_stderr\": 0.004981336318033636\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.33,\n \"acc_stderr\": 0.04725815626252606,\n \ \ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.04725815626252606\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.5925925925925926,\n\ \ \"acc_stderr\": 0.04244633238353228,\n \"acc_norm\": 0.5925925925925926,\n\ \ \"acc_norm_stderr\": 0.04244633238353228\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.6973684210526315,\n \"acc_stderr\": 0.037385206761196686,\n\ \ \"acc_norm\": 0.6973684210526315,\n \"acc_norm_stderr\": 0.037385206761196686\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.6,\n\ \ \"acc_stderr\": 0.049236596391733084,\n \"acc_norm\": 0.6,\n \ \ \"acc_norm_stderr\": 0.049236596391733084\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.6867924528301886,\n \"acc_stderr\": 0.028544793319055326,\n\ \ \"acc_norm\": 0.6867924528301886,\n \"acc_norm_stderr\": 0.028544793319055326\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7361111111111112,\n\ \ \"acc_stderr\": 0.03685651095897532,\n \"acc_norm\": 0.7361111111111112,\n\ \ \"acc_norm_stderr\": 
0.03685651095897532\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.45,\n \"acc_stderr\": 0.05,\n \"acc_norm\"\ : 0.45,\n \"acc_norm_stderr\": 0.05\n },\n \"harness|hendrycksTest-college_computer_science|5\"\ : {\n \"acc\": 0.48,\n \"acc_stderr\": 0.050211673156867795,\n \ \ \"acc_norm\": 0.48,\n \"acc_norm_stderr\": 0.050211673156867795\n \ \ },\n \"harness|hendrycksTest-college_mathematics|5\": {\n \"acc\"\ : 0.34,\n \"acc_stderr\": 0.04760952285695235,\n \"acc_norm\": 0.34,\n\ \ \"acc_norm_stderr\": 0.04760952285695235\n },\n \"harness|hendrycksTest-college_medicine|5\"\ : {\n \"acc\": 0.6184971098265896,\n \"acc_stderr\": 0.037038511930995215,\n\ \ \"acc_norm\": 0.6184971098265896,\n \"acc_norm_stderr\": 0.037038511930995215\n\ \ },\n \"harness|hendrycksTest-college_physics|5\": {\n \"acc\": 0.4019607843137255,\n\ \ \"acc_stderr\": 0.048786087144669955,\n \"acc_norm\": 0.4019607843137255,\n\ \ \"acc_norm_stderr\": 0.048786087144669955\n },\n \"harness|hendrycksTest-computer_security|5\"\ : {\n \"acc\": 0.75,\n \"acc_stderr\": 0.04351941398892446,\n \ \ \"acc_norm\": 0.75,\n \"acc_norm_stderr\": 0.04351941398892446\n \ \ },\n \"harness|hendrycksTest-conceptual_physics|5\": {\n \"acc\": 0.5787234042553191,\n\ \ \"acc_stderr\": 0.03227834510146267,\n \"acc_norm\": 0.5787234042553191,\n\ \ \"acc_norm_stderr\": 0.03227834510146267\n },\n \"harness|hendrycksTest-econometrics|5\"\ : {\n \"acc\": 0.4649122807017544,\n \"acc_stderr\": 0.046920083813689104,\n\ \ \"acc_norm\": 0.4649122807017544,\n \"acc_norm_stderr\": 0.046920083813689104\n\ \ },\n \"harness|hendrycksTest-electrical_engineering|5\": {\n \"acc\"\ : 0.5172413793103449,\n \"acc_stderr\": 0.04164188720169375,\n \"\ acc_norm\": 0.5172413793103449,\n \"acc_norm_stderr\": 0.04164188720169375\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.3915343915343915,\n \"acc_stderr\": 0.025138091388851112,\n \"\ acc_norm\": 0.3915343915343915,\n \"acc_norm_stderr\": 
0.025138091388851112\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4126984126984127,\n\ \ \"acc_stderr\": 0.04403438954768177,\n \"acc_norm\": 0.4126984126984127,\n\ \ \"acc_norm_stderr\": 0.04403438954768177\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.35,\n \"acc_stderr\": 0.0479372485441102,\n \ \ \"acc_norm\": 0.35,\n \"acc_norm_stderr\": 0.0479372485441102\n },\n\ \ \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.6903225806451613,\n\ \ \"acc_stderr\": 0.026302774983517414,\n \"acc_norm\": 0.6903225806451613,\n\ \ \"acc_norm_stderr\": 0.026302774983517414\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.49261083743842365,\n \"acc_stderr\": 0.03517603540361008,\n\ \ \"acc_norm\": 0.49261083743842365,\n \"acc_norm_stderr\": 0.03517603540361008\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.65,\n \"acc_stderr\": 0.047937248544110196,\n \"acc_norm\"\ : 0.65,\n \"acc_norm_stderr\": 0.047937248544110196\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7636363636363637,\n \"acc_stderr\": 0.03317505930009182,\n\ \ \"acc_norm\": 0.7636363636363637,\n \"acc_norm_stderr\": 0.03317505930009182\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7626262626262627,\n \"acc_stderr\": 0.030313710538198906,\n \"\ acc_norm\": 0.7626262626262627,\n \"acc_norm_stderr\": 0.030313710538198906\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8238341968911918,\n \"acc_stderr\": 0.027493504244548057,\n\ \ \"acc_norm\": 0.8238341968911918,\n \"acc_norm_stderr\": 0.027493504244548057\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.5743589743589743,\n \"acc_stderr\": 0.02506909438729653,\n \ \ \"acc_norm\": 0.5743589743589743,\n \"acc_norm_stderr\": 0.02506909438729653\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": 
{\n \"\ acc\": 0.34074074074074073,\n \"acc_stderr\": 0.02889774874113114,\n \ \ \"acc_norm\": 0.34074074074074073,\n \"acc_norm_stderr\": 0.02889774874113114\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6470588235294118,\n \"acc_stderr\": 0.031041941304059285,\n\ \ \"acc_norm\": 0.6470588235294118,\n \"acc_norm_stderr\": 0.031041941304059285\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.304635761589404,\n \"acc_stderr\": 0.03757949922943343,\n \"acc_norm\"\ : 0.304635761589404,\n \"acc_norm_stderr\": 0.03757949922943343\n },\n\ \ \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\": 0.8146788990825689,\n\ \ \"acc_stderr\": 0.01665927970029582,\n \"acc_norm\": 0.8146788990825689,\n\ \ \"acc_norm_stderr\": 0.01665927970029582\n },\n \"harness|hendrycksTest-high_school_statistics|5\"\ : {\n \"acc\": 0.4398148148148148,\n \"acc_stderr\": 0.03385177976044811,\n\ \ \"acc_norm\": 0.4398148148148148,\n \"acc_norm_stderr\": 0.03385177976044811\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.7745098039215687,\n \"acc_stderr\": 0.029331162294251735,\n \"\ acc_norm\": 0.7745098039215687,\n \"acc_norm_stderr\": 0.029331162294251735\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7805907172995781,\n \"acc_stderr\": 0.026939106581553945,\n \ \ \"acc_norm\": 0.7805907172995781,\n \"acc_norm_stderr\": 0.026939106581553945\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.672645739910314,\n\ \ \"acc_stderr\": 0.03149384670994131,\n \"acc_norm\": 0.672645739910314,\n\ \ \"acc_norm_stderr\": 0.03149384670994131\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.6946564885496184,\n \"acc_stderr\": 0.040393149787245605,\n\ \ \"acc_norm\": 0.6946564885496184,\n \"acc_norm_stderr\": 0.040393149787245605\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7851239669421488,\n 
\"acc_stderr\": 0.037494924487096966,\n \"\ acc_norm\": 0.7851239669421488,\n \"acc_norm_stderr\": 0.037494924487096966\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7129629629629629,\n\ \ \"acc_stderr\": 0.043733130409147614,\n \"acc_norm\": 0.7129629629629629,\n\ \ \"acc_norm_stderr\": 0.043733130409147614\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7361963190184049,\n \"acc_stderr\": 0.034624199316156234,\n\ \ \"acc_norm\": 0.7361963190184049,\n \"acc_norm_stderr\": 0.034624199316156234\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.4642857142857143,\n\ \ \"acc_stderr\": 0.04733667890053756,\n \"acc_norm\": 0.4642857142857143,\n\ \ \"acc_norm_stderr\": 0.04733667890053756\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7475728155339806,\n \"acc_stderr\": 0.04301250399690879,\n\ \ \"acc_norm\": 0.7475728155339806,\n \"acc_norm_stderr\": 0.04301250399690879\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8803418803418803,\n\ \ \"acc_stderr\": 0.021262719400407002,\n \"acc_norm\": 0.8803418803418803,\n\ \ \"acc_norm_stderr\": 0.021262719400407002\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.73,\n \"acc_stderr\": 0.0446196043338474,\n \ \ \"acc_norm\": 0.73,\n \"acc_norm_stderr\": 0.0446196043338474\n },\n\ \ \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.7777777777777778,\n\ \ \"acc_stderr\": 0.014866821664709588,\n \"acc_norm\": 0.7777777777777778,\n\ \ \"acc_norm_stderr\": 0.014866821664709588\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.6473988439306358,\n \"acc_stderr\": 0.025722802200895817,\n\ \ \"acc_norm\": 0.6473988439306358,\n \"acc_norm_stderr\": 0.025722802200895817\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.31508379888268156,\n\ \ \"acc_stderr\": 0.015536850852473642,\n \"acc_norm\": 0.31508379888268156,\n\ \ \"acc_norm_stderr\": 0.015536850852473642\n },\n 
\"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7156862745098039,\n \"acc_stderr\": 0.025829163272757485,\n\ \ \"acc_norm\": 0.7156862745098039,\n \"acc_norm_stderr\": 0.025829163272757485\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.6752411575562701,\n\ \ \"acc_stderr\": 0.02659678228769704,\n \"acc_norm\": 0.6752411575562701,\n\ \ \"acc_norm_stderr\": 0.02659678228769704\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7067901234567902,\n \"acc_stderr\": 0.025329888171900926,\n\ \ \"acc_norm\": 0.7067901234567902,\n \"acc_norm_stderr\": 0.025329888171900926\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.4574468085106383,\n \"acc_stderr\": 0.029719281272236848,\n \ \ \"acc_norm\": 0.4574468085106383,\n \"acc_norm_stderr\": 0.029719281272236848\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4348109517601043,\n\ \ \"acc_stderr\": 0.012661233805616302,\n \"acc_norm\": 0.4348109517601043,\n\ \ \"acc_norm_stderr\": 0.012661233805616302\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.625,\n \"acc_stderr\": 0.029408372932278746,\n \ \ \"acc_norm\": 0.625,\n \"acc_norm_stderr\": 0.029408372932278746\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6339869281045751,\n \"acc_stderr\": 0.01948802574552967,\n \ \ \"acc_norm\": 0.6339869281045751,\n \"acc_norm_stderr\": 0.01948802574552967\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6545454545454545,\n\ \ \"acc_stderr\": 0.04554619617541054,\n \"acc_norm\": 0.6545454545454545,\n\ \ \"acc_norm_stderr\": 0.04554619617541054\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.726530612244898,\n \"acc_stderr\": 0.028535560337128448,\n\ \ \"acc_norm\": 0.726530612244898,\n \"acc_norm_stderr\": 0.028535560337128448\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.7810945273631841,\n\ \ \"acc_stderr\": 
0.029239174636647,\n \"acc_norm\": 0.7810945273631841,\n\ \ \"acc_norm_stderr\": 0.029239174636647\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.81,\n \"acc_stderr\": 0.03942772444036625,\n \ \ \"acc_norm\": 0.81,\n \"acc_norm_stderr\": 0.03942772444036625\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5180722891566265,\n\ \ \"acc_stderr\": 0.03889951252827216,\n \"acc_norm\": 0.5180722891566265,\n\ \ \"acc_norm_stderr\": 0.03889951252827216\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8654970760233918,\n \"acc_stderr\": 0.026168221344662297,\n\ \ \"acc_norm\": 0.8654970760233918,\n \"acc_norm_stderr\": 0.026168221344662297\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.25091799265605874,\n\ \ \"mc1_stderr\": 0.015176985027707696,\n \"mc2\": 0.5040944748516392,\n\ \ \"mc2_stderr\": 0.01650838155954231\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.654301499605367,\n \"acc_stderr\": 0.013366596951934375\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.0,\n \"acc_stderr\"\ : 0.0\n }\n}\n```" repo_url: https://huggingface.co/freecs/ThetaWave-14B-v0.1 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|arc:challenge|25_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-28T13-45-58.918363.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|gsm8k|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hellaswag|10_2024-01-28T13-45-58.918363.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-28T13-45-58.918363.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-28T13-45-58.918363.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-28T13-45-58.918363.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-28T13-45-58.918363.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-28T13-45-58.918363.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-28T13-45-58.918363.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-28T13-45-58.918363.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-management|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-28T13-45-58.918363.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|truthfulqa:mc|0_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-28T13-45-58.918363.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_28T13_45_58.918363 path: - '**/details_harness|winogrande|5_2024-01-28T13-45-58.918363.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-28T13-45-58.918363.parquet' - config_name: results data_files: - split: 
2024_01_28T13_45_58.918363 path: - results_2024-01-28T13-45-58.918363.parquet - split: latest path: - results_2024-01-28T13-45-58.918363.parquet ---

# Dataset Card for Evaluation run of freecs/ThetaWave-14B-v0.1

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [freecs/ThetaWave-14B-v0.1](https://huggingface.co/freecs/ThetaWave-14B-v0.1) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# Each configuration exposes the run-timestamp split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_freecs__ThetaWave-14B-v0.1",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-01-28T13:45:58.918363](https://huggingface.co/datasets/open-llm-leaderboard/details_freecs__ThetaWave-14B-v0.1/blob/main/results_2024-01-28T13-45-58.918363.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.5965032661159082, "acc_stderr": 0.032717164407508395, "acc_norm": 0.6089127016997782, "acc_norm_stderr": 0.033610152499338415, "mc1": 0.25091799265605874, "mc1_stderr": 0.015176985027707696, "mc2": 0.5040944748516392, "mc2_stderr": 0.01650838155954231 }, "harness|arc:challenge|25": { "acc": 0.3703071672354949, "acc_stderr": 0.014111298751674948, "acc_norm": 0.4283276450511945, "acc_norm_stderr": 0.014460496367599013 }, "harness|hellaswag|10": { "acc": 0.3354909380601474, "acc_stderr": 0.004711968379069014, "acc_norm": 0.47092212706632147, "acc_norm_stderr": 0.004981336318033636 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.33, "acc_stderr": 0.04725815626252606, "acc_norm": 0.33, "acc_norm_stderr": 0.04725815626252606 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.5925925925925926, "acc_stderr": 0.04244633238353228, "acc_norm": 0.5925925925925926, "acc_norm_stderr": 0.04244633238353228 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.6973684210526315, "acc_stderr": 0.037385206761196686, "acc_norm": 0.6973684210526315, "acc_norm_stderr": 0.037385206761196686 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.6, "acc_stderr": 0.049236596391733084, "acc_norm": 0.6, "acc_norm_stderr": 0.049236596391733084 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6867924528301886, "acc_stderr": 0.028544793319055326, "acc_norm": 0.6867924528301886, "acc_norm_stderr": 0.028544793319055326 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7361111111111112, "acc_stderr": 0.03685651095897532, "acc_norm": 0.7361111111111112, "acc_norm_stderr": 0.03685651095897532 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.45, "acc_stderr": 0.05, "acc_norm": 0.45, "acc_norm_stderr": 0.05 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.48, "acc_stderr": 0.050211673156867795, "acc_norm": 0.48, "acc_norm_stderr": 0.050211673156867795 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.34, "acc_stderr": 0.04760952285695235, "acc_norm": 0.34, "acc_norm_stderr": 0.04760952285695235 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6184971098265896, "acc_stderr": 0.037038511930995215, "acc_norm": 0.6184971098265896, "acc_norm_stderr": 0.037038511930995215 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.4019607843137255, "acc_stderr": 0.048786087144669955, "acc_norm": 0.4019607843137255, "acc_norm_stderr": 0.048786087144669955 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.75, "acc_stderr": 0.04351941398892446, "acc_norm": 0.75, "acc_norm_stderr": 0.04351941398892446 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5787234042553191, "acc_stderr": 0.03227834510146267, "acc_norm": 0.5787234042553191, "acc_norm_stderr": 0.03227834510146267 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.4649122807017544, "acc_stderr": 0.046920083813689104, "acc_norm": 0.4649122807017544, "acc_norm_stderr": 0.046920083813689104 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5172413793103449, "acc_stderr": 0.04164188720169375, "acc_norm": 0.5172413793103449, "acc_norm_stderr": 0.04164188720169375 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.3915343915343915, "acc_stderr": 0.025138091388851112, "acc_norm": 0.3915343915343915, "acc_norm_stderr": 0.025138091388851112 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4126984126984127, "acc_stderr": 0.04403438954768177, "acc_norm": 0.4126984126984127, "acc_norm_stderr": 0.04403438954768177 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.35, "acc_stderr": 0.0479372485441102, "acc_norm": 0.35, "acc_norm_stderr": 0.0479372485441102 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.6903225806451613, "acc_stderr": 0.026302774983517414, "acc_norm": 0.6903225806451613, "acc_norm_stderr": 0.026302774983517414 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.49261083743842365, "acc_stderr": 0.03517603540361008, "acc_norm": 0.49261083743842365, "acc_norm_stderr": 0.03517603540361008 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.65, "acc_stderr": 0.047937248544110196, "acc_norm": 0.65, "acc_norm_stderr": 0.047937248544110196 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7636363636363637, "acc_stderr": 0.03317505930009182, "acc_norm": 0.7636363636363637, "acc_norm_stderr": 0.03317505930009182 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7626262626262627, "acc_stderr": 0.030313710538198906, "acc_norm": 0.7626262626262627, "acc_norm_stderr": 0.030313710538198906 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8238341968911918, "acc_stderr": 0.027493504244548057, "acc_norm": 0.8238341968911918, "acc_norm_stderr": 0.027493504244548057 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.5743589743589743, "acc_stderr": 0.02506909438729653, "acc_norm": 0.5743589743589743, "acc_norm_stderr": 0.02506909438729653 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.34074074074074073, "acc_stderr": 0.02889774874113114, "acc_norm": 0.34074074074074073, "acc_norm_stderr": 0.02889774874113114 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6470588235294118, "acc_stderr": 0.031041941304059285, "acc_norm": 0.6470588235294118, "acc_norm_stderr": 0.031041941304059285 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.304635761589404, "acc_stderr": 0.03757949922943343, "acc_norm": 0.304635761589404, "acc_norm_stderr": 0.03757949922943343 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8146788990825689, "acc_stderr": 0.01665927970029582, "acc_norm": 0.8146788990825689, "acc_norm_stderr": 0.01665927970029582 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4398148148148148, "acc_stderr": 0.03385177976044811, "acc_norm": 0.4398148148148148, "acc_norm_stderr": 
0.03385177976044811 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.7745098039215687, "acc_stderr": 0.029331162294251735, "acc_norm": 0.7745098039215687, "acc_norm_stderr": 0.029331162294251735 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7805907172995781, "acc_stderr": 0.026939106581553945, "acc_norm": 0.7805907172995781, "acc_norm_stderr": 0.026939106581553945 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.672645739910314, "acc_stderr": 0.03149384670994131, "acc_norm": 0.672645739910314, "acc_norm_stderr": 0.03149384670994131 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.6946564885496184, "acc_stderr": 0.040393149787245605, "acc_norm": 0.6946564885496184, "acc_norm_stderr": 0.040393149787245605 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7851239669421488, "acc_stderr": 0.037494924487096966, "acc_norm": 0.7851239669421488, "acc_norm_stderr": 0.037494924487096966 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7129629629629629, "acc_stderr": 0.043733130409147614, "acc_norm": 0.7129629629629629, "acc_norm_stderr": 0.043733130409147614 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7361963190184049, "acc_stderr": 0.034624199316156234, "acc_norm": 0.7361963190184049, "acc_norm_stderr": 0.034624199316156234 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.4642857142857143, "acc_stderr": 0.04733667890053756, "acc_norm": 0.4642857142857143, "acc_norm_stderr": 0.04733667890053756 }, "harness|hendrycksTest-management|5": { "acc": 0.7475728155339806, "acc_stderr": 0.04301250399690879, "acc_norm": 0.7475728155339806, "acc_norm_stderr": 0.04301250399690879 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8803418803418803, "acc_stderr": 0.021262719400407002, "acc_norm": 0.8803418803418803, "acc_norm_stderr": 0.021262719400407002 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.73, "acc_stderr": 0.0446196043338474, "acc_norm": 0.73, "acc_norm_stderr": 0.0446196043338474 
}, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.7777777777777778, "acc_stderr": 0.014866821664709588, "acc_norm": 0.7777777777777778, "acc_norm_stderr": 0.014866821664709588 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.6473988439306358, "acc_stderr": 0.025722802200895817, "acc_norm": 0.6473988439306358, "acc_norm_stderr": 0.025722802200895817 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.31508379888268156, "acc_stderr": 0.015536850852473642, "acc_norm": 0.31508379888268156, "acc_norm_stderr": 0.015536850852473642 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7156862745098039, "acc_stderr": 0.025829163272757485, "acc_norm": 0.7156862745098039, "acc_norm_stderr": 0.025829163272757485 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.6752411575562701, "acc_stderr": 0.02659678228769704, "acc_norm": 0.6752411575562701, "acc_norm_stderr": 0.02659678228769704 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7067901234567902, "acc_stderr": 0.025329888171900926, "acc_norm": 0.7067901234567902, "acc_norm_stderr": 0.025329888171900926 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.4574468085106383, "acc_stderr": 0.029719281272236848, "acc_norm": 0.4574468085106383, "acc_norm_stderr": 0.029719281272236848 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4348109517601043, "acc_stderr": 0.012661233805616302, "acc_norm": 0.4348109517601043, "acc_norm_stderr": 0.012661233805616302 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.625, "acc_stderr": 0.029408372932278746, "acc_norm": 0.625, "acc_norm_stderr": 0.029408372932278746 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6339869281045751, "acc_stderr": 0.01948802574552967, "acc_norm": 0.6339869281045751, "acc_norm_stderr": 0.01948802574552967 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6545454545454545, "acc_stderr": 0.04554619617541054, "acc_norm": 0.6545454545454545, "acc_norm_stderr": 0.04554619617541054 }, 
"harness|hendrycksTest-security_studies|5": { "acc": 0.726530612244898, "acc_stderr": 0.028535560337128448, "acc_norm": 0.726530612244898, "acc_norm_stderr": 0.028535560337128448 }, "harness|hendrycksTest-sociology|5": { "acc": 0.7810945273631841, "acc_stderr": 0.029239174636647, "acc_norm": 0.7810945273631841, "acc_norm_stderr": 0.029239174636647 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.81, "acc_stderr": 0.03942772444036625, "acc_norm": 0.81, "acc_norm_stderr": 0.03942772444036625 }, "harness|hendrycksTest-virology|5": { "acc": 0.5180722891566265, "acc_stderr": 0.03889951252827216, "acc_norm": 0.5180722891566265, "acc_norm_stderr": 0.03889951252827216 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8654970760233918, "acc_stderr": 0.026168221344662297, "acc_norm": 0.8654970760233918, "acc_norm_stderr": 0.026168221344662297 }, "harness|truthfulqa:mc|0": { "mc1": 0.25091799265605874, "mc1_stderr": 0.015176985027707696, "mc2": 0.5040944748516392, "mc2_stderr": 0.01650838155954231 }, "harness|winogrande|5": { "acc": 0.654301499605367, "acc_stderr": 0.013366596951934375 }, "harness|gsm8k|5": { "acc": 0.0, "acc_stderr": 0.0 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard-old
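Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model freecs/ThetaWave-14B-v0.1 on the Open LLM Leaderboard.

Dataset structure

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run is stored as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.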
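
The naming scheme described above can be sketched in code. This is a minimal illustration that assumes, based only on the configuration and split names visible in this card, that a configuration name is the task key with every non-alphanumeric character replaced by an underscore, and that a split name is the run timestamp with `-` and `:` replaced by `_`. The helper names are hypothetical and not part of any library.

```python
import re


def task_to_config_name(task_key: str) -> str:
    """Map a results key such as 'harness|winogrande|5' to a configuration name.

    Assumption: every character that is not a letter or digit becomes '_'.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def timestamp_to_split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the split name used in the configurations.

    Assumption: '-' and ':' become '_', while the fractional-second dot is kept.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


print(task_to_config_name("harness|truthfulqa:mc|0"))
print(timestamp_to_split_name("2024-01-28T13:45:58.918363"))
```

Both outputs can be checked against the `config_name` and `split` entries in the card's metadata.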

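Data loading example

```python
from datasets import load_dataset

# Each configuration exposes the run-timestamp split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_freecs__ThetaWave-14B-v0.1",
    "harness_winogrande_5",
    split="latest",
)
```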

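Latest results

The latest results come from the run of 2024-01-28T13:45:58.918363; the per-task scores are listed in the results JSON of the dataset card.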


Collected Summary
Dataset Introduction
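Construction
This dataset was generated automatically when the Open LLM Leaderboard evaluated the model freecs/ThetaWave-14B-v0.1. It consists of 63 configurations, each corresponding to one evaluated task. The dataset was built from a single run. The results of each run are stored in every configuration as a split named after the run's timestamp, while the 'train' split always points to the latest results. An additional 'results' configuration stores the aggregated metrics of all runs and supports the computation and display of the aggregate scores on the leaderboard.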
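The timestamp-based naming described above can be sketched in a few lines. This is an illustration only: the helper names `latest_run` and `results_filename` are hypothetical, the second timestamp in the list is made up to show the selection, and the file-name pattern simply mirrors the results file linked from the dataset card (colons in the timestamp replaced by hyphens).

```python
from datetime import datetime


def latest_run(timestamps):
    """Return the most recent run from a list of ISO-8601 timestamps.

    This is the run that the 'train' split is described as pointing to.
    """
    return max(timestamps, key=datetime.fromisoformat)


def results_filename(timestamp):
    """Build a results file name from a run timestamp.

    Assumption: follows the pattern of the file linked in the dataset card,
    i.e. 'results_' + timestamp with ':' replaced by '-' + '.json'.
    """
    return "results_" + timestamp.replace(":", "-") + ".json"


runs = [
    "2024-01-27T09:00:00.000000",  # hypothetical earlier run, for illustration only
    "2024-01-28T13:45:58.918363",  # the single run recorded for this dataset
]

newest = latest_run(runs)
print(newest)
print(results_filename(newest))
```

With the single real run, the helper reproduces the file name that appears in the card, `results_2024-01-28T13-45-58.918363.json`.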
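Features
The dataset's defining feature is its structured multi-task evaluation suite, which covers ARC-Challenge, HellaSwag, GSM8K, TruthfulQA, WinoGrande, and the Hendrycks Test spanning a wide range of academic subjects. Each subtask exists as its own configuration, so researchers can analyze the model's performance on a specific dimension in detail. Besides the raw accuracy of each task, the dataset records the normalized accuracy and the corresponding standard errors, which gives a quantitative basis for judging how reliable the results are. The latest results are presented as JSON and give an overview of the model's performance across 57 different benchmarks.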
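As a worked example of how the standard errors can be read, the sketch below turns a score and its standard error into an approximate 95% confidence interval. It assumes a normal approximation (z = 1.96), which the dataset card itself does not state; the helper name `confidence_interval` is hypothetical. The numbers are the ARC-Challenge `acc_norm` and `acc_norm_stderr` values from the latest results.

```python
def confidence_interval(score, stderr, z=1.96):
    """Approximate confidence interval under a normal approximation.

    z=1.96 corresponds to roughly 95% coverage (an assumption of this sketch).
    """
    half_width = z * stderr
    return score - half_width, score + half_width


# Values copied from the "harness|arc:challenge|25" entry of the latest results.
arc_acc_norm = 0.4283276450511945
arc_acc_norm_stderr = 0.014460496367599013

low, high = confidence_interval(arc_acc_norm, arc_acc_norm_stderr)
print(f"ARC-Challenge acc_norm: {arc_acc_norm:.3f} (95% CI {low:.3f} to {high:.3f})")
# prints: ARC-Challenge acc_norm: 0.428 (95% CI 0.400 to 0.457)
```

Under this approximation the whole interval stays below 0.5, so the sub-50% normalized accuracy on ARC-Challenge is unlikely to be a sampling artifact.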
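Usage
Researchers can load the dataset with the Hugging Face datasets library. Calling load_dataset with the dataset name, the configuration name of the target task (for example 'harness_winogrande_5'), and the desired split (such as 'train') returns the detailed evaluation data for that task. This design allows the results of a single task to be accessed independently, and the 'results' configuration exposes the global aggregated metrics, which gives a flexible interface for comparing models against each other and for tracking a model over time.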
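A minimal sketch of this loading pattern follows. The `load_dataset(name, config, split=...)` call is the one shown in the dataset card; everything else is an assumption of this sketch: the helper names are hypothetical, and the mapping from a results key such as `harness|winogrande|5` to a configuration name is inferred from the single example `harness_winogrande_5`. The repository id is written as in the card's own example, while the page header lists the dataset under `open-llm-leaderboard-old`, so the id may need adjusting.

```python
import re

# Repository id as written in the dataset card's loading example (may need adjusting).
REPO_ID = "open-llm-leaderboard/details_freecs__ThetaWave-14B-v0.1"


def config_name(task_key):
    """Map a results key to a configuration name.

    Assumption: '|', ':' and '-' are all replaced by '_', inferred from
    'harness|winogrande|5' corresponding to 'harness_winogrande_5'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def load_task_details(task_key, split="train"):
    """Load the per-sample details of one task (requires network access)."""
    # Imported lazily so the helpers above work without the datasets library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task_key), split=split)


def load_aggregated_results(split="train"):
    """Load the 'results' configuration that holds the aggregated metrics."""
    from datasets import load_dataset

    return load_dataset(REPO_ID, "results", split=split)


print(config_name("harness|winogrande|5"))
```

Calling `load_task_details("harness|winogrande|5")` would then request the same configuration and split as the example in the dataset card.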
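Background and Challenges
Background Overview
As the field of large language models (LLMs) has grown rapidly, systematically evaluating how models perform on diverse natural language understanding and reasoning tasks has become a key question for technical progress. The Open LLM Leaderboard was launched by the HuggingFace team in 2023 to provide a transparent and reproducible benchmark platform for model performance. Its core research question is how well LLMs of different sizes and architectures generalize across commonsense reasoning, knowledge question answering, and mathematical problem solving. As a product of that evaluation pipeline, the open-llm-leaderboard/details_freecs__ThetaWave-14B-v0.1 dataset records the detailed evaluation results of the freecs/ThetaWave-14B-v0.1 model on 63 task configurations, covering mainstream benchmarks such as ARC-Challenge, HellaSwag, and GSM8K. Its value lies in giving researchers a cross-task, fine-grained performance breakdown that helps identify directions for model improvement.
Current Challenges
The central challenge this dataset reflects is the limitation of large language models in complex reasoning and in applying multidisciplinary knowledge. On mathematical reasoning tasks such as GSM8K, the model's accuracy is close to zero, which exposes a deep weakness of current LLMs in symbolic computation and step-by-step reasoning. On commonsense reasoning (HellaSwag) and science question answering (ARC-Challenge), normalized accuracy also stays below 50%, which points to an insufficient grasp of implicit semantics and causal relations. On the construction side, the evaluation pipeline has to manage the data formats and metrics of 63 heterogeneous tasks at once and keep the parquet files of every configuration correctly associated with the timestamped splits. This calls for a rigorous automated pipeline that avoids version mix-ups, and it also has to balance incremental updates from multiple evaluation rounds against immediate availability of the latest results, which places high demands on the robustness of data governance.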
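Common Scenarios
Classic Use Cases
In LLM evaluation, the Open LLM Leaderboard evaluation datasets are widely used for standardized comparison of model performance. The evaluation data for the ThetaWave-14B-v0.1 model covers classic benchmark tasks including ARC-Challenge, HellaSwag, MMLU (covering 57 subjects), TruthfulQA, Winogrande, and GSM8K. Each task records fine-grained metrics such as accuracy (acc) and normalized accuracy (acc_norm), which gives researchers a reproducible evaluation procedure and allows different models to be compared fairly within one framework.
Practical Applications
In practice, evaluation datasets of this kind can inform model selection and deployment decisions. For example, a company or research institution choosing an LLM for customer-service dialogue, educational tutoring, or medical question answering can assess candidates by their scores on individual subtasks (such as medical knowledge in MMLU or mathematical ability in GSM8K) and pick the model that best fits its requirements. The evaluation results can also serve as a quantitative measure of improvement across model iterations.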
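The subtask-based selection described above might look like the following sketch. It averages three medicine-related Hendrycks Test (MMLU) accuracies taken from the latest results. The grouping of subjects into a "medical" domain, the helper name `domain_score`, and the use of an unweighted mean (not weighted by the number of questions per subject) are all assumptions of this illustration, not part of the leaderboard's method.

```python
from statistics import mean

# "acc" values copied from the latest results shown in the dataset card.
scores = {
    "harness|hendrycksTest-anatomy|5": 0.5925925925925926,
    "harness|hendrycksTest-clinical_knowledge|5": 0.6867924528301886,
    "harness|hendrycksTest-college_medicine|5": 0.6184971098265896,
    "harness|hendrycksTest-college_mathematics|5": 0.34,
}


def domain_score(task_scores, keywords):
    """Unweighted mean accuracy over tasks whose key contains any keyword."""
    selected = [
        acc
        for task, acc in task_scores.items()
        if any(keyword in task for keyword in keywords)
    ]
    if not selected:
        raise ValueError("no task matched the given keywords")
    return mean(selected)


# Hypothetical grouping of subjects into a "medical" domain.
medical = domain_score(scores, ["anatomy", "clinical_knowledge", "college_medicine"])
print(f"medical subjects, mean acc: {medical:.3f}")
# prints: medical subjects, mean acc: 0.633
```

Computing the same figure for each candidate model would give one comparable number per domain, though a weighted mean may be preferable when subject sizes differ a lot.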
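Derived and Related Work
A range of work builds on the Open LLM Leaderboard evaluation framework. For example, researchers have used such data to analyze how performance scales across models of different sizes and architectures. Other work uses the fine-grained task scores to build capability profiles of models, or combines them with adversarial examples to analyze model robustness. The framework has also been used to validate new training paradigms (such as reinforcement-learning-based alignment and sparse training), which has helped LLM evaluation methods continue to evolve.
The above content was collected and summarized by 遇见数据集.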