
open-llm-leaderboard-old/details_AA051610__Q

Hugging Face · Updated 2024-01-20 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_AA051610__Q
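The card for this dataset lists results under task keys such as `harness|arc:challenge|25`, while its configurations, splits, and parquet files use slightly different spellings of the same task and run timestamp. The following is a minimal sketch of that naming pattern as inferred from the names in this listing only; the helper names `config_name`, `split_name`, and `file_timestamp` are hypothetical and are not part of the `datasets` library or the leaderboard tooling.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a results key into a config name.

    Assumption (inferred from the listed configs): '|', ':' and '-'
    are each replaced by '_'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn an ISO-style run timestamp into the per-run split name.

    Assumption: '-' and ':' become '_'; the 'T' and the fractional
    seconds are left untouched.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


def file_timestamp(run_timestamp: str) -> str:
    """Turn a run timestamp into the form used in parquet file names.

    Assumption: only ':' is replaced, by '-'.
    """
    return run_timestamp.replace(":", "-")


if __name__ == "__main__":
    print(config_name("harness|hendrycksTest-abstract_algebra|5"))
    print(split_name("2024-01-20T09:11:03.066548"))
    print(file_timestamp("2024-01-20T09:11:03.066548"))
```

A name produced by `config_name` can then be passed as the second argument to `load_dataset`, as in the loading snippet in the card summary.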
Resource summary:
--- pretty_name: Evaluation run of AA051610/Q dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [AA051610/Q](https://huggingface.co/AA051610/Q) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can, for instance, do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_AA051610__Q\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-20T09:11:03.066548](https://huggingface.co/datasets/open-llm-leaderboard/details_AA051610__Q/blob/main/results_2024-01-20T09-11-03.066548.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.7399080967661088,\n\ \ \"acc_stderr\": 0.02875112016204294,\n \"acc_norm\": 0.7517342136662964,\n\ \ \"acc_norm_stderr\": 0.02932406494014928,\n \"mc1\": 0.412484700122399,\n\ \ \"mc1_stderr\": 0.01723329939957122,\n \"mc2\": 0.5935958667241532,\n\ \ \"mc2_stderr\": 0.015329701989808613\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6450511945392492,\n \"acc_stderr\": 0.013983036904094089,\n\ \ \"acc_norm\": 0.6697952218430034,\n \"acc_norm_stderr\": 0.013743085603760426\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6638119896434973,\n\ \ \"acc_stderr\": 0.004714386376337134,\n \"acc_norm\": 0.8567018522206732,\n\ \ \"acc_norm_stderr\": 0.0034966056729606905\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.43,\n \"acc_stderr\": 0.049756985195624284,\n \ \ \"acc_norm\": 0.43,\n \"acc_norm_stderr\": 0.049756985195624284\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.762962962962963,\n\ \ \"acc_stderr\": 0.03673731683969506,\n \"acc_norm\": 0.762962962962963,\n\ \ \"acc_norm_stderr\": 0.03673731683969506\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.8486842105263158,\n \"acc_stderr\": 0.02916263159684399,\n\ \ \"acc_norm\": 0.8486842105263158,\n \"acc_norm_stderr\": 0.02916263159684399\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.79,\n\ \ \"acc_stderr\": 0.040936018074033256,\n \"acc_norm\": 0.79,\n \ \ \"acc_norm_stderr\": 0.040936018074033256\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.8037735849056604,\n \"acc_stderr\": 0.024442388131100824,\n\ \ \"acc_norm\": 0.8037735849056604,\n \"acc_norm_stderr\": 0.024442388131100824\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.8472222222222222,\n\ \ \"acc_stderr\": 0.03008574324856567,\n \"acc_norm\": 0.8472222222222222,\n\ \ \"acc_norm_stderr\": 0.03008574324856567\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.47,\n \"acc_stderr\": 0.05016135580465919,\n \ \ \"acc_norm\": 0.47,\n \"acc_norm_stderr\": 0.05016135580465919\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.59,\n \"acc_stderr\": 0.04943110704237102,\n \"acc_norm\": 0.59,\n\ \ \"acc_norm_stderr\": 0.04943110704237102\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.4,\n \"acc_stderr\": 0.049236596391733084,\n \ \ \"acc_norm\": 0.4,\n \"acc_norm_stderr\": 0.049236596391733084\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6878612716763006,\n\ \ \"acc_stderr\": 0.035331333893236574,\n \"acc_norm\": 0.6878612716763006,\n\ \ \"acc_norm_stderr\": 0.035331333893236574\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.5392156862745098,\n \"acc_stderr\": 0.04959859966384181,\n\ \ \"acc_norm\": 0.5392156862745098,\n \"acc_norm_stderr\": 0.04959859966384181\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.77,\n \"acc_stderr\": 0.04229525846816505,\n \"acc_norm\": 0.77,\n\ \ \"acc_norm_stderr\": 0.04229525846816505\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.774468085106383,\n \"acc_stderr\": 0.027321078417387536,\n\ \ \"acc_norm\": 0.774468085106383,\n \"acc_norm_stderr\": 0.027321078417387536\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5701754385964912,\n\ \ \"acc_stderr\": 0.04657047260594964,\n \"acc_norm\": 0.5701754385964912,\n\ \ \"acc_norm_stderr\": 0.04657047260594964\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.7310344827586207,\n \"acc_stderr\": 0.036951833116502325,\n\ \ \"acc_norm\": 0.7310344827586207,\n \"acc_norm_stderr\": 0.036951833116502325\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.7513227513227513,\n \"acc_stderr\": 0.02226181769240016,\n \"\ acc_norm\": 0.7513227513227513,\n 
\"acc_norm_stderr\": 0.02226181769240016\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.5238095238095238,\n\ \ \"acc_stderr\": 0.04467062628403273,\n \"acc_norm\": 0.5238095238095238,\n\ \ \"acc_norm_stderr\": 0.04467062628403273\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.52,\n \"acc_stderr\": 0.050211673156867795,\n \ \ \"acc_norm\": 0.52,\n \"acc_norm_stderr\": 0.050211673156867795\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\ : 0.8870967741935484,\n \"acc_stderr\": 0.01800360332586363,\n \"\ acc_norm\": 0.8870967741935484,\n \"acc_norm_stderr\": 0.01800360332586363\n\ \ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\ : 0.5960591133004927,\n \"acc_stderr\": 0.03452453903822033,\n \"\ acc_norm\": 0.5960591133004927,\n \"acc_norm_stderr\": 0.03452453903822033\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.77,\n \"acc_stderr\": 0.042295258468165044,\n \"acc_norm\"\ : 0.77,\n \"acc_norm_stderr\": 0.042295258468165044\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.8606060606060606,\n \"acc_stderr\": 0.027045948825865414,\n\ \ \"acc_norm\": 0.8606060606060606,\n \"acc_norm_stderr\": 0.027045948825865414\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.9040404040404041,\n \"acc_stderr\": 0.020984808610047933,\n \"\ acc_norm\": 0.9040404040404041,\n \"acc_norm_stderr\": 0.020984808610047933\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.9585492227979274,\n \"acc_stderr\": 0.014385432857476442,\n\ \ \"acc_norm\": 0.9585492227979274,\n \"acc_norm_stderr\": 0.014385432857476442\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.8128205128205128,\n \"acc_stderr\": 0.01977660108655004,\n \ \ \"acc_norm\": 0.8128205128205128,\n \"acc_norm_stderr\": 0.01977660108655004\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.4740740740740741,\n \"acc_stderr\": 0.03044452852881074,\n \ \ \"acc_norm\": 0.4740740740740741,\n \"acc_norm_stderr\": 0.03044452852881074\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.8445378151260504,\n \"acc_stderr\": 0.023536818625398897,\n\ \ \"acc_norm\": 0.8445378151260504,\n \"acc_norm_stderr\": 0.023536818625398897\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.48344370860927155,\n \"acc_stderr\": 0.040802441856289715,\n \"\ acc_norm\": 0.48344370860927155,\n \"acc_norm_stderr\": 0.040802441856289715\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.9119266055045872,\n \"acc_stderr\": 0.012150743719481655,\n \"\ acc_norm\": 0.9119266055045872,\n \"acc_norm_stderr\": 0.012150743719481655\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.6574074074074074,\n \"acc_stderr\": 0.03236585252602158,\n \"\ acc_norm\": 0.6574074074074074,\n \"acc_norm_stderr\": 0.03236585252602158\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.9166666666666666,\n \"acc_stderr\": 0.019398452135813905,\n \"\ acc_norm\": 0.9166666666666666,\n \"acc_norm_stderr\": 0.019398452135813905\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.8860759493670886,\n \"acc_stderr\": 0.020681745135884565,\n \ \ \"acc_norm\": 0.8860759493670886,\n \"acc_norm_stderr\": 0.020681745135884565\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.7847533632286996,\n\ \ \"acc_stderr\": 0.027584066602208274,\n \"acc_norm\": 0.7847533632286996,\n\ \ \"acc_norm_stderr\": 0.027584066602208274\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.8854961832061069,\n \"acc_stderr\": 0.027927473753597446,\n\ \ \"acc_norm\": 0.8854961832061069,\n \"acc_norm_stderr\": 0.027927473753597446\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.8842975206611571,\n \"acc_stderr\": 0.029199802455622793,\n \"\ acc_norm\": 0.8842975206611571,\n \"acc_norm_stderr\": 0.029199802455622793\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.8888888888888888,\n\ \ \"acc_stderr\": 0.030381596756651672,\n \"acc_norm\": 0.8888888888888888,\n\ \ \"acc_norm_stderr\": 0.030381596756651672\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.8588957055214724,\n \"acc_stderr\": 0.027351605518389752,\n\ \ \"acc_norm\": 0.8588957055214724,\n \"acc_norm_stderr\": 0.027351605518389752\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5714285714285714,\n\ \ \"acc_stderr\": 0.04697113923010213,\n \"acc_norm\": 0.5714285714285714,\n\ \ \"acc_norm_stderr\": 0.04697113923010213\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.8932038834951457,\n \"acc_stderr\": 0.030581088928331366,\n\ \ \"acc_norm\": 0.8932038834951457,\n \"acc_norm_stderr\": 0.030581088928331366\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.9487179487179487,\n\ \ \"acc_stderr\": 0.014450181176872733,\n \"acc_norm\": 0.9487179487179487,\n\ \ \"acc_norm_stderr\": 0.014450181176872733\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.87,\n \"acc_stderr\": 0.033799766898963086,\n \ \ \"acc_norm\": 0.87,\n \"acc_norm_stderr\": 0.033799766898963086\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.9042145593869731,\n\ \ \"acc_stderr\": 0.010524031079055834,\n \"acc_norm\": 0.9042145593869731,\n\ \ \"acc_norm_stderr\": 0.010524031079055834\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.791907514450867,\n \"acc_stderr\": 0.021855255263421795,\n\ \ \"acc_norm\": 0.791907514450867,\n \"acc_norm_stderr\": 0.021855255263421795\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.6782122905027933,\n\ \ \"acc_stderr\": 0.015624236160792584,\n 
\"acc_norm\": 0.6782122905027933,\n\ \ \"acc_norm_stderr\": 0.015624236160792584\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.8398692810457516,\n \"acc_stderr\": 0.020998740930362306,\n\ \ \"acc_norm\": 0.8398692810457516,\n \"acc_norm_stderr\": 0.020998740930362306\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7942122186495176,\n\ \ \"acc_stderr\": 0.022961339906764244,\n \"acc_norm\": 0.7942122186495176,\n\ \ \"acc_norm_stderr\": 0.022961339906764244\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.8395061728395061,\n \"acc_stderr\": 0.02042395535477803,\n\ \ \"acc_norm\": 0.8395061728395061,\n \"acc_norm_stderr\": 0.02042395535477803\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.6028368794326241,\n \"acc_stderr\": 0.0291898056735871,\n \ \ \"acc_norm\": 0.6028368794326241,\n \"acc_norm_stderr\": 0.0291898056735871\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.5925684485006519,\n\ \ \"acc_stderr\": 0.01254947371421222,\n \"acc_norm\": 0.5925684485006519,\n\ \ \"acc_norm_stderr\": 0.01254947371421222\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.8235294117647058,\n \"acc_stderr\": 0.023157468308559352,\n\ \ \"acc_norm\": 0.8235294117647058,\n \"acc_norm_stderr\": 0.023157468308559352\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.7924836601307189,\n \"acc_stderr\": 0.01640592427010324,\n \ \ \"acc_norm\": 0.7924836601307189,\n \"acc_norm_stderr\": 0.01640592427010324\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.7272727272727273,\n\ \ \"acc_stderr\": 0.04265792110940589,\n \"acc_norm\": 0.7272727272727273,\n\ \ \"acc_norm_stderr\": 0.04265792110940589\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.8122448979591836,\n \"acc_stderr\": 0.025000256039546198,\n\ \ \"acc_norm\": 0.8122448979591836,\n \"acc_norm_stderr\": 0.025000256039546198\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8805970149253731,\n\ \ \"acc_stderr\": 0.02292879327721974,\n \"acc_norm\": 0.8805970149253731,\n\ \ \"acc_norm_stderr\": 0.02292879327721974\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.92,\n \"acc_stderr\": 0.0272659924344291,\n \ \ \"acc_norm\": 0.92,\n \"acc_norm_stderr\": 0.0272659924344291\n },\n\ \ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5662650602409639,\n\ \ \"acc_stderr\": 0.03858158940685516,\n \"acc_norm\": 0.5662650602409639,\n\ \ \"acc_norm_stderr\": 0.03858158940685516\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.9005847953216374,\n \"acc_stderr\": 0.02294902557935504,\n\ \ \"acc_norm\": 0.9005847953216374,\n \"acc_norm_stderr\": 0.02294902557935504\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.412484700122399,\n\ \ \"mc1_stderr\": 0.01723329939957122,\n \"mc2\": 0.5935958667241532,\n\ \ \"mc2_stderr\": 0.015329701989808613\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.8003157063930545,\n \"acc_stderr\": 0.011235328382625845\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.19939347990902198,\n \ \ \"acc_stderr\": 0.011005438029475656\n }\n}\n```" repo_url: https://huggingface.co/AA051610/Q leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|arc:challenge|25_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-20T09-11-03.066548.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|gsm8k|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_20T09_11_03.066548 path: - 
'**/details_harness|hellaswag|10_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-20T09-11-03.066548.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-20T09-11-03.066548.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-20T09-11-03.066548.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-20T09-11-03.066548.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-20T09-11-03.066548.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-20T09-11-03.066548.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-20T09-11-03.066548.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-management|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-20T09-11-03.066548.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|truthfulqa:mc|0_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-20T09-11-03.066548.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_20T09_11_03.066548 path: - '**/details_harness|winogrande|5_2024-01-20T09-11-03.066548.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-20T09-11-03.066548.parquet' - config_name: results data_files: - split: 
2024_01_20T09_11_03.066548 path: - results_2024-01-20T09-11-03.066548.parquet - split: latest path: - results_2024-01-20T09-11-03.066548.parquet
---

# Dataset Card for Evaluation run of AA051610/Q

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [AA051610/Q](https://huggingface.co/AA051610/Q) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# The configurations above define a timestamp split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_AA051610__Q",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-01-20T09:11:03.066548](https://huggingface.co/datasets/open-llm-leaderboard/details_AA051610__Q/blob/main/results_2024-01-20T09-11-03.066548.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.7399080967661088, "acc_stderr": 0.02875112016204294, "acc_norm": 0.7517342136662964, "acc_norm_stderr": 0.02932406494014928, "mc1": 0.412484700122399, "mc1_stderr": 0.01723329939957122, "mc2": 0.5935958667241532, "mc2_stderr": 0.015329701989808613 }, "harness|arc:challenge|25": { "acc": 0.6450511945392492, "acc_stderr": 0.013983036904094089, "acc_norm": 0.6697952218430034, "acc_norm_stderr": 0.013743085603760426 }, "harness|hellaswag|10": { "acc": 0.6638119896434973, "acc_stderr": 0.004714386376337134, "acc_norm": 0.8567018522206732, "acc_norm_stderr": 0.0034966056729606905 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.43, "acc_stderr": 0.049756985195624284, "acc_norm": 0.43, "acc_norm_stderr": 0.049756985195624284 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.762962962962963, "acc_stderr": 0.03673731683969506, "acc_norm": 0.762962962962963, "acc_norm_stderr": 0.03673731683969506 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.8486842105263158, "acc_stderr": 0.02916263159684399, "acc_norm": 0.8486842105263158, "acc_norm_stderr": 0.02916263159684399 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.79, "acc_stderr": 0.040936018074033256, "acc_norm": 0.79, "acc_norm_stderr": 0.040936018074033256 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.8037735849056604, "acc_stderr": 0.024442388131100824, "acc_norm": 0.8037735849056604, "acc_norm_stderr": 0.024442388131100824 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.8472222222222222, "acc_stderr": 0.03008574324856567, "acc_norm": 0.8472222222222222, "acc_norm_stderr": 0.03008574324856567 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.47, "acc_stderr": 0.05016135580465919, "acc_norm": 0.47, "acc_norm_stderr": 0.05016135580465919 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.59, "acc_stderr": 0.04943110704237102, "acc_norm": 0.59, "acc_norm_stderr": 
0.04943110704237102 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.4, "acc_stderr": 0.049236596391733084, "acc_norm": 0.4, "acc_norm_stderr": 0.049236596391733084 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6878612716763006, "acc_stderr": 0.035331333893236574, "acc_norm": 0.6878612716763006, "acc_norm_stderr": 0.035331333893236574 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.5392156862745098, "acc_stderr": 0.04959859966384181, "acc_norm": 0.5392156862745098, "acc_norm_stderr": 0.04959859966384181 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.77, "acc_stderr": 0.04229525846816505, "acc_norm": 0.77, "acc_norm_stderr": 0.04229525846816505 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.774468085106383, "acc_stderr": 0.027321078417387536, "acc_norm": 0.774468085106383, "acc_norm_stderr": 0.027321078417387536 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.5701754385964912, "acc_stderr": 0.04657047260594964, "acc_norm": 0.5701754385964912, "acc_norm_stderr": 0.04657047260594964 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.7310344827586207, "acc_stderr": 0.036951833116502325, "acc_norm": 0.7310344827586207, "acc_norm_stderr": 0.036951833116502325 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.7513227513227513, "acc_stderr": 0.02226181769240016, "acc_norm": 0.7513227513227513, "acc_norm_stderr": 0.02226181769240016 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.5238095238095238, "acc_stderr": 0.04467062628403273, "acc_norm": 0.5238095238095238, "acc_norm_stderr": 0.04467062628403273 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.52, "acc_stderr": 0.050211673156867795, "acc_norm": 0.52, "acc_norm_stderr": 0.050211673156867795 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.8870967741935484, "acc_stderr": 0.01800360332586363, "acc_norm": 0.8870967741935484, "acc_norm_stderr": 0.01800360332586363 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5960591133004927, "acc_stderr": 0.03452453903822033, "acc_norm": 0.5960591133004927, "acc_norm_stderr": 0.03452453903822033 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.77, "acc_stderr": 0.042295258468165044, "acc_norm": 0.77, "acc_norm_stderr": 0.042295258468165044 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.8606060606060606, "acc_stderr": 0.027045948825865414, "acc_norm": 0.8606060606060606, "acc_norm_stderr": 0.027045948825865414 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.9040404040404041, "acc_stderr": 0.020984808610047933, "acc_norm": 0.9040404040404041, "acc_norm_stderr": 0.020984808610047933 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.9585492227979274, "acc_stderr": 0.014385432857476442, "acc_norm": 0.9585492227979274, "acc_norm_stderr": 0.014385432857476442 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.8128205128205128, "acc_stderr": 0.01977660108655004, "acc_norm": 0.8128205128205128, "acc_norm_stderr": 0.01977660108655004 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.4740740740740741, "acc_stderr": 0.03044452852881074, "acc_norm": 0.4740740740740741, "acc_norm_stderr": 0.03044452852881074 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.8445378151260504, "acc_stderr": 0.023536818625398897, "acc_norm": 0.8445378151260504, "acc_norm_stderr": 0.023536818625398897 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.48344370860927155, "acc_stderr": 0.040802441856289715, "acc_norm": 0.48344370860927155, "acc_norm_stderr": 0.040802441856289715 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.9119266055045872, "acc_stderr": 0.012150743719481655, "acc_norm": 0.9119266055045872, "acc_norm_stderr": 0.012150743719481655 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.6574074074074074, "acc_stderr": 
0.03236585252602158, "acc_norm": 0.6574074074074074, "acc_norm_stderr": 0.03236585252602158 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.9166666666666666, "acc_stderr": 0.019398452135813905, "acc_norm": 0.9166666666666666, "acc_norm_stderr": 0.019398452135813905 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.8860759493670886, "acc_stderr": 0.020681745135884565, "acc_norm": 0.8860759493670886, "acc_norm_stderr": 0.020681745135884565 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.7847533632286996, "acc_stderr": 0.027584066602208274, "acc_norm": 0.7847533632286996, "acc_norm_stderr": 0.027584066602208274 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.8854961832061069, "acc_stderr": 0.027927473753597446, "acc_norm": 0.8854961832061069, "acc_norm_stderr": 0.027927473753597446 }, "harness|hendrycksTest-international_law|5": { "acc": 0.8842975206611571, "acc_stderr": 0.029199802455622793, "acc_norm": 0.8842975206611571, "acc_norm_stderr": 0.029199802455622793 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.8888888888888888, "acc_stderr": 0.030381596756651672, "acc_norm": 0.8888888888888888, "acc_norm_stderr": 0.030381596756651672 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.8588957055214724, "acc_stderr": 0.027351605518389752, "acc_norm": 0.8588957055214724, "acc_norm_stderr": 0.027351605518389752 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.5714285714285714, "acc_stderr": 0.04697113923010213, "acc_norm": 0.5714285714285714, "acc_norm_stderr": 0.04697113923010213 }, "harness|hendrycksTest-management|5": { "acc": 0.8932038834951457, "acc_stderr": 0.030581088928331366, "acc_norm": 0.8932038834951457, "acc_norm_stderr": 0.030581088928331366 }, "harness|hendrycksTest-marketing|5": { "acc": 0.9487179487179487, "acc_stderr": 0.014450181176872733, "acc_norm": 0.9487179487179487, "acc_norm_stderr": 0.014450181176872733 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.87, 
"acc_stderr": 0.033799766898963086, "acc_norm": 0.87, "acc_norm_stderr": 0.033799766898963086 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.9042145593869731, "acc_stderr": 0.010524031079055834, "acc_norm": 0.9042145593869731, "acc_norm_stderr": 0.010524031079055834 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.791907514450867, "acc_stderr": 0.021855255263421795, "acc_norm": 0.791907514450867, "acc_norm_stderr": 0.021855255263421795 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.6782122905027933, "acc_stderr": 0.015624236160792584, "acc_norm": 0.6782122905027933, "acc_norm_stderr": 0.015624236160792584 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.8398692810457516, "acc_stderr": 0.020998740930362306, "acc_norm": 0.8398692810457516, "acc_norm_stderr": 0.020998740930362306 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7942122186495176, "acc_stderr": 0.022961339906764244, "acc_norm": 0.7942122186495176, "acc_norm_stderr": 0.022961339906764244 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.8395061728395061, "acc_stderr": 0.02042395535477803, "acc_norm": 0.8395061728395061, "acc_norm_stderr": 0.02042395535477803 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.6028368794326241, "acc_stderr": 0.0291898056735871, "acc_norm": 0.6028368794326241, "acc_norm_stderr": 0.0291898056735871 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.5925684485006519, "acc_stderr": 0.01254947371421222, "acc_norm": 0.5925684485006519, "acc_norm_stderr": 0.01254947371421222 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.8235294117647058, "acc_stderr": 0.023157468308559352, "acc_norm": 0.8235294117647058, "acc_norm_stderr": 0.023157468308559352 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.7924836601307189, "acc_stderr": 0.01640592427010324, "acc_norm": 0.7924836601307189, "acc_norm_stderr": 0.01640592427010324 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.7272727272727273, 
"acc_stderr": 0.04265792110940589, "acc_norm": 0.7272727272727273, "acc_norm_stderr": 0.04265792110940589 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.8122448979591836, "acc_stderr": 0.025000256039546198, "acc_norm": 0.8122448979591836, "acc_norm_stderr": 0.025000256039546198 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8805970149253731, "acc_stderr": 0.02292879327721974, "acc_norm": 0.8805970149253731, "acc_norm_stderr": 0.02292879327721974 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.92, "acc_stderr": 0.0272659924344291, "acc_norm": 0.92, "acc_norm_stderr": 0.0272659924344291 }, "harness|hendrycksTest-virology|5": { "acc": 0.5662650602409639, "acc_stderr": 0.03858158940685516, "acc_norm": 0.5662650602409639, "acc_norm_stderr": 0.03858158940685516 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.9005847953216374, "acc_stderr": 0.02294902557935504, "acc_norm": 0.9005847953216374, "acc_norm_stderr": 0.02294902557935504 }, "harness|truthfulqa:mc|0": { "mc1": 0.412484700122399, "mc1_stderr": 0.01723329939957122, "mc2": 0.5935958667241532, "mc2_stderr": 0.015329701989808613 }, "harness|winogrande|5": { "acc": 0.8003157063930545, "acc_stderr": 0.011235328382625845 }, "harness|gsm8k|5": { "acc": 0.19939347990902198, "acc_stderr": 0.011005438029475656 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. 
-->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
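Provided by:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset creation

Dataset structure

  • The dataset contains 63 configurations, each corresponding to one evaluation task.
  • The dataset was created from 1 run. Each run appears as a separate split in every configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.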

Data loading example

    from datasets import load_dataset
    data = load_dataset("open-llm-leaderboard/details_AA051610__Q", "harness_winogrande_5", split="latest")

Latest results

  • The latest results come from the run 2024-01-20T09:11:03.066548 and include the accuracy and standard error for multiple tasks.
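To make the accuracy and standard-error figures concrete, the following minimal sketch turns one `acc` / `acc_stderr` pair from the latest results into an approximate 95% interval and backs out the implied number of test items. It assumes a normal approximation and that the standard error equals sqrt(p(1-p)/(n-1)), i.e. the sample standard deviation of 0/1 scores divided by sqrt(n). Both are assumptions about how the figures were computed, not statements from the card, and the function names are illustrative only.

```python
import math


def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval acc +/- z * stderr, clipped to [0, 1]."""
    return max(0.0, acc - z * stderr), min(1.0, acc + z * stderr)


def implied_sample_size(acc: float, stderr: float) -> int:
    """Back out n from the assumed formula stderr = sqrt(p * (1 - p) / (n - 1))."""
    return round(acc * (1.0 - acc) / stderr**2 + 1)


# Values copied from the "harness|winogrande|5" entry of the latest results.
low, high = confidence_interval(0.8003157063930545, 0.011235328382625845)
print(f"winogrande acc, approximate 95% interval: [{low:.4f}, {high:.4f}]")

# Values copied from the "harness|hendrycksTest-abstract_algebra|5" entry.
n_items = implied_sample_size(0.43, 0.049756985195624284)
print(f"implied number of abstract_algebra items: {n_items}")
```

Small subjects have wide intervals, so differences of a few points between models on a single MMLU subject should be read with care.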

Configuration details

  • harness_arc_challenge_25

    • Splits: 2024_01_20T09_11_03.066548, latest
    • Path: **/details_harness|arc:challenge|25_2024-01-20T09-11-03.066548.parquet
  • harness_gsm8k_5

    • Splits: 2024_01_20T09_11_03.066548, latest
    • Path: **/details_harness|gsm8k|5_2024-01-20T09-11-03.066548.parquet
  • harness_hellaswag_10

    • Splits: 2024_01_20T09_11_03.066548, latest
    • Path: **/details_harness|hellaswag|10_2024-01-20T09-11-03.066548.parquet
  • harness_hendrycksTest_5

    • Splits: 2024_01_20T09_11_03.066548
    • Paths: detailed paths for multiple tasks
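The configuration names, split names, and parquet paths listed above follow a visible naming pattern. The sketch below reproduces that pattern from a results key and a run timestamp. The helper names are hypothetical, and the rules are inferred from the entries on this page rather than taken from any documented specification.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to its config name."""
    # In the listed config names, every '|', ':' and '-' of the key appears as '_'.
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp such as '2024-01-20T09:11:03.066548' to its split name."""
    return re.sub(r"[-:]", "_", run_timestamp)


def parquet_pattern(task_key: str, run_timestamp: str) -> str:
    """Build the glob pattern under which a task's details file is listed."""
    # File names keep the dashes of the date and replace only the colons.
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


run = "2024-01-20T09:11:03.066548"
print(config_name("harness|arc:challenge|25"))
print(split_name(run))
print(parquet_pattern("harness|gsm8k|5", run))
```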
Collected summary
Dataset introduction
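Construction
In the field of large language model evaluation, the open-llm-leaderboard-old/details_AA051610__Q dataset is built by an automated pipeline. It originates from the evaluation run of the model AA051610/Q on the Open LLM Leaderboard, which automatically captured 63 configurations, each corresponding to a specific evaluation task. The timestamp of the evaluation run is used as the split name, which keeps each run separate, while the "latest" split always points to the most recent evaluation results, so the dataset can track how the model's measured performance changes across runs.
Features
The core feature of this dataset is its structured multi-task evaluation framework, which covers areas ranging from commonsense reasoning to specialized subject knowledge. Besides general benchmarks such as the ARC Challenge and HellaSwag, it includes the MMLU (Hendrycks test) tasks spanning 57 subjects such as mathematics, physics, law, and medicine, giving a fine-grained profile of the model's performance. Each task configuration stores detailed evaluation metrics, including accuracy and its standard error, which supports comparison across models and tracking over time.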
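As an illustration of how the per-subject MMLU figures can be combined, here is a small sketch that averages `acc` over every `hendrycksTest` entry in a results dictionary. The three subject values are copied from the latest results; the unweighted (macro) mean and the function name are assumptions made for this example, not necessarily how the leaderboard aggregates.

```python
from statistics import mean

# A short excerpt of the latest results; the full file has one entry per subject.
results = {
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.43},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.762962962962963},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.8486842105263158},
    "harness|winogrande|5": {"acc": 0.8003157063930545},
}


def mmlu_macro_accuracy(results: dict) -> float:
    """Unweighted mean of 'acc' over every hendrycksTest subject present."""
    scores = [
        metrics["acc"]
        for key, metrics in results.items()
        if key.startswith("harness|hendrycksTest-")
    ]
    if not scores:
        raise ValueError("no hendrycksTest entries found")
    return mean(scores)


print(f"MMLU macro accuracy over the excerpt: {mmlu_macro_accuracy(results):.4f}")
```

A macro mean weights every subject equally regardless of how many questions it contains; weighting by item count would give a different number.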
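Usage
Researchers can load this dataset with the Hugging Face datasets library. Calling load_dataset with the dataset name, a task configuration (such as "harness_winogrande_5"), and a split returns the detailed evaluation records for that task. The split can be a run timestamp (such as "2024_01_20T09_11_03.066548") to access a specific historical run, or "latest" to get the most recent results. This makes it possible to analyze model performance over time, compare results across tasks, and study the validity of the evaluation methods.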
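The loading procedure described above can be sketched as a small wrapper that picks either a timestamp split or the "latest" split. `resolve_split` and `load_details` are hypothetical helper names; `load_dataset` is the real function from the `datasets` library. The default repository ID is the one used in the dataset card, and since this page lists the repository under open-llm-leaderboard-old, it may need to be adjusted.

```python
def resolve_split(run_timestamp=None):
    """Return 'latest', or the split name derived from a run timestamp."""
    if run_timestamp is None:
        return "latest"
    # Split names use '_' where the run timestamp has '-' or ':'.
    return run_timestamp.replace("-", "_").replace(":", "_")


def load_details(
    config,
    run_timestamp=None,
    repo_id="open-llm-leaderboard/details_AA051610__Q",  # assumption: may have moved
):
    """Load one task configuration for a given run (needs network access)."""
    # Imported lazily so the naming helper works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config, split=resolve_split(run_timestamp))


# Example call, commented out because it downloads data:
# details = load_details("harness_winogrande_5")
print(resolve_split())
print(resolve_split("2024-01-20T09:11:03.066548"))
```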
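Background and challenges
Background overview
As large language models (LLMs) advance rapidly, evaluating their overall capabilities has become a key step in driving progress. The Open LLM Leaderboard, launched by the Hugging Face team in 2023, aims to provide a standardized and transparent platform for evaluating model performance. It combines several established benchmarks, including the ARC Challenge, HellaSwag, MMLU, and TruthfulQA, to assess a model's breadth of knowledge, reasoning ability, and factual accuracy in a systematic way. The dataset 'open-llm-leaderboard-old/details_AA051610__Q' is a product of this evaluation system. It automatically records the detailed evaluation results of the model AA051610/Q from 20 January 2024 across 63 task configurations, giving the research community a fine-grained view of model performance and supporting fair comparison and iterative improvement of models.
Current challenges
The core challenge this dataset addresses is the complexity and multidimensionality of evaluating large language models. A single traditional metric cannot fully capture a model's performance across diverse tasks. By integrating several heterogeneous benchmarks, the Open LLM Leaderboard aims to overcome fragmented evaluation, but this also raises the technical problems of normalizing metrics across tasks and keeping results comparable. In building the dataset, the main challenges are the reliability of the automated evaluation pipeline and the maintenance of data consistency. Because the evaluation involves many tasks and potentially multiple runs, recording each result precisely, mapping timestamp splits correctly, and keeping the latest results in sync all require robust engineering. Handling differences in data format across evaluation tasks and tolerating errors are also key to ensuring dataset quality.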
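Common scenarios
Classic usage scenarios
In the field of large language model evaluation, this dataset is the product of an Open LLM Leaderboard evaluation run, and its classic use is as a detailed performance benchmark for the model AA051610/Q. With 63 task configurations covering the ARC Challenge, HellaSwag, MMLU (HendrycksTest), TruthfulQA, and others, the dataset gives researchers a quantitative picture of the model's commonsense reasoning, language understanding, specialized knowledge, and truthfulness. This structured evaluation framework makes comparison across models possible and gives the academic community a transparent and reproducible evaluation standard.
Derived related work
Work derived from this dataset falls mainly into two areas: evaluation methodology and analysis of model capabilities. On the one hand, researchers have built on its multi-task evaluation structure to develop finer-grained performance diagnostics and bias detection frameworks, and to investigate model failure modes. On the other hand, the dataset is often cited as a baseline for model performance, supporting the validation of later optimization techniques such as instruction fine-tuning, chain-of-thought prompting, and model merging. Together, this work has deepened the understanding of the capability limits and generalization behavior of large language models and has advanced the evaluation ecosystem.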
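Recent research on the dataset
Latest research directions
In the field of large language model evaluation, the open-llm-leaderboard datasets serve as an important platform for standardized testing of model performance, and recent research focuses on multidimensional capability assessment and more fine-grained benchmarking. As models grow in scale and architectures evolve, the research frontier is shifting from general task evaluation toward in-depth assessment of specialized, fine-grained knowledge areas, such as complex tasks involving mathematical reasoning, scientific common sense, and ethical judgment. A current focus is the open-source community's strong demand for model transparency and comparability, which is pushing evaluation frameworks toward being dynamic and reproducible. Continued updates to the dataset provide a key reference for model iteration and encourage academia and industry to work together on model optimization and alignment research, which matters for making AI technology more reliable and trustworthy.
The content above was collected and summarized by 遇见数据集.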