
open-llm-leaderboard-old/details_CohereForAI__aya-23-8B

Source: Hugging Face. Updated 2024-05-25. Indexed 2024-06-26.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_CohereForAI__aya-23-8B
Resource description:
--- pretty_name: Evaluation run of CohereForAI/aya-23-8B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [CohereForAI/aya-23-8B](https://huggingface.co/CohereForAI/aya-23-8B) on the [Open\ \ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split is always pointing to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_CohereForAI__aya-23-8B\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-05-25T19:35:27.229141](https://huggingface.co/datasets/open-llm-leaderboard/details_CohereForAI__aya-23-8B/blob/main/results_2024-05-25T19-35-27.229141.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.5630484989107499,\n\ \ \"acc_stderr\": 0.033643561998378806,\n \"acc_norm\": 0.566585264377983,\n\ \ \"acc_norm_stderr\": 0.03432638891884135,\n \"mc1\": 0.2876376988984088,\n\ \ \"mc1_stderr\": 0.01584631510139481,\n \"mc2\": 0.4293654330376981,\n\ \ \"mc2_stderr\": 0.014566008003477715\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.4931740614334471,\n \"acc_stderr\": 0.014610029151379813,\n\ \ \"acc_norm\": 0.5349829351535836,\n \"acc_norm_stderr\": 0.01457558392201967\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.5766779525990838,\n\ \ \"acc_stderr\": 0.004930757390897348,\n \"acc_norm\": 0.7803226448914559,\n\ \ \"acc_norm_stderr\": 0.00413181879771388\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.27,\n \"acc_stderr\": 0.04461960433384741,\n \ \ \"acc_norm\": 0.27,\n \"acc_norm_stderr\": 0.04461960433384741\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.4740740740740741,\n\ \ \"acc_stderr\": 0.04313531696750574,\n \"acc_norm\": 0.4740740740740741,\n\ \ \"acc_norm_stderr\": 0.04313531696750574\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.5986842105263158,\n \"acc_stderr\": 0.039889037033362836,\n\ \ \"acc_norm\": 0.5986842105263158,\n \"acc_norm_stderr\": 0.039889037033362836\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.49,\n\ \ \"acc_stderr\": 0.05024183937956912,\n \"acc_norm\": 0.49,\n \ \ \"acc_norm_stderr\": 0.05024183937956912\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.5622641509433962,\n \"acc_stderr\": 0.030533338430467516,\n\ \ \"acc_norm\": 0.5622641509433962,\n \"acc_norm_stderr\": 0.030533338430467516\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.6319444444444444,\n\ \ \"acc_stderr\": 0.04032999053960718,\n \"acc_norm\": 0.6319444444444444,\n\ \ \"acc_norm_stderr\": 0.04032999053960718\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.41,\n \"acc_stderr\": 0.049431107042371025,\n \ \ \"acc_norm\": 0.41,\n \"acc_norm_stderr\": 0.049431107042371025\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"\ acc\": 0.5,\n \"acc_stderr\": 0.050251890762960605,\n \"acc_norm\"\ : 0.5,\n \"acc_norm_stderr\": 0.050251890762960605\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.35,\n \"acc_stderr\": 0.04793724854411019,\n \ \ \"acc_norm\": 0.35,\n \"acc_norm_stderr\": 0.04793724854411019\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.5549132947976878,\n\ \ \"acc_stderr\": 0.03789401760283647,\n \"acc_norm\": 0.5549132947976878,\n\ \ \"acc_norm_stderr\": 0.03789401760283647\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.2549019607843137,\n \"acc_stderr\": 0.04336432707993177,\n\ \ \"acc_norm\": 0.2549019607843137,\n \"acc_norm_stderr\": 0.04336432707993177\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.71,\n \"acc_stderr\": 0.045604802157206845,\n \"acc_norm\": 0.71,\n\ \ \"acc_norm_stderr\": 0.045604802157206845\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.46808510638297873,\n \"acc_stderr\": 0.03261936918467382,\n\ \ \"acc_norm\": 0.46808510638297873,\n \"acc_norm_stderr\": 0.03261936918467382\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.3508771929824561,\n\ \ \"acc_stderr\": 0.044895393502707,\n \"acc_norm\": 0.3508771929824561,\n\ \ \"acc_norm_stderr\": 0.044895393502707\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.4896551724137931,\n \"acc_stderr\": 0.04165774775728763,\n\ \ \"acc_norm\": 0.4896551724137931,\n \"acc_norm_stderr\": 0.04165774775728763\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.36243386243386244,\n \"acc_stderr\": 0.02475747390275206,\n \"\ acc_norm\": 0.36243386243386244,\n \"acc_norm_stderr\": 
0.02475747390275206\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.3333333333333333,\n\ \ \"acc_stderr\": 0.04216370213557835,\n \"acc_norm\": 0.3333333333333333,\n\ \ \"acc_norm_stderr\": 0.04216370213557835\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.4,\n \"acc_stderr\": 0.049236596391733084,\n \ \ \"acc_norm\": 0.4,\n \"acc_norm_stderr\": 0.049236596391733084\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.6516129032258065,\n\ \ \"acc_stderr\": 0.02710482632810094,\n \"acc_norm\": 0.6516129032258065,\n\ \ \"acc_norm_stderr\": 0.02710482632810094\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.4187192118226601,\n \"acc_stderr\": 0.034711928605184676,\n\ \ \"acc_norm\": 0.4187192118226601,\n \"acc_norm_stderr\": 0.034711928605184676\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.57,\n \"acc_stderr\": 0.04975698519562428,\n \"acc_norm\"\ : 0.57,\n \"acc_norm_stderr\": 0.04975698519562428\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.6848484848484848,\n \"acc_stderr\": 0.0362773057502241,\n\ \ \"acc_norm\": 0.6848484848484848,\n \"acc_norm_stderr\": 0.0362773057502241\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7171717171717171,\n \"acc_stderr\": 0.032087795587867514,\n \"\ acc_norm\": 0.7171717171717171,\n \"acc_norm_stderr\": 0.032087795587867514\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.7927461139896373,\n \"acc_stderr\": 0.02925282329180363,\n\ \ \"acc_norm\": 0.7927461139896373,\n \"acc_norm_stderr\": 0.02925282329180363\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.4948717948717949,\n \"acc_stderr\": 0.025349672906838653,\n\ \ \"acc_norm\": 0.4948717948717949,\n \"acc_norm_stderr\": 0.025349672906838653\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n 
\"\ acc\": 0.3111111111111111,\n \"acc_stderr\": 0.02822644674968351,\n \ \ \"acc_norm\": 0.3111111111111111,\n \"acc_norm_stderr\": 0.02822644674968351\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.5,\n \"acc_stderr\": 0.032478490123081544,\n \"acc_norm\"\ : 0.5,\n \"acc_norm_stderr\": 0.032478490123081544\n },\n \"harness|hendrycksTest-high_school_physics|5\"\ : {\n \"acc\": 0.3576158940397351,\n \"acc_stderr\": 0.03913453431177258,\n\ \ \"acc_norm\": 0.3576158940397351,\n \"acc_norm_stderr\": 0.03913453431177258\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.7577981651376147,\n \"acc_stderr\": 0.01836817630659862,\n \"\ acc_norm\": 0.7577981651376147,\n \"acc_norm_stderr\": 0.01836817630659862\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.4212962962962963,\n \"acc_stderr\": 0.03367462138896079,\n \"\ acc_norm\": 0.4212962962962963,\n \"acc_norm_stderr\": 0.03367462138896079\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.75,\n \"acc_stderr\": 0.03039153369274154,\n \"acc_norm\": 0.75,\n\ \ \"acc_norm_stderr\": 0.03039153369274154\n },\n \"harness|hendrycksTest-high_school_world_history|5\"\ : {\n \"acc\": 0.7805907172995781,\n \"acc_stderr\": 0.026939106581553945,\n\ \ \"acc_norm\": 0.7805907172995781,\n \"acc_norm_stderr\": 0.026939106581553945\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6771300448430493,\n\ \ \"acc_stderr\": 0.03138147637575499,\n \"acc_norm\": 0.6771300448430493,\n\ \ \"acc_norm_stderr\": 0.03138147637575499\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.6717557251908397,\n \"acc_stderr\": 0.04118438565806299,\n\ \ \"acc_norm\": 0.6717557251908397,\n \"acc_norm_stderr\": 0.04118438565806299\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7603305785123967,\n \"acc_stderr\": 0.03896878985070416,\n \"\ acc_norm\": 0.7603305785123967,\n 
\"acc_norm_stderr\": 0.03896878985070416\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7129629629629629,\n\ \ \"acc_stderr\": 0.043733130409147614,\n \"acc_norm\": 0.7129629629629629,\n\ \ \"acc_norm_stderr\": 0.043733130409147614\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7116564417177914,\n \"acc_stderr\": 0.035590395316173425,\n\ \ \"acc_norm\": 0.7116564417177914,\n \"acc_norm_stderr\": 0.035590395316173425\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.44642857142857145,\n\ \ \"acc_stderr\": 0.04718471485219588,\n \"acc_norm\": 0.44642857142857145,\n\ \ \"acc_norm_stderr\": 0.04718471485219588\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.6796116504854369,\n \"acc_stderr\": 0.04620284082280042,\n\ \ \"acc_norm\": 0.6796116504854369,\n \"acc_norm_stderr\": 0.04620284082280042\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8418803418803419,\n\ \ \"acc_stderr\": 0.0239023255495604,\n \"acc_norm\": 0.8418803418803419,\n\ \ \"acc_norm_stderr\": 0.0239023255495604\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.68,\n \"acc_stderr\": 0.046882617226215034,\n \ \ \"acc_norm\": 0.68,\n \"acc_norm_stderr\": 0.046882617226215034\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.7215836526181354,\n\ \ \"acc_stderr\": 0.016028295188992476,\n \"acc_norm\": 0.7215836526181354,\n\ \ \"acc_norm_stderr\": 0.016028295188992476\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.6011560693641619,\n \"acc_stderr\": 0.026362437574546545,\n\ \ \"acc_norm\": 0.6011560693641619,\n \"acc_norm_stderr\": 0.026362437574546545\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.3027932960893855,\n\ \ \"acc_stderr\": 0.01536686038639711,\n \"acc_norm\": 0.3027932960893855,\n\ \ \"acc_norm_stderr\": 0.01536686038639711\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.630718954248366,\n 
\"acc_stderr\": 0.027634176689602663,\n\ \ \"acc_norm\": 0.630718954248366,\n \"acc_norm_stderr\": 0.027634176689602663\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.5916398713826366,\n\ \ \"acc_stderr\": 0.02791705074848462,\n \"acc_norm\": 0.5916398713826366,\n\ \ \"acc_norm_stderr\": 0.02791705074848462\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.6265432098765432,\n \"acc_stderr\": 0.026915003011380154,\n\ \ \"acc_norm\": 0.6265432098765432,\n \"acc_norm_stderr\": 0.026915003011380154\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.4219858156028369,\n \"acc_stderr\": 0.029462189233370593,\n \ \ \"acc_norm\": 0.4219858156028369,\n \"acc_norm_stderr\": 0.029462189233370593\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4322033898305085,\n\ \ \"acc_stderr\": 0.012652297777114968,\n \"acc_norm\": 0.4322033898305085,\n\ \ \"acc_norm_stderr\": 0.012652297777114968\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.5147058823529411,\n \"acc_stderr\": 0.03035969707904611,\n\ \ \"acc_norm\": 0.5147058823529411,\n \"acc_norm_stderr\": 0.03035969707904611\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.5392156862745098,\n \"acc_stderr\": 0.0201655233139079,\n \ \ \"acc_norm\": 0.5392156862745098,\n \"acc_norm_stderr\": 0.0201655233139079\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6727272727272727,\n\ \ \"acc_stderr\": 0.04494290866252089,\n \"acc_norm\": 0.6727272727272727,\n\ \ \"acc_norm_stderr\": 0.04494290866252089\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.6653061224489796,\n \"acc_stderr\": 0.03020923522624231,\n\ \ \"acc_norm\": 0.6653061224489796,\n \"acc_norm_stderr\": 0.03020923522624231\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8009950248756219,\n\ \ \"acc_stderr\": 0.028231365092758406,\n \"acc_norm\": 0.8009950248756219,\n\ \ 
\"acc_norm_stderr\": 0.028231365092758406\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.79,\n \"acc_stderr\": 0.040936018074033256,\n \ \ \"acc_norm\": 0.79,\n \"acc_norm_stderr\": 0.040936018074033256\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.45180722891566266,\n\ \ \"acc_stderr\": 0.03874371556587953,\n \"acc_norm\": 0.45180722891566266,\n\ \ \"acc_norm_stderr\": 0.03874371556587953\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.7485380116959064,\n \"acc_stderr\": 0.033275044238468436,\n\ \ \"acc_norm\": 0.7485380116959064,\n \"acc_norm_stderr\": 0.033275044238468436\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.2876376988984088,\n\ \ \"mc1_stderr\": 0.01584631510139481,\n \"mc2\": 0.4293654330376981,\n\ \ \"mc2_stderr\": 0.014566008003477715\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7277032359905288,\n \"acc_stderr\": 0.01251069799145394\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.4351781652767248,\n \ \ \"acc_stderr\": 0.013656253875470736\n }\n}\n```" repo_url: https://huggingface.co/CohereForAI/aya-23-8B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|arc:challenge|25_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-05-25T19-35-27.229141.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|gsm8k|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hellaswag|10_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-05-25T19-35-27.229141.parquet' - 
config_name: harness_hendrycksTest_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-25T19-35-27.229141.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-management|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-05-25T19-35-27.229141.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-05-25T19-35-27.229141.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-05-25T19-35-27.229141.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-management|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-05-25T19-35-27.229141.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-05-25T19-35-27.229141.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-25T19-35-27.229141.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-international_law|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-management|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-marketing|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-sociology|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-virology|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-05-25T19-35-27.229141.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|truthfulqa:mc|0_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-05-25T19-35-27.229141.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_05_25T19_35_27.229141 path: - '**/details_harness|winogrande|5_2024-05-25T19-35-27.229141.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-05-25T19-35-27.229141.parquet' - config_name: results data_files: - split: 
2024_05_25T19_35_27.229141 path: - results_2024-05-25T19-35-27.229141.parquet - split: latest path: - results_2024-05-25T19-35-27.229141.parquet
---

# Dataset Card for Evaluation run of CohereForAI/aya-23-8B

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [CohereForAI/aya-23-8B](https://huggingface.co/CohereForAI/aya-23-8B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# The configuration metadata defines two splits per config:
# the run timestamp and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_CohereForAI__aya-23-8B",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-05-25T19:35:27.229141](https://huggingface.co/datasets/open-llm-leaderboard/details_CohereForAI__aya-23-8B/blob/main/results_2024-05-25T19-35-27.229141.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.5630484989107499, "acc_stderr": 0.033643561998378806, "acc_norm": 0.566585264377983, "acc_norm_stderr": 0.03432638891884135, "mc1": 0.2876376988984088, "mc1_stderr": 0.01584631510139481, "mc2": 0.4293654330376981, "mc2_stderr": 0.014566008003477715 }, "harness|arc:challenge|25": { "acc": 0.4931740614334471, "acc_stderr": 0.014610029151379813, "acc_norm": 0.5349829351535836, "acc_norm_stderr": 0.01457558392201967 }, "harness|hellaswag|10": { "acc": 0.5766779525990838, "acc_stderr": 0.004930757390897348, "acc_norm": 0.7803226448914559, "acc_norm_stderr": 0.00413181879771388 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.27, "acc_stderr": 0.04461960433384741, "acc_norm": 0.27, "acc_norm_stderr": 0.04461960433384741 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.4740740740740741, "acc_stderr": 0.04313531696750574, "acc_norm": 0.4740740740740741, "acc_norm_stderr": 0.04313531696750574 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.5986842105263158, "acc_stderr": 0.039889037033362836, "acc_norm": 0.5986842105263158, "acc_norm_stderr": 0.039889037033362836 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.49, "acc_stderr": 0.05024183937956912, "acc_norm": 0.49, "acc_norm_stderr": 0.05024183937956912 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.5622641509433962, "acc_stderr": 0.030533338430467516, "acc_norm": 0.5622641509433962, "acc_norm_stderr": 0.030533338430467516 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.6319444444444444, "acc_stderr": 0.04032999053960718, "acc_norm": 0.6319444444444444, "acc_norm_stderr": 0.04032999053960718 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.41, "acc_stderr": 0.049431107042371025, "acc_norm": 0.41, "acc_norm_stderr": 0.049431107042371025 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.5, "acc_stderr": 0.050251890762960605, "acc_norm": 0.5, "acc_norm_stderr": 
0.050251890762960605 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.35, "acc_stderr": 0.04793724854411019, "acc_norm": 0.35, "acc_norm_stderr": 0.04793724854411019 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.5549132947976878, "acc_stderr": 0.03789401760283647, "acc_norm": 0.5549132947976878, "acc_norm_stderr": 0.03789401760283647 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.2549019607843137, "acc_stderr": 0.04336432707993177, "acc_norm": 0.2549019607843137, "acc_norm_stderr": 0.04336432707993177 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.71, "acc_stderr": 0.045604802157206845, "acc_norm": 0.71, "acc_norm_stderr": 0.045604802157206845 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.46808510638297873, "acc_stderr": 0.03261936918467382, "acc_norm": 0.46808510638297873, "acc_norm_stderr": 0.03261936918467382 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.3508771929824561, "acc_stderr": 0.044895393502707, "acc_norm": 0.3508771929824561, "acc_norm_stderr": 0.044895393502707 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.4896551724137931, "acc_stderr": 0.04165774775728763, "acc_norm": 0.4896551724137931, "acc_norm_stderr": 0.04165774775728763 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.36243386243386244, "acc_stderr": 0.02475747390275206, "acc_norm": 0.36243386243386244, "acc_norm_stderr": 0.02475747390275206 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.3333333333333333, "acc_stderr": 0.04216370213557835, "acc_norm": 0.3333333333333333, "acc_norm_stderr": 0.04216370213557835 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.4, "acc_stderr": 0.049236596391733084, "acc_norm": 0.4, "acc_norm_stderr": 0.049236596391733084 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.6516129032258065, "acc_stderr": 0.02710482632810094, "acc_norm": 0.6516129032258065, "acc_norm_stderr": 0.02710482632810094 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.4187192118226601, "acc_stderr": 0.034711928605184676, "acc_norm": 0.4187192118226601, "acc_norm_stderr": 0.034711928605184676 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.57, "acc_stderr": 0.04975698519562428, "acc_norm": 0.57, "acc_norm_stderr": 0.04975698519562428 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.6848484848484848, "acc_stderr": 0.0362773057502241, "acc_norm": 0.6848484848484848, "acc_norm_stderr": 0.0362773057502241 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7171717171717171, "acc_stderr": 0.032087795587867514, "acc_norm": 0.7171717171717171, "acc_norm_stderr": 0.032087795587867514 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.7927461139896373, "acc_stderr": 0.02925282329180363, "acc_norm": 0.7927461139896373, "acc_norm_stderr": 0.02925282329180363 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.4948717948717949, "acc_stderr": 0.025349672906838653, "acc_norm": 0.4948717948717949, "acc_norm_stderr": 0.025349672906838653 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3111111111111111, "acc_stderr": 0.02822644674968351, "acc_norm": 0.3111111111111111, "acc_norm_stderr": 0.02822644674968351 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.5, "acc_stderr": 0.032478490123081544, "acc_norm": 0.5, "acc_norm_stderr": 0.032478490123081544 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3576158940397351, "acc_stderr": 0.03913453431177258, "acc_norm": 0.3576158940397351, "acc_norm_stderr": 0.03913453431177258 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.7577981651376147, "acc_stderr": 0.01836817630659862, "acc_norm": 0.7577981651376147, "acc_norm_stderr": 0.01836817630659862 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4212962962962963, "acc_stderr": 0.03367462138896079, "acc_norm": 
0.4212962962962963, "acc_norm_stderr": 0.03367462138896079 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.75, "acc_stderr": 0.03039153369274154, "acc_norm": 0.75, "acc_norm_stderr": 0.03039153369274154 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7805907172995781, "acc_stderr": 0.026939106581553945, "acc_norm": 0.7805907172995781, "acc_norm_stderr": 0.026939106581553945 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6771300448430493, "acc_stderr": 0.03138147637575499, "acc_norm": 0.6771300448430493, "acc_norm_stderr": 0.03138147637575499 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.6717557251908397, "acc_stderr": 0.04118438565806299, "acc_norm": 0.6717557251908397, "acc_norm_stderr": 0.04118438565806299 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7603305785123967, "acc_stderr": 0.03896878985070416, "acc_norm": 0.7603305785123967, "acc_norm_stderr": 0.03896878985070416 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7129629629629629, "acc_stderr": 0.043733130409147614, "acc_norm": 0.7129629629629629, "acc_norm_stderr": 0.043733130409147614 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7116564417177914, "acc_stderr": 0.035590395316173425, "acc_norm": 0.7116564417177914, "acc_norm_stderr": 0.035590395316173425 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.44642857142857145, "acc_stderr": 0.04718471485219588, "acc_norm": 0.44642857142857145, "acc_norm_stderr": 0.04718471485219588 }, "harness|hendrycksTest-management|5": { "acc": 0.6796116504854369, "acc_stderr": 0.04620284082280042, "acc_norm": 0.6796116504854369, "acc_norm_stderr": 0.04620284082280042 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8418803418803419, "acc_stderr": 0.0239023255495604, "acc_norm": 0.8418803418803419, "acc_norm_stderr": 0.0239023255495604 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.68, "acc_stderr": 0.046882617226215034, "acc_norm": 0.68, "acc_norm_stderr": 
0.046882617226215034 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.7215836526181354, "acc_stderr": 0.016028295188992476, "acc_norm": 0.7215836526181354, "acc_norm_stderr": 0.016028295188992476 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.6011560693641619, "acc_stderr": 0.026362437574546545, "acc_norm": 0.6011560693641619, "acc_norm_stderr": 0.026362437574546545 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.3027932960893855, "acc_stderr": 0.01536686038639711, "acc_norm": 0.3027932960893855, "acc_norm_stderr": 0.01536686038639711 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.630718954248366, "acc_stderr": 0.027634176689602663, "acc_norm": 0.630718954248366, "acc_norm_stderr": 0.027634176689602663 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.5916398713826366, "acc_stderr": 0.02791705074848462, "acc_norm": 0.5916398713826366, "acc_norm_stderr": 0.02791705074848462 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.6265432098765432, "acc_stderr": 0.026915003011380154, "acc_norm": 0.6265432098765432, "acc_norm_stderr": 0.026915003011380154 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.4219858156028369, "acc_stderr": 0.029462189233370593, "acc_norm": 0.4219858156028369, "acc_norm_stderr": 0.029462189233370593 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4322033898305085, "acc_stderr": 0.012652297777114968, "acc_norm": 0.4322033898305085, "acc_norm_stderr": 0.012652297777114968 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.5147058823529411, "acc_stderr": 0.03035969707904611, "acc_norm": 0.5147058823529411, "acc_norm_stderr": 0.03035969707904611 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.5392156862745098, "acc_stderr": 0.0201655233139079, "acc_norm": 0.5392156862745098, "acc_norm_stderr": 0.0201655233139079 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6727272727272727, "acc_stderr": 0.04494290866252089, "acc_norm": 0.6727272727272727, 
"acc_norm_stderr": 0.04494290866252089 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.6653061224489796, "acc_stderr": 0.03020923522624231, "acc_norm": 0.6653061224489796, "acc_norm_stderr": 0.03020923522624231 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8009950248756219, "acc_stderr": 0.028231365092758406, "acc_norm": 0.8009950248756219, "acc_norm_stderr": 0.028231365092758406 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.79, "acc_stderr": 0.040936018074033256, "acc_norm": 0.79, "acc_norm_stderr": 0.040936018074033256 }, "harness|hendrycksTest-virology|5": { "acc": 0.45180722891566266, "acc_stderr": 0.03874371556587953, "acc_norm": 0.45180722891566266, "acc_norm_stderr": 0.03874371556587953 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.7485380116959064, "acc_stderr": 0.033275044238468436, "acc_norm": 0.7485380116959064, "acc_norm_stderr": 0.033275044238468436 }, "harness|truthfulqa:mc|0": { "mc1": 0.2876376988984088, "mc1_stderr": 0.01584631510139481, "mc2": 0.4293654330376981, "mc2_stderr": 0.014566008003477715 }, "harness|winogrande|5": { "acc": 0.7277032359905288, "acc_stderr": 0.01251069799145394 }, "harness|gsm8k|5": { "acc": 0.4351781652767248, "acc_stderr": 0.013656253875470736 } } ```
Provided by:
open-llm-leaderboard-old
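Original Information Summary

Dataset Overview

Dataset Introduction

This dataset was created automatically during an evaluation run of the model CohereForAI/aya-23-8B for the Open LLM Leaderboard.

Dataset Structure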

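  • Number of configurations: 63 configurations, each corresponding to one evaluated task.
  • Data source: the dataset was created from 1 run. Each run appears as a specific split in each configuration, and the split is named with the run's timestamp.
  • Latest results: the "latest" split, as named in the configuration listing below, always points to the most recent results.
  • Results summary: an additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.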

Data Loading Example

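```python
from datasets import load_dataset

# The configuration listing below defines two splits per config:
# the run timestamp and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_CohereForAI__aya-23-8B",
    "harness_winogrande_5",
    split="latest",
)
```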

Latest Results

The following are the latest results, from run 2024-05-25T19:35:27.229141:

```python
{
    "all": {
        "acc": 0.5630484989107499,
        "acc_stderr": 0.033643561998378806,
        "acc_norm": 0.566585264377983,
        "acc_norm_stderr": 0.03432638891884135,
        "mc1": 0.2876376988984088,
        "mc1_stderr": 0.01584631510139481,
        "mc2": 0.4293654330376981,
        "mc2_stderr": 0.014566008003477715
    },
    "harness|arc:challenge|25": {
        "acc": 0.4931740614334471,
        "acc_stderr": 0.014610029151379813,
        "acc_norm": 0.5349829351535836,
        "acc_norm_stderr": 0.01457558392201967
    },
    "harness|hellaswag|10": {
        "acc": 0.5766779525990838,
        "acc_stderr": 0.004930757390897348,
        "acc_norm": 0.7803226448914559,
        "acc_norm_stderr": 0.00413181879771388
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.27,
        "acc_stderr": 0.04461960433384741,
        "acc_norm": 0.27,
        "acc_norm_stderr": 0.04461960433384741
    },
    # Results for the other tasks...
}
```
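Each score in the results is paired with a standard error, so a rough uncertainty band can be derived from the two numbers. The sketch below assumes a normal approximation (value ± 1.96 × stderr); that choice, and the helper name `confidence_interval`, are assumptions of this example and are not specified by the dataset card.

```python
# Values copied from the results excerpt above.
results = {
    "harness|arc:challenge|25": {
        "acc_norm": 0.5349829351535836,
        "acc_norm_stderr": 0.01457558392201967,
    },
    "harness|hellaswag|10": {
        "acc_norm": 0.7803226448914559,
        "acc_norm_stderr": 0.00413181879771388,
    },
}


def confidence_interval(value, stderr, z=1.96):
    """Return (low, high) under a normal approximation (an assumption here)."""
    margin = z * stderr
    return value - margin, value + margin


for task, metrics in results.items():
    low, high = confidence_interval(metrics["acc_norm"], metrics["acc_norm_stderr"])
    print(f"{task}: {metrics['acc_norm']:.4f} (approx. 95% CI {low:.4f} to {high:.4f})")
```

Tasks with few questions, such as the individual hendrycksTest subjects, have much larger standard errors than HellaSwag, so their scores should be compared with more caution.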

Configuration Details

  • config_name: harness_arc_challenge_25

    • split: 2024_05_25T19_35_27.229141
      • **/details_harness|arc:challenge|25_2024-05-25T19-35-27.229141.parquet
    • split: latest
      • **/details_harness|arc:challenge|25_2024-05-25T19-35-27.229141.parquet
  • config_name: harness_gsm8k_5

    • split: 2024_05_25T19_35_27.229141
      • **/details_harness|gsm8k|5_2024-05-25T19-35-27.229141.parquet
    • split: latest
      • **/details_harness|gsm8k|5_2024-05-25T19-35-27.229141.parquet
  • config_name: harness_hellaswag_10

    • split: 2024_05_25T19_35_27.229141
      • **/details_harness|hellaswag|10_2024-05-25T19-35-27.229141.parquet
    • split: latest
      • **/details_harness|hellaswag|10_2024-05-25T19-35-27.229141.parquet
  • config_name: harness_hendrycksTest_5

    • split: 2024_05_25T19_35_27.229141
      • **/details_harness|hendrycksTest-abstract_algebra|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-anatomy|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-astronomy|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-business_ethics|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-college_biology|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-college_chemistry|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-college_computer_science|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-college_mathematics|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-college_medicine|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-college_physics|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-computer_security|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-conceptual_physics|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-econometrics|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-electrical_engineering|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-formal_logic|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-global_facts|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-high_school_biology|5_2024-05-25T19-35-27.229141.parquet
      • **/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-25T19-35-27.229141.parquet

      Other file paths...

}

Collected summary
Dataset introduction
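Construction
The open-llm-leaderboard/details_CohereForAI__aya-23-8B dataset is built automatically and in a standardized form. It comes from the evaluation run of the CohereForAI/aya-23-8B model on the Open LLM Leaderboard, during which the system recorded the model's detailed results across 63 task configurations. Each evaluation run is stored as a separate split named after the run's timestamp. The "train" split always points to the latest results, and the "results" configuration holds the aggregated metrics of all runs, which keeps the data structure clear and traceable.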
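The timestamp-based naming described above can be made concrete with a small parser. This is a minimal sketch that assumes file names follow the pattern visible in the configuration listing (details_harness|<task>|<n_shot>_<timestamp>.parquet) and that a split name is the file timestamp with hyphens replaced by underscores, as in 2024_05_25T19_35_27.229141. The names DetailsFile and parse_details_filename are hypothetical helpers introduced for illustration, not part of any library.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Pattern inferred from the file names in the configuration listing
# (an assumption, not a documented specification).
_DETAILS_RE = re.compile(
    r"details_harness\|(?P<task>[^|]+)\|(?P<n_shot>\d+)_"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}\.\d+)\.parquet$"
)


@dataclass(frozen=True)
class DetailsFile:
    task: str            # e.g. "gsm8k" or "hendrycksTest-anatomy"
    n_shot: int          # number of few-shot examples used in the run
    run_time: datetime   # timestamp of the evaluation run
    split_name: str      # split named after the run timestamp


def parse_details_filename(path: str) -> DetailsFile:
    """Split a details parquet path into task, shot count, and run timestamp."""
    match = _DETAILS_RE.search(path)
    if match is None:
        raise ValueError(f"unrecognized details file name: {path!r}")
    stamp = match.group("stamp")
    return DetailsFile(
        task=match.group("task"),
        n_shot=int(match.group("n_shot")),
        run_time=datetime.strptime(stamp, "%Y-%m-%dT%H-%M-%S.%f"),
        # The listing pairs file stamp 2024-05-25T... with split 2024_05_25T...
        split_name=stamp.replace("-", "_"),
    )


info = parse_details_filename(
    "**/details_harness|gsm8k|5_2024-05-25T19-35-27.229141.parquet"
)
print(info.task, info.n_shot, info.split_name)
```

Parsing the names rather than hard-coding them makes it possible to group files by run when a repository holds more than one timestamped split.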
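Usage
To analyze model performance with this dataset, load it with the Hugging Face datasets library. Specifying the dataset name, a task configuration (such as "harness_winogrande_5"), and a split (such as "train") returns the corresponding evaluation details. The data is stored in Parquet format, which allows efficient reading and processing. This structured access lets researchers compare results across tasks, analyze performance trends, or integrate the data as a benchmark into a broader model evaluation framework.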
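The loading step described above might look like the following sketch. It mirrors the snippet on the dataset card and makes two assumptions: the repository ID is the one quoted in the card (the page URL uses an open-llm-leaderboard-old prefix, so the repository may have moved), and config names follow the harness_<task>_<n_shot> pattern seen in the listing. The helpers config_name and load_details are hypothetical names. The configuration listing shows a split called "latest", while the card text refers to "train", so the split argument may need adjusting.

```python
# Repository ID as quoted in the dataset card (assumption: still valid).
REPO_ID = "open-llm-leaderboard/details_CohereForAI__aya-23-8B"


def config_name(task: str, n_shot: int) -> str:
    """Build a config name such as "harness_winogrande_5"."""
    return f"harness_{task}_{n_shot}"


def load_details(task: str, n_shot: int, split: str = "train"):
    """Load the per-sample evaluation details for one task configuration."""
    # Imported lazily so config_name() stays usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task, n_shot), split=split)


print(config_name("winogrande", 5))
```

Calling load_details("winogrande", 5) needs network access and the datasets package installed.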
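Background and challenges
Background overview
As large language models (LLMs) develop rapidly, evaluating their overall capabilities has become a key step in advancing the technology. The Open LLM Leaderboard project, led by the HuggingFace team, was created to provide a standardized and transparent platform for evaluating model performance. This dataset is part of that project and records the detailed evaluation results of the aya-23-8B model, released by CohereForAI in 2024, on multiple benchmarks. Its core research question is how to systematically quantify LLM performance along several dimensions, including commonsense reasoning, domain knowledge, mathematical computation, and truthfulness. It supplies empirical evidence for comparing and optimizing models and has influenced how the open-source LLM community conducts research.
Current challenges
The domain challenge this dataset addresses is that evaluating large language models is itself a complex, multidimensional problem. It requires a diverse set of benchmark tasks that together cover reasoning ability, factual accuracy, specialized domain knowledge, and ethical judgment, such as the ARC Challenge, HellaSwag, MMLU, and TruthfulQA. The construction challenge is to integrate, efficiently and automatically, the large volume of evaluation data from dozens of heterogeneous subtasks, so that the timestamp, configuration, and performance metrics of each model run can be tracked precisely, archived consistently, and finally aggregated into a reproducible, comparable standard format that supports fair computation and ongoing updates of the leaderboard.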
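Common scenarios
Typical use case
As a component of the Open LLM Leaderboard, the dataset is typically used to give researchers a fine-grained analysis of model performance. Because it covers diverse benchmark tasks such as the ARC Challenge, HellaSwag, MMLU, and TruthfulQA, it allows side-by-side comparison of models on core abilities such as commonsense reasoning, knowledge question answering, and mathematical problem solving. This structured evaluation framework lets the academic community track model progress systematically, identify strengths and weaknesses, and decide where to focus subsequent optimization.
Academic problems addressed
The dataset addresses the problems of standardization and reproducibility in large language model evaluation. Traditional evaluations are often scattered and lack unified metrics, whereas this dataset integrates several established benchmarks into a composite performance measure that spans a wide range of subjects and cognitive dimensions. Its value lies in establishing a transparent, comparable evaluation system that supports objective quantification of model performance and helps shift language model research from an experience-driven to a data-driven approach.
Practical applications
In practice, the dataset provides a basis for model selection and deployment decisions. When enterprises or developers choose a base model for a specific task (such as tutoring, professional consulting, or content generation), they can use the detailed performance report in the dataset to assess how the model performs in the relevant subdomains (such as mathematics, history, or ethics). This makes technology adoption less of a guess, allows model capabilities to be matched to business needs, and improves the reliability and usefulness of the resulting AI solutions.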
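As an illustration of the subdomain lookup described above, the sketch below ranks a handful of MMLU (hendrycksTest) subjects by accuracy. The accuracy values are copied from the results excerpt on the dataset card, rounded to four decimal places. The function name rank_subjects and the keyword-matching rule are assumptions made for this example, not part of the leaderboard tooling.

```python
from typing import Dict, Iterable, List, Tuple

# Per-task accuracies taken from the dataset card's results excerpt,
# rounded to four decimal places.
results: Dict[str, Dict[str, float]] = {
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.27},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.4741},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.5987},
    "harness|hendrycksTest-business_ethics|5": {"acc": 0.49},
    "harness|hendrycksTest-clinical_knowledge|5": {"acc": 0.5623},
    "harness|hendrycksTest-college_biology|5": {"acc": 0.6319},
}


def rank_subjects(
    results: Dict[str, Dict[str, float]], keywords: Iterable[str]
) -> List[Tuple[str, float]]:
    """Return (subject, acc) pairs whose subject contains any keyword,
    sorted from highest to lowest accuracy."""
    wanted = list(keywords)
    ranked = []
    for key, metrics in results.items():
        # Keys look like "harness|<task>|<n_shot>".
        _, task, _ = key.split("|")
        # Drop the "hendrycksTest-" prefix when present.
        subject = task.split("-", 1)[-1]
        if any(word in subject for word in wanted):
            ranked.append((subject, metrics["acc"]))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


for subject, acc in rank_subjects(results, ["anatomy", "clinical", "biology"]):
    print(f"{subject}: {acc:.3f}")
```

The same function applies to the full results file once it is loaded as a dictionary, provided its keys follow the harness|<task>|<n_shot> form shown here.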
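Recent research on the dataset
Latest research directions
As a benchmarking platform, the open-llm-leaderboard datasets continue to support fine-grained analysis of model performance. Current research focuses on the combined evaluation of multilingual and cross-domain abilities, in particular on how models such as aya-23-8B differ across specialized tasks like subject-matter knowledge, logical reasoning, and mathematical problem solving. As models grow in scale and architectures improve, researchers are examining the robustness of evaluation metrics, bias correction, and task generalization. This work provides empirical evidence for model iteration and drives the evolution of standardized evaluation frameworks, which matters for the reliable deployment of AI systems.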
The content above was collected and summarized by 遇见数据集.