
open-llm-leaderboard-old/details_Locutusque__llama-3-neural-chat-v1-8b

Hugging Face | Updated 2024-04-20 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_Locutusque__llama-3-neural-chat-v1-8b
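The dataset card reproduced further down derives its configuration names, split names, and parquet file patterns from the harness task keys and the run timestamp. The following is a minimal sketch of that naming pattern, inferred only from the entries visible in the card. The function names `config_name`, `split_name`, and `parquet_glob` are hypothetical helpers and not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a results key such as 'harness|arc:challenge|25' into a config name.

    Assumption (inferred from the card): every character that is not a
    letter, digit, or underscore is replaced by an underscore.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(timestamp: str) -> str:
    """Turn a run timestamp into the timestamped split name used in the card."""
    # '-' and ':' become '_'; the fractional-seconds dot is kept.
    return timestamp.replace("-", "_").replace(":", "_")


def parquet_glob(task_key: str, timestamp: str) -> str:
    """Build the parquet path pattern listed under a config's data_files."""
    # File names keep the raw task key but use '-' instead of ':' in the time.
    file_stamp = timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


if __name__ == "__main__":
    run = "2024-04-20T21:23:35.453083"
    for key in ("harness|arc:challenge|25", "harness|hendrycksTest-anatomy|5"):
        print(config_name(key), split_name(run), parquet_glob(key, run))
```

A name produced this way could then be passed as the configuration argument of `load_dataset` from the `datasets` library, together with either the timestamped split or the `latest` split that the configs define. Note that the link above uses the `open-llm-leaderboard-old` namespace while the card text itself refers to `open-llm-leaderboard`, so the repository id may need adjusting.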
Resource description:
--- pretty_name: Evaluation run of Locutusque/llama-3-neural-chat-v1-8b dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Locutusque/llama-3-neural-chat-v1-8b](https://huggingface.co/Locutusque/llama-3-neural-chat-v1-8b)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Locutusque__llama-3-neural-chat-v1-8b\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-04-20T21:23:35.453083](https://huggingface.co/datasets/open-llm-leaderboard/details_Locutusque__llama-3-neural-chat-v1-8b/blob/main/results_2024-04-20T21-23-35.453083.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6463757768465722,\n\ \ \"acc_stderr\": 0.032443331188726734,\n \"acc_norm\": 0.6495082726667307,\n\ \ \"acc_norm_stderr\": 0.033092506073055875,\n \"mc1\": 0.390452876376989,\n\ \ \"mc1_stderr\": 0.017078230743431448,\n \"mc2\": 0.5634222670773993,\n\ \ \"mc2_stderr\": 0.015351979609326523\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.5827645051194539,\n \"acc_stderr\": 0.014409825518403077,\n\ \ \"acc_norm\": 0.6083617747440273,\n \"acc_norm_stderr\": 0.014264122124938213\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6444931288587931,\n\ \ \"acc_stderr\": 0.004776883632722615,\n \"acc_norm\": 0.8412666799442342,\n\ \ \"acc_norm_stderr\": 0.0036468038997703434\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.37,\n \"acc_stderr\": 0.04852365870939099,\n \ \ \"acc_norm\": 0.37,\n \"acc_norm_stderr\": 0.04852365870939099\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6222222222222222,\n\ \ \"acc_stderr\": 0.04188307537595853,\n \"acc_norm\": 0.6222222222222222,\n\ \ \"acc_norm_stderr\": 0.04188307537595853\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.6710526315789473,\n \"acc_stderr\": 0.03823428969926604,\n\ \ \"acc_norm\": 0.6710526315789473,\n \"acc_norm_stderr\": 0.03823428969926604\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.65,\n\ \ \"acc_stderr\": 0.047937248544110196,\n \"acc_norm\": 0.65,\n \ \ \"acc_norm_stderr\": 0.047937248544110196\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7471698113207547,\n \"acc_stderr\": 0.026749899771241207,\n\ \ \"acc_norm\": 0.7471698113207547,\n \"acc_norm_stderr\": 0.026749899771241207\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7569444444444444,\n\ \ \"acc_stderr\": 0.03586879280080342,\n \"acc_norm\": 0.7569444444444444,\n\ \ \"acc_norm_stderr\": 0.03586879280080342\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.41,\n \"acc_stderr\": 0.049431107042371025,\n \ \ \"acc_norm\": 0.41,\n \"acc_norm_stderr\": 0.049431107042371025\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"\ acc\": 0.48,\n \"acc_stderr\": 0.050211673156867795,\n \"acc_norm\"\ : 0.48,\n \"acc_norm_stderr\": 0.050211673156867795\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.41,\n \"acc_stderr\": 0.049431107042371025,\n \ \ \"acc_norm\": 0.41,\n \"acc_norm_stderr\": 0.049431107042371025\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6069364161849711,\n\ \ \"acc_stderr\": 0.0372424959581773,\n \"acc_norm\": 0.6069364161849711,\n\ \ \"acc_norm_stderr\": 0.0372424959581773\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.4411764705882353,\n \"acc_stderr\": 0.049406356306056595,\n\ \ \"acc_norm\": 0.4411764705882353,\n \"acc_norm_stderr\": 0.049406356306056595\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.8,\n \"acc_stderr\": 0.04020151261036846,\n \"acc_norm\": 0.8,\n\ \ \"acc_norm_stderr\": 0.04020151261036846\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5531914893617021,\n \"acc_stderr\": 0.0325005368436584,\n\ \ \"acc_norm\": 0.5531914893617021,\n \"acc_norm_stderr\": 0.0325005368436584\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5087719298245614,\n\ \ \"acc_stderr\": 0.04702880432049615,\n \"acc_norm\": 0.5087719298245614,\n\ \ \"acc_norm_stderr\": 0.04702880432049615\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.6137931034482759,\n \"acc_stderr\": 0.04057324734419035,\n\ \ \"acc_norm\": 0.6137931034482759,\n \"acc_norm_stderr\": 0.04057324734419035\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.4021164021164021,\n \"acc_stderr\": 0.025253032554997695,\n \"\ acc_norm\": 0.4021164021164021,\n 
\"acc_norm_stderr\": 0.025253032554997695\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.5,\n\ \ \"acc_stderr\": 0.04472135954999579,\n \"acc_norm\": 0.5,\n \ \ \"acc_norm_stderr\": 0.04472135954999579\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.43,\n \"acc_stderr\": 0.04975698519562428,\n \ \ \"acc_norm\": 0.43,\n \"acc_norm_stderr\": 0.04975698519562428\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7548387096774194,\n\ \ \"acc_stderr\": 0.024472243840895504,\n \"acc_norm\": 0.7548387096774194,\n\ \ \"acc_norm_stderr\": 0.024472243840895504\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.49261083743842365,\n \"acc_stderr\": 0.035176035403610084,\n\ \ \"acc_norm\": 0.49261083743842365,\n \"acc_norm_stderr\": 0.035176035403610084\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.67,\n \"acc_stderr\": 0.047258156262526094,\n \"acc_norm\"\ : 0.67,\n \"acc_norm_stderr\": 0.047258156262526094\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7515151515151515,\n \"acc_stderr\": 0.033744026441394036,\n\ \ \"acc_norm\": 0.7515151515151515,\n \"acc_norm_stderr\": 0.033744026441394036\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7626262626262627,\n \"acc_stderr\": 0.0303137105381989,\n \"acc_norm\"\ : 0.7626262626262627,\n \"acc_norm_stderr\": 0.0303137105381989\n },\n\ \ \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n \ \ \"acc\": 0.8808290155440415,\n \"acc_stderr\": 0.02338193534812143,\n\ \ \"acc_norm\": 0.8808290155440415,\n \"acc_norm_stderr\": 0.02338193534812143\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6,\n \"acc_stderr\": 0.02483881198803316,\n \"acc_norm\"\ : 0.6,\n \"acc_norm_stderr\": 0.02483881198803316\n },\n \"harness|hendrycksTest-high_school_mathematics|5\"\ : {\n \"acc\": 0.3592592592592593,\n 
\"acc_stderr\": 0.029252905927251976,\n\ \ \"acc_norm\": 0.3592592592592593,\n \"acc_norm_stderr\": 0.029252905927251976\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.7310924369747899,\n \"acc_stderr\": 0.028801392193631276,\n\ \ \"acc_norm\": 0.7310924369747899,\n \"acc_norm_stderr\": 0.028801392193631276\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.423841059602649,\n \"acc_stderr\": 0.04034846678603397,\n \"acc_norm\"\ : 0.423841059602649,\n \"acc_norm_stderr\": 0.04034846678603397\n },\n\ \ \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\": 0.8165137614678899,\n\ \ \"acc_stderr\": 0.0165952597103993,\n \"acc_norm\": 0.8165137614678899,\n\ \ \"acc_norm_stderr\": 0.0165952597103993\n },\n \"harness|hendrycksTest-high_school_statistics|5\"\ : {\n \"acc\": 0.4675925925925926,\n \"acc_stderr\": 0.03402801581358966,\n\ \ \"acc_norm\": 0.4675925925925926,\n \"acc_norm_stderr\": 0.03402801581358966\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8137254901960784,\n \"acc_stderr\": 0.027325470966716312,\n \"\ acc_norm\": 0.8137254901960784,\n \"acc_norm_stderr\": 0.027325470966716312\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7890295358649789,\n \"acc_stderr\": 0.026558372502661916,\n \ \ \"acc_norm\": 0.7890295358649789,\n \"acc_norm_stderr\": 0.026558372502661916\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6905829596412556,\n\ \ \"acc_stderr\": 0.03102441174057221,\n \"acc_norm\": 0.6905829596412556,\n\ \ \"acc_norm_stderr\": 0.03102441174057221\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7404580152671756,\n \"acc_stderr\": 0.03844876139785271,\n\ \ \"acc_norm\": 0.7404580152671756,\n \"acc_norm_stderr\": 0.03844876139785271\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.8181818181818182,\n \"acc_stderr\": 0.035208939510976506,\n \"\ 
acc_norm\": 0.8181818181818182,\n \"acc_norm_stderr\": 0.035208939510976506\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.6944444444444444,\n\ \ \"acc_stderr\": 0.04453197507374983,\n \"acc_norm\": 0.6944444444444444,\n\ \ \"acc_norm_stderr\": 0.04453197507374983\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7730061349693251,\n \"acc_stderr\": 0.03291099578615769,\n\ \ \"acc_norm\": 0.7730061349693251,\n \"acc_norm_stderr\": 0.03291099578615769\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5892857142857143,\n\ \ \"acc_stderr\": 0.04669510663875191,\n \"acc_norm\": 0.5892857142857143,\n\ \ \"acc_norm_stderr\": 0.04669510663875191\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7864077669902912,\n \"acc_stderr\": 0.040580420156460344,\n\ \ \"acc_norm\": 0.7864077669902912,\n \"acc_norm_stderr\": 0.040580420156460344\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8418803418803419,\n\ \ \"acc_stderr\": 0.023902325549560406,\n \"acc_norm\": 0.8418803418803419,\n\ \ \"acc_norm_stderr\": 0.023902325549560406\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.79,\n \"acc_stderr\": 0.040936018074033256,\n \ \ \"acc_norm\": 0.79,\n \"acc_norm_stderr\": 0.040936018074033256\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8148148148148148,\n\ \ \"acc_stderr\": 0.013890862162876164,\n \"acc_norm\": 0.8148148148148148,\n\ \ \"acc_norm_stderr\": 0.013890862162876164\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7023121387283237,\n \"acc_stderr\": 0.024617055388676992,\n\ \ \"acc_norm\": 0.7023121387283237,\n \"acc_norm_stderr\": 0.024617055388676992\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.42681564245810055,\n\ \ \"acc_stderr\": 0.016542401954631917,\n \"acc_norm\": 0.42681564245810055,\n\ \ \"acc_norm_stderr\": 0.016542401954631917\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n 
\"acc\": 0.738562091503268,\n \"acc_stderr\": 0.025160998214292456,\n\ \ \"acc_norm\": 0.738562091503268,\n \"acc_norm_stderr\": 0.025160998214292456\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7331189710610932,\n\ \ \"acc_stderr\": 0.02512263760881665,\n \"acc_norm\": 0.7331189710610932,\n\ \ \"acc_norm_stderr\": 0.02512263760881665\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7314814814814815,\n \"acc_stderr\": 0.024659685185967294,\n\ \ \"acc_norm\": 0.7314814814814815,\n \"acc_norm_stderr\": 0.024659685185967294\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.475177304964539,\n \"acc_stderr\": 0.02979071924382972,\n \ \ \"acc_norm\": 0.475177304964539,\n \"acc_norm_stderr\": 0.02979071924382972\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.43415906127770537,\n\ \ \"acc_stderr\": 0.01265903323706725,\n \"acc_norm\": 0.43415906127770537,\n\ \ \"acc_norm_stderr\": 0.01265903323706725\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6691176470588235,\n \"acc_stderr\": 0.028582709753898445,\n\ \ \"acc_norm\": 0.6691176470588235,\n \"acc_norm_stderr\": 0.028582709753898445\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6944444444444444,\n \"acc_stderr\": 0.018635594034423983,\n \ \ \"acc_norm\": 0.6944444444444444,\n \"acc_norm_stderr\": 0.018635594034423983\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6727272727272727,\n\ \ \"acc_stderr\": 0.0449429086625209,\n \"acc_norm\": 0.6727272727272727,\n\ \ \"acc_norm_stderr\": 0.0449429086625209\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7551020408163265,\n \"acc_stderr\": 0.027529637440174934,\n\ \ \"acc_norm\": 0.7551020408163265,\n \"acc_norm_stderr\": 0.027529637440174934\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.835820895522388,\n\ \ \"acc_stderr\": 0.026193923544454125,\n \"acc_norm\": 
0.835820895522388,\n\ \ \"acc_norm_stderr\": 0.026193923544454125\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.84,\n \"acc_stderr\": 0.03684529491774708,\n \ \ \"acc_norm\": 0.84,\n \"acc_norm_stderr\": 0.03684529491774708\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5120481927710844,\n\ \ \"acc_stderr\": 0.03891364495835816,\n \"acc_norm\": 0.5120481927710844,\n\ \ \"acc_norm_stderr\": 0.03891364495835816\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8245614035087719,\n \"acc_stderr\": 0.029170885500727665,\n\ \ \"acc_norm\": 0.8245614035087719,\n \"acc_norm_stderr\": 0.029170885500727665\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.390452876376989,\n\ \ \"mc1_stderr\": 0.017078230743431448,\n \"mc2\": 0.5634222670773993,\n\ \ \"mc2_stderr\": 0.015351979609326523\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7821625887924231,\n \"acc_stderr\": 0.011601066079939324\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.5481425322213799,\n \ \ \"acc_stderr\": 0.013708494995677646\n }\n}\n```" repo_url: https://huggingface.co/Locutusque/llama-3-neural-chat-v1-8b leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|arc:challenge|25_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-04-20T21-23-35.453083.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|gsm8k|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hellaswag|10_2024-04-20T21-23-35.453083.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-20T21-23-35.453083.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-20T21-23-35.453083.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-20T21-23-35.453083.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-20T21-23-35.453083.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-20T21-23-35.453083.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-20T21-23-35.453083.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-20T21-23-35.453083.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-management|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-virology|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-20T21-23-35.453083.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|truthfulqa:mc|0_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-04-20T21-23-35.453083.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_04_20T21_23_35.453083 path: - '**/details_harness|winogrande|5_2024-04-20T21-23-35.453083.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-04-20T21-23-35.453083.parquet' - config_name: results data_files: - split: 
2024_04_20T21_23_35.453083 path: - results_2024-04-20T21-23-35.453083.parquet - split: latest path: - results_2024-04-20T21-23-35.453083.parquet ---

# Dataset Card for Evaluation run of Locutusque/llama-3-neural-chat-v1-8b

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [Locutusque/llama-3-neural-chat-v1-8b](https://huggingface.co/Locutusque/llama-3-neural-chat-v1-8b) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Locutusque__llama-3-neural-chat-v1-8b",
    "harness_winogrande_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2024-04-20T21:23:35.453083](https://huggingface.co/datasets/open-llm-leaderboard/details_Locutusque__llama-3-neural-chat-v1-8b/blob/main/results_2024-04-20T21-23-35.453083.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6463757768465722, "acc_stderr": 0.032443331188726734, "acc_norm": 0.6495082726667307, "acc_norm_stderr": 0.033092506073055875, "mc1": 0.390452876376989, "mc1_stderr": 0.017078230743431448, "mc2": 0.5634222670773993, "mc2_stderr": 0.015351979609326523 }, "harness|arc:challenge|25": { "acc": 0.5827645051194539, "acc_stderr": 0.014409825518403077, "acc_norm": 0.6083617747440273, "acc_norm_stderr": 0.014264122124938213 }, "harness|hellaswag|10": { "acc": 0.6444931288587931, "acc_stderr": 0.004776883632722615, "acc_norm": 0.8412666799442342, "acc_norm_stderr": 0.0036468038997703434 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.37, "acc_stderr": 0.04852365870939099, "acc_norm": 0.37, "acc_norm_stderr": 0.04852365870939099 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6222222222222222, "acc_stderr": 0.04188307537595853, "acc_norm": 0.6222222222222222, "acc_norm_stderr": 0.04188307537595853 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.6710526315789473, "acc_stderr": 0.03823428969926604, "acc_norm": 0.6710526315789473, "acc_norm_stderr": 0.03823428969926604 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.65, "acc_stderr": 0.047937248544110196, "acc_norm": 0.65, "acc_norm_stderr": 0.047937248544110196 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7471698113207547, "acc_stderr": 0.026749899771241207, "acc_norm": 0.7471698113207547, "acc_norm_stderr": 0.026749899771241207 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7569444444444444, "acc_stderr": 0.03586879280080342, "acc_norm": 0.7569444444444444, "acc_norm_stderr": 0.03586879280080342 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.41, "acc_stderr": 0.049431107042371025, "acc_norm": 0.41, "acc_norm_stderr": 0.049431107042371025 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.48, "acc_stderr": 0.050211673156867795, "acc_norm": 0.48, 
"acc_norm_stderr": 0.050211673156867795 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.41, "acc_stderr": 0.049431107042371025, "acc_norm": 0.41, "acc_norm_stderr": 0.049431107042371025 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6069364161849711, "acc_stderr": 0.0372424959581773, "acc_norm": 0.6069364161849711, "acc_norm_stderr": 0.0372424959581773 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.4411764705882353, "acc_stderr": 0.049406356306056595, "acc_norm": 0.4411764705882353, "acc_norm_stderr": 0.049406356306056595 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.8, "acc_stderr": 0.04020151261036846, "acc_norm": 0.8, "acc_norm_stderr": 0.04020151261036846 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5531914893617021, "acc_stderr": 0.0325005368436584, "acc_norm": 0.5531914893617021, "acc_norm_stderr": 0.0325005368436584 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.5087719298245614, "acc_stderr": 0.04702880432049615, "acc_norm": 0.5087719298245614, "acc_norm_stderr": 0.04702880432049615 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.6137931034482759, "acc_stderr": 0.04057324734419035, "acc_norm": 0.6137931034482759, "acc_norm_stderr": 0.04057324734419035 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.4021164021164021, "acc_stderr": 0.025253032554997695, "acc_norm": 0.4021164021164021, "acc_norm_stderr": 0.025253032554997695 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.5, "acc_stderr": 0.04472135954999579, "acc_norm": 0.5, "acc_norm_stderr": 0.04472135954999579 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.43, "acc_stderr": 0.04975698519562428, "acc_norm": 0.43, "acc_norm_stderr": 0.04975698519562428 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7548387096774194, "acc_stderr": 0.024472243840895504, "acc_norm": 0.7548387096774194, "acc_norm_stderr": 0.024472243840895504 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.49261083743842365, "acc_stderr": 0.035176035403610084, "acc_norm": 0.49261083743842365, "acc_norm_stderr": 0.035176035403610084 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.67, "acc_stderr": 0.047258156262526094, "acc_norm": 0.67, "acc_norm_stderr": 0.047258156262526094 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7515151515151515, "acc_stderr": 0.033744026441394036, "acc_norm": 0.7515151515151515, "acc_norm_stderr": 0.033744026441394036 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7626262626262627, "acc_stderr": 0.0303137105381989, "acc_norm": 0.7626262626262627, "acc_norm_stderr": 0.0303137105381989 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8808290155440415, "acc_stderr": 0.02338193534812143, "acc_norm": 0.8808290155440415, "acc_norm_stderr": 0.02338193534812143 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6, "acc_stderr": 0.02483881198803316, "acc_norm": 0.6, "acc_norm_stderr": 0.02483881198803316 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3592592592592593, "acc_stderr": 0.029252905927251976, "acc_norm": 0.3592592592592593, "acc_norm_stderr": 0.029252905927251976 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.7310924369747899, "acc_stderr": 0.028801392193631276, "acc_norm": 0.7310924369747899, "acc_norm_stderr": 0.028801392193631276 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.423841059602649, "acc_stderr": 0.04034846678603397, "acc_norm": 0.423841059602649, "acc_norm_stderr": 0.04034846678603397 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8165137614678899, "acc_stderr": 0.0165952597103993, "acc_norm": 0.8165137614678899, "acc_norm_stderr": 0.0165952597103993 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4675925925925926, "acc_stderr": 0.03402801581358966, "acc_norm": 0.4675925925925926, "acc_norm_stderr": 0.03402801581358966 }, 
"harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8137254901960784, "acc_stderr": 0.027325470966716312, "acc_norm": 0.8137254901960784, "acc_norm_stderr": 0.027325470966716312 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7890295358649789, "acc_stderr": 0.026558372502661916, "acc_norm": 0.7890295358649789, "acc_norm_stderr": 0.026558372502661916 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6905829596412556, "acc_stderr": 0.03102441174057221, "acc_norm": 0.6905829596412556, "acc_norm_stderr": 0.03102441174057221 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7404580152671756, "acc_stderr": 0.03844876139785271, "acc_norm": 0.7404580152671756, "acc_norm_stderr": 0.03844876139785271 }, "harness|hendrycksTest-international_law|5": { "acc": 0.8181818181818182, "acc_stderr": 0.035208939510976506, "acc_norm": 0.8181818181818182, "acc_norm_stderr": 0.035208939510976506 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.6944444444444444, "acc_stderr": 0.04453197507374983, "acc_norm": 0.6944444444444444, "acc_norm_stderr": 0.04453197507374983 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7730061349693251, "acc_stderr": 0.03291099578615769, "acc_norm": 0.7730061349693251, "acc_norm_stderr": 0.03291099578615769 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.5892857142857143, "acc_stderr": 0.04669510663875191, "acc_norm": 0.5892857142857143, "acc_norm_stderr": 0.04669510663875191 }, "harness|hendrycksTest-management|5": { "acc": 0.7864077669902912, "acc_stderr": 0.040580420156460344, "acc_norm": 0.7864077669902912, "acc_norm_stderr": 0.040580420156460344 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8418803418803419, "acc_stderr": 0.023902325549560406, "acc_norm": 0.8418803418803419, "acc_norm_stderr": 0.023902325549560406 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.79, "acc_stderr": 0.040936018074033256, "acc_norm": 0.79, "acc_norm_stderr": 0.040936018074033256 }, 
"harness|hendrycksTest-miscellaneous|5": { "acc": 0.8148148148148148, "acc_stderr": 0.013890862162876164, "acc_norm": 0.8148148148148148, "acc_norm_stderr": 0.013890862162876164 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7023121387283237, "acc_stderr": 0.024617055388676992, "acc_norm": 0.7023121387283237, "acc_norm_stderr": 0.024617055388676992 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.42681564245810055, "acc_stderr": 0.016542401954631917, "acc_norm": 0.42681564245810055, "acc_norm_stderr": 0.016542401954631917 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.738562091503268, "acc_stderr": 0.025160998214292456, "acc_norm": 0.738562091503268, "acc_norm_stderr": 0.025160998214292456 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7331189710610932, "acc_stderr": 0.02512263760881665, "acc_norm": 0.7331189710610932, "acc_norm_stderr": 0.02512263760881665 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7314814814814815, "acc_stderr": 0.024659685185967294, "acc_norm": 0.7314814814814815, "acc_norm_stderr": 0.024659685185967294 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.475177304964539, "acc_stderr": 0.02979071924382972, "acc_norm": 0.475177304964539, "acc_norm_stderr": 0.02979071924382972 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.43415906127770537, "acc_stderr": 0.01265903323706725, "acc_norm": 0.43415906127770537, "acc_norm_stderr": 0.01265903323706725 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6691176470588235, "acc_stderr": 0.028582709753898445, "acc_norm": 0.6691176470588235, "acc_norm_stderr": 0.028582709753898445 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6944444444444444, "acc_stderr": 0.018635594034423983, "acc_norm": 0.6944444444444444, "acc_norm_stderr": 0.018635594034423983 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6727272727272727, "acc_stderr": 0.0449429086625209, "acc_norm": 0.6727272727272727, "acc_norm_stderr": 
0.0449429086625209 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7551020408163265, "acc_stderr": 0.027529637440174934, "acc_norm": 0.7551020408163265, "acc_norm_stderr": 0.027529637440174934 }, "harness|hendrycksTest-sociology|5": { "acc": 0.835820895522388, "acc_stderr": 0.026193923544454125, "acc_norm": 0.835820895522388, "acc_norm_stderr": 0.026193923544454125 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.84, "acc_stderr": 0.03684529491774708, "acc_norm": 0.84, "acc_norm_stderr": 0.03684529491774708 }, "harness|hendrycksTest-virology|5": { "acc": 0.5120481927710844, "acc_stderr": 0.03891364495835816, "acc_norm": 0.5120481927710844, "acc_norm_stderr": 0.03891364495835816 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8245614035087719, "acc_stderr": 0.029170885500727665, "acc_norm": 0.8245614035087719, "acc_norm_stderr": 0.029170885500727665 }, "harness|truthfulqa:mc|0": { "mc1": 0.390452876376989, "mc1_stderr": 0.017078230743431448, "mc2": 0.5634222670773993, "mc2_stderr": 0.015351979609326523 }, "harness|winogrande|5": { "acc": 0.7821625887924231, "acc_stderr": 0.011601066079939324 }, "harness|gsm8k|5": { "acc": 0.5481425322213799, "acc_stderr": 0.013708494995677646 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard-old
Summary of original information

Dataset overview

This dataset was created automatically during the evaluation run of the model Locutusque/llama-3-neural-chat-v1-8b on the Open LLM Leaderboard.

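Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as its own split in every configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.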
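
The naming scheme described in the list above can be sketched in a few lines. This is an illustration only: the helper names `config_name` and `split_name` are hypothetical, and the substitution rules are inferred from the names that appear in this card's YAML front matter (for example `harness_truthfulqa_mc_0` and `2024_04_20T21_23_35.453083`), not taken from the leaderboard's own code.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|truthfulqa:mc|0' to a config name.

    Assumption (inferred from the card): every character outside
    [A-Za-z0-9] is replaced by an underscore.
    """
    return re.sub(r"[^A-Za-z0-9]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the split name used for that run.

    Assumption (inferred from the card): '-' and ':' become '_',
    while the '.' before the microseconds is kept.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


print(config_name("harness|hendrycksTest-college_physics|5"))
print(split_name("2024-04-20T21:23:35.453083"))
```

If the inferred rules hold, the two printed strings match a `config_name` and a `split` entry listed in the front matter.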

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Locutusque__llama-3-neural-chat-v1-8b",
    "harness_winogrande_5",
    split="train",
)
```
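
The loading example asks for a split by name, while the YAML front matter of this card lists a timestamp split and a "latest" split for each configuration. A defensive variant can look up the available splits first instead of assuming one name. The sketch below is a minimal illustration: `choose_split` and `load_details` are hypothetical helpers, and the preference order ("latest", then "train") is an assumption, not documented behaviour.

```python
def choose_split(available, preferred=("latest", "train")):
    """Return the first preferred split name that actually exists."""
    for name in preferred:
        if name in available:
            return name
    raise ValueError(f"none of {preferred} found in {sorted(available)}")


def load_details(repo_id, config):
    """Load one task's details using whichever preferred split is present."""
    # Imported lazily so choose_split works without the library installed.
    from datasets import get_dataset_split_names, load_dataset

    splits = get_dataset_split_names(repo_id, config)
    return load_dataset(repo_id, config, split=choose_split(splits))
```

Calling `load_details("open-llm-leaderboard/details_Locutusque__llama-3-neural-chat-v1-8b", "harness_winogrande_5")` needs network access and the `datasets` package; `choose_split` on its own can be tried with any list of names.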

Latest results

The latest results come from run 2024-04-20T21:23:35.453083. The per-task figures repeat the JSON in the dataset card above. The aggregate "all" entry reports acc 0.6463757768465722 (stderr 0.032443331188726734), acc_norm 0.6495082726667307 (stderr 0.033092506073055875), mc1 0.390452876376989 (stderr 0.017078230743431448), and mc2 0.5634222670773993 (stderr 0.015351979609326523).

Collected Summary
Dataset Introduction
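Construction
This dataset is a by-product of the Open LLM Leaderboard's automated evaluation of the Locutusque/llama-3-neural-chat-v1-8b model. It is organized around the evaluation tasks: it contains 63 configurations, each corresponding to one evaluated benchmark task. The data comes from a single complete evaluation run. Within each configuration the run is stored as a split named after the run's timestamp, and the "train" split always points to the latest results. An additional configuration named "results" collects all aggregated metrics of the run and is used to compute and display the model's overall performance on the leaderboard.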
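The relationship between the task keys in the results JSON (for example `harness|winogrande|5`) and the configuration names (for example `harness_winogrande_5`) can be sketched as below. The helper `task_key_to_config` is hypothetical, and the rule that `|`, `:` and `-` all become underscores is an assumption inferred from the single configuration name shown in the dataset card, not a documented guarantee.

```python
def task_key_to_config(task_key: str) -> str:
    """Map a results key such as 'harness|winogrande|5' to a configuration name.

    Assumption: the separators '|', ':' and '-' are all replaced by '_'.
    """
    for separator in ("|", ":", "-"):
        task_key = task_key.replace(separator, "_")
    return task_key


if __name__ == "__main__":
    for key in ("harness|winogrande|5", "harness|arc:challenge|25"):
        print(key, "->", task_key_to_config(key))
```

If a derived name does not exist in the repository, the configuration list on the dataset page is the authoritative source.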
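Features
The dataset's defining feature is its structured record of evaluation results. Each configuration stores the detailed evaluation data for one task, so researchers can analyze the model's abilities one dimension at a time. The data is stored in Parquet format, which balances read efficiency with storage compression. Because splits are named by timestamp, historical results are preserved and model performance can be tracked over time. The "results" configuration provides the global view: it includes key metrics such as accuracy, normalized accuracy, and their standard errors, giving a common basis for comparing models.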
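Because every metric is reported together with its standard error, a reader can attach an approximate interval to any score. The sketch below uses the ARC-Challenge `acc_norm` values from the results above; the function `confidence_interval` is a hypothetical helper, and the normal approximation with z = 1.96 is an assumption of this example rather than something the dataset prescribes.

```python
from typing import Tuple


def confidence_interval(value: float, stderr: float, z: float = 1.96) -> Tuple[float, float]:
    """Return (low, high) for a normal-approximation interval around `value`."""
    half_width = z * stderr
    return value - half_width, value + half_width


if __name__ == "__main__":
    # acc_norm and acc_norm_stderr for "harness|arc:challenge|25" from the results above.
    low, high = confidence_interval(0.6083617747440273, 0.014264122124938213)
    print(f"ARC-Challenge acc_norm 95% interval: [{low:.4f}, {high:.4f}]")
```

With small subject-level test sets (several MMLU subjects report standard errors near 0.05), such intervals are wide, which is worth keeping in mind before ranking models on a single subject.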
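Usage
Researchers can load the dataset with the Hugging Face datasets library. Specify a configuration name (such as "harness_winogrande_5") and a split name (such as "train") to obtain the latest evaluation details for that task. For example, `data = load_dataset("open-llm-leaderboard/details_Locutusque__llama-3-neural-chat-v1-8b", "harness_winogrande_5", split="train")` loads the evaluation data for the Winogrande task. To retrieve the results of an earlier run, replace the split name with the corresponding timestamp string.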
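Loading a specific run rather than the latest one might look like the sketch below. The helpers `run_timestamp_to_split` and `load_run` are hypothetical, and the split-naming rule (replace `-` and `:` with `_`) is an assumption; check the split names listed for the configuration before relying on it. `load_dataset` is imported inside the function so the naming helper can be used without the datasets library installed or network access.

```python
REPO_ID = "open-llm-leaderboard/details_Locutusque__llama-3-neural-chat-v1-8b"


def run_timestamp_to_split(timestamp: str) -> str:
    """Turn a run timestamp into a split name.

    Assumption: split names replace '-' and ':' with '_'.
    """
    return timestamp.replace("-", "_").replace(":", "_")


def load_run(config: str, split: str = "train"):
    """Load one configuration; the default "train" split points to the latest run."""
    # Imported lazily: requires the `datasets` package and network access.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)


if __name__ == "__main__":
    split_name = run_timestamp_to_split("2024-04-20T21:23:35.453083")
    print(split_name)
    # Uncomment to download the data:
    # data = load_run("harness_winogrande_5", split=split_name)
```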
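Background and Challenges
Background Overview
In the field of large language model (LLM) evaluation, the Open LLM Leaderboard was launched by the Hugging Face team in 2023 to give the open-source community a standardized, reproducible benchmark. This dataset records the details of the April 20, 2024 evaluation run of Locutusque/llama-3-neural-chat-v1-8b. It covers 63 configurations spanning ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. Its purpose is to quantify the model's abilities in reasoning, commonsense, mathematics, and knowledge understanding through fine-grained metrics such as accuracy and its standard error, providing reliable data for comparing LLMs and for iterating on them. As part of the Open LLM Leaderboard ecosystem, the dataset contributes to more transparent and rigorous evaluation of open-source models.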
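Current Challenges
The problem this dataset addresses is the fragmentation and poor reproducibility of LLM evaluation: different institutions use different test sets and metrics, which makes fair comparison difficult. Building it involves several technical challenges. The 63 subtasks come from heterogeneous benchmarks such as HellaSwag and MMLU and must be unified into a standardized Parquet data structure. The timestamps of evaluation logs must also be managed, using splits to separate runs so that the "latest" split always points to the most recent evaluation data. In addition, the metadata design has to support multi-task aggregation (the "results" configuration) and keep queries efficient while storing a large number of fine-grained metrics (such as acc_norm and mc2) in compressed form, which places high demands on the robustness of the data model.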
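Common Scenarios
Typical Use Cases
As an output of the Open LLM Leaderboard evaluation pipeline, the dataset's main use is to record systematically the fine-grained performance of Locutusque/llama-3-neural-chat-v1-8b across 63 tasks. Researchers can use it to reproduce the model's behavior on classic benchmarks such as ARC-Challenge, HellaSwag, and GSM8K, and can parse the per-sample results in each configuration to analyze the limits of the model's abilities in commonsense reasoning, mathematical problem solving, and knowledge question answering. Its structured storage supports tracing results across evaluation rounds by timestamp, providing a standardized reference for comparing model iterations.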
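Practical Applications
In practice, the dataset gives a precise basis for model selection and domain adaptation. Developers can use the fine-grained scores on the 57 subject tests (the hendrycksTest series) to judge quickly whether the model suits a particular professional domain, such as clinical knowledge or computer security. For example, when building a medical question-answering system, one can shortlist model versions whose clinical-knowledge accuracy exceeds 70%. The timestamped splits also support continuous monitoring: after a model is fine-tuned or quantized, comparing the old and new versions' scores on the same tasks verifies whether the optimization was effective.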
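The 70% screening rule described above can be sketched as a simple filter over the results dictionary. The `results` excerpt copies four entries from the results JSON earlier on this page; the function `tasks_above` and its default threshold are illustrative assumptions, not part of the leaderboard tooling.

```python
from typing import Dict, List

# Excerpt of the results shown above (only the "acc" field is kept).
results: Dict[str, Dict[str, float]] = {
    "harness|hendrycksTest-clinical_knowledge|5": {"acc": 0.7471698113207547},
    "harness|hendrycksTest-computer_security|5": {"acc": 0.8},
    "harness|hendrycksTest-college_medicine|5": {"acc": 0.6069364161849711},
    "harness|hendrycksTest-college_chemistry|5": {"acc": 0.41},
}


def tasks_above(
    scores: Dict[str, Dict[str, float]], metric: str = "acc", threshold: float = 0.70
) -> List[str]:
    """Return the task keys whose `metric` strictly exceeds `threshold`, sorted by name."""
    return sorted(
        task for task, metrics in scores.items() if metrics.get(metric, 0.0) > threshold
    )


if __name__ == "__main__":
    for task in tasks_above(results):
        print(task, results[task]["acc"])
```

On this excerpt the filter keeps clinical knowledge and computer security and drops college medicine and college chemistry.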
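Derived and Related Work
The dataset has given rise to a line of research on model evaluation methodology. Building on its per-task results, researchers have developed evaluation visualization tools, such as radar-chart generators and capability-decomposition algorithms, to show how a model's knowledge is distributed across the 57 subjects. Some work goes further and uses the dataset's uncertainty information (the reported standard errors) to build Bayesian inference frameworks for model performance, so that models can be compared more robustly. The dataset has also prompted failure-case analyses on specific tasks such as GSM8K mathematical reasoning, leading to several empirical studies of why a model's chain of reasoning breaks down.
The above content was collected and summarized by 遇见数据集.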