
open-llm-leaderboard/details_TeeZee__NEBULA-23B-v1.0

Source: Hugging Face. Updated: 2024-04-21. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_TeeZee__NEBULA-23B-v1.0
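The card for this dataset names each configuration, split, and parquet file after the evaluated task and the run timestamp. The sketch below reproduces that naming from the pairs visible in the card (for example `harness|arc:challenge|25` alongside `harness_arc_challenge_25`). The helper names `config_name`, `split_name`, and `parquet_suffix` are hypothetical, and the replacement rules are inferred from those examples rather than taken from any documented API, so treat them as an assumption.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption: every character other than a letter, digit, or underscore
    is replaced by '_', which matches the pairs shown in the card.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the split name used in the card's configs.

    Assumption: '-' and ':' become '_', while 'T' and '.' are kept.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_suffix(run_timestamp: str) -> str:
    """Map a run timestamp to the form used in parquet and JSON file names.

    Assumption: only ':' is replaced, by '-'.
    """
    return run_timestamp.replace(":", "-")


run = "2024-04-21T04:37:46.733123"
print(config_name("harness|hendrycksTest-abstract_algebra|5"))
print(split_name(run))
print(parquet_suffix(run))
```

A name produced this way could be passed as the second argument to `load_dataset`, together with either the timestamped split or the `latest` split that the card's configs list.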
Description:
--- pretty_name: Evaluation run of TeeZee/NEBULA-23B-v1.0 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [TeeZee/NEBULA-23B-v1.0](https://huggingface.co/TeeZee/NEBULA-23B-v1.0) on the\ \ [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_TeeZee__NEBULA-23B-v1.0\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-04-21T04:37:46.733123](https://huggingface.co/datasets/open-llm-leaderboard/details_TeeZee__NEBULA-23B-v1.0/blob/main/results_2024-04-21T04-37-46.733123.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6462596022657905,\n\ \ \"acc_stderr\": 0.031691217871483365,\n \"acc_norm\": 0.6579155418095005,\n\ \ \"acc_norm_stderr\": 0.03255881834990093,\n \"mc1\": 0.41982864137086906,\n\ \ \"mc1_stderr\": 0.01727703030177577,\n \"mc2\": 0.5759564361377313,\n\ \ \"mc2_stderr\": 0.015144335391267196\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6305460750853242,\n \"acc_stderr\": 0.014104578366491895,\n\ \ \"acc_norm\": 0.6672354948805461,\n \"acc_norm_stderr\": 0.013769863046192304\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6818362875921131,\n\ \ \"acc_stderr\": 0.0046481153223287735,\n \"acc_norm\": 0.8698466440948018,\n\ \ \"acc_norm_stderr\": 0.0033578442491239546\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.39,\n \"acc_stderr\": 0.04902071300001975,\n \ \ \"acc_norm\": 0.39,\n \"acc_norm_stderr\": 0.04902071300001975\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.5407407407407407,\n\ \ \"acc_stderr\": 0.04304979692464241,\n \"acc_norm\": 0.5407407407407407,\n\ \ \"acc_norm_stderr\": 0.04304979692464241\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7828947368421053,\n \"acc_stderr\": 0.033550453048829226,\n\ \ \"acc_norm\": 0.7828947368421053,\n \"acc_norm_stderr\": 0.033550453048829226\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.73,\n\ \ \"acc_stderr\": 0.04461960433384741,\n \"acc_norm\": 0.73,\n \ \ \"acc_norm_stderr\": 0.04461960433384741\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.6754716981132075,\n \"acc_stderr\": 0.028815615713432104,\n\ \ \"acc_norm\": 0.6754716981132075,\n \"acc_norm_stderr\": 0.028815615713432104\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7430555555555556,\n\ \ \"acc_stderr\": 0.03653946969442099,\n \"acc_norm\": 0.7430555555555556,\n\ \ \"acc_norm_stderr\": 
0.03653946969442099\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.42,\n \"acc_stderr\": 0.04960449637488584,\n \ \ \"acc_norm\": 0.42,\n \"acc_norm_stderr\": 0.04960449637488584\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.56,\n \"acc_stderr\": 0.04988876515698589,\n \"acc_norm\": 0.56,\n\ \ \"acc_norm_stderr\": 0.04988876515698589\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.046882617226215034,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.046882617226215034\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.630057803468208,\n\ \ \"acc_stderr\": 0.036812296333943194,\n \"acc_norm\": 0.630057803468208,\n\ \ \"acc_norm_stderr\": 0.036812296333943194\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.35294117647058826,\n \"acc_stderr\": 0.04755129616062947,\n\ \ \"acc_norm\": 0.35294117647058826,\n \"acc_norm_stderr\": 0.04755129616062947\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.73,\n \"acc_stderr\": 0.044619604333847394,\n \"acc_norm\": 0.73,\n\ \ \"acc_norm_stderr\": 0.044619604333847394\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.6127659574468085,\n \"acc_stderr\": 0.03184389265339525,\n\ \ \"acc_norm\": 0.6127659574468085,\n \"acc_norm_stderr\": 0.03184389265339525\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5350877192982456,\n\ \ \"acc_stderr\": 0.046920083813689104,\n \"acc_norm\": 0.5350877192982456,\n\ \ \"acc_norm_stderr\": 0.046920083813689104\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.593103448275862,\n \"acc_stderr\": 0.04093793981266236,\n\ \ \"acc_norm\": 0.593103448275862,\n \"acc_norm_stderr\": 0.04093793981266236\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.47883597883597884,\n \"acc_stderr\": 0.025728230952130726,\n \"\ acc_norm\": 
0.47883597883597884,\n \"acc_norm_stderr\": 0.025728230952130726\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.42063492063492064,\n\ \ \"acc_stderr\": 0.04415438226743744,\n \"acc_norm\": 0.42063492063492064,\n\ \ \"acc_norm_stderr\": 0.04415438226743744\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.38,\n \"acc_stderr\": 0.04878317312145632,\n \ \ \"acc_norm\": 0.38,\n \"acc_norm_stderr\": 0.04878317312145632\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7741935483870968,\n\ \ \"acc_stderr\": 0.023785577884181012,\n \"acc_norm\": 0.7741935483870968,\n\ \ \"acc_norm_stderr\": 0.023785577884181012\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.5024630541871922,\n \"acc_stderr\": 0.03517945038691063,\n\ \ \"acc_norm\": 0.5024630541871922,\n \"acc_norm_stderr\": 0.03517945038691063\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \"acc_norm\"\ : 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.8181818181818182,\n \"acc_stderr\": 0.030117688929503575,\n\ \ \"acc_norm\": 0.8181818181818182,\n \"acc_norm_stderr\": 0.030117688929503575\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.8383838383838383,\n \"acc_stderr\": 0.026225919863629283,\n \"\ acc_norm\": 0.8383838383838383,\n \"acc_norm_stderr\": 0.026225919863629283\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.9119170984455959,\n \"acc_stderr\": 0.02045374660160103,\n\ \ \"acc_norm\": 0.9119170984455959,\n \"acc_norm_stderr\": 0.02045374660160103\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6564102564102564,\n \"acc_stderr\": 0.02407869658063547,\n \ \ \"acc_norm\": 0.6564102564102564,\n \"acc_norm_stderr\": 0.02407869658063547\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.36666666666666664,\n \"acc_stderr\": 0.029381620726465073,\n \ \ \"acc_norm\": 0.36666666666666664,\n \"acc_norm_stderr\": 0.029381620726465073\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6932773109243697,\n \"acc_stderr\": 0.02995382389188703,\n \ \ \"acc_norm\": 0.6932773109243697,\n \"acc_norm_stderr\": 0.02995382389188703\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.3443708609271523,\n \"acc_stderr\": 0.038796870240733264,\n \"\ acc_norm\": 0.3443708609271523,\n \"acc_norm_stderr\": 0.038796870240733264\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8366972477064221,\n \"acc_stderr\": 0.015848255806501562,\n \"\ acc_norm\": 0.8366972477064221,\n \"acc_norm_stderr\": 0.015848255806501562\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.6018518518518519,\n \"acc_stderr\": 0.033384734032074016,\n \"\ acc_norm\": 0.6018518518518519,\n \"acc_norm_stderr\": 0.033384734032074016\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8480392156862745,\n \"acc_stderr\": 0.025195658428931796,\n \"\ acc_norm\": 0.8480392156862745,\n \"acc_norm_stderr\": 0.025195658428931796\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.8227848101265823,\n \"acc_stderr\": 0.024856364184503214,\n \ \ \"acc_norm\": 0.8227848101265823,\n \"acc_norm_stderr\": 0.024856364184503214\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.695067264573991,\n\ \ \"acc_stderr\": 0.030898610882477515,\n \"acc_norm\": 0.695067264573991,\n\ \ \"acc_norm_stderr\": 0.030898610882477515\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.6717557251908397,\n \"acc_stderr\": 0.04118438565806298,\n\ \ \"acc_norm\": 0.6717557251908397,\n \"acc_norm_stderr\": 0.04118438565806298\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7603305785123967,\n \"acc_stderr\": 0.03896878985070417,\n \"\ acc_norm\": 0.7603305785123967,\n \"acc_norm_stderr\": 0.03896878985070417\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7962962962962963,\n\ \ \"acc_stderr\": 0.03893542518824847,\n \"acc_norm\": 0.7962962962962963,\n\ \ \"acc_norm_stderr\": 0.03893542518824847\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7607361963190185,\n \"acc_stderr\": 0.033519538795212696,\n\ \ \"acc_norm\": 0.7607361963190185,\n \"acc_norm_stderr\": 0.033519538795212696\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.45535714285714285,\n\ \ \"acc_stderr\": 0.04726835553719099,\n \"acc_norm\": 0.45535714285714285,\n\ \ \"acc_norm_stderr\": 0.04726835553719099\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7864077669902912,\n \"acc_stderr\": 0.040580420156460344,\n\ \ \"acc_norm\": 0.7864077669902912,\n \"acc_norm_stderr\": 0.040580420156460344\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8803418803418803,\n\ \ \"acc_stderr\": 0.021262719400406957,\n \"acc_norm\": 0.8803418803418803,\n\ \ \"acc_norm_stderr\": 0.021262719400406957\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.75,\n \"acc_stderr\": 0.04351941398892446,\n \ \ \"acc_norm\": 0.75,\n \"acc_norm_stderr\": 0.04351941398892446\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8122605363984674,\n\ \ \"acc_stderr\": 0.013964393769899133,\n \"acc_norm\": 0.8122605363984674,\n\ \ \"acc_norm_stderr\": 0.013964393769899133\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7312138728323699,\n \"acc_stderr\": 0.023868003262500107,\n\ \ \"acc_norm\": 0.7312138728323699,\n \"acc_norm_stderr\": 0.023868003262500107\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.4335195530726257,\n\ \ \"acc_stderr\": 0.01657402721951763,\n 
\"acc_norm\": 0.4335195530726257,\n\ \ \"acc_norm_stderr\": 0.01657402721951763\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7222222222222222,\n \"acc_stderr\": 0.0256468630971379,\n\ \ \"acc_norm\": 0.7222222222222222,\n \"acc_norm_stderr\": 0.0256468630971379\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7138263665594855,\n\ \ \"acc_stderr\": 0.025670259242188933,\n \"acc_norm\": 0.7138263665594855,\n\ \ \"acc_norm_stderr\": 0.025670259242188933\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7716049382716049,\n \"acc_stderr\": 0.023358211840626267,\n\ \ \"acc_norm\": 0.7716049382716049,\n \"acc_norm_stderr\": 0.023358211840626267\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.5106382978723404,\n \"acc_stderr\": 0.02982074719142244,\n \ \ \"acc_norm\": 0.5106382978723404,\n \"acc_norm_stderr\": 0.02982074719142244\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4921773142112125,\n\ \ \"acc_stderr\": 0.012768673076111906,\n \"acc_norm\": 0.4921773142112125,\n\ \ \"acc_norm_stderr\": 0.012768673076111906\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.7022058823529411,\n \"acc_stderr\": 0.027778298701545436,\n\ \ \"acc_norm\": 0.7022058823529411,\n \"acc_norm_stderr\": 0.027778298701545436\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6993464052287581,\n \"acc_stderr\": 0.018550634502952964,\n \ \ \"acc_norm\": 0.6993464052287581,\n \"acc_norm_stderr\": 0.018550634502952964\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.7272727272727273,\n\ \ \"acc_stderr\": 0.04265792110940589,\n \"acc_norm\": 0.7272727272727273,\n\ \ \"acc_norm_stderr\": 0.04265792110940589\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7428571428571429,\n \"acc_stderr\": 0.02797982353874455,\n\ \ \"acc_norm\": 0.7428571428571429,\n \"acc_norm_stderr\": 0.02797982353874455\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.845771144278607,\n\ \ \"acc_stderr\": 0.025538433368578337,\n \"acc_norm\": 0.845771144278607,\n\ \ \"acc_norm_stderr\": 0.025538433368578337\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.9,\n \"acc_stderr\": 0.03015113445777634,\n \ \ \"acc_norm\": 0.9,\n \"acc_norm_stderr\": 0.03015113445777634\n },\n\ \ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5301204819277109,\n\ \ \"acc_stderr\": 0.03885425420866766,\n \"acc_norm\": 0.5301204819277109,\n\ \ \"acc_norm_stderr\": 0.03885425420866766\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.7777777777777778,\n \"acc_stderr\": 0.03188578017686398,\n\ \ \"acc_norm\": 0.7777777777777778,\n \"acc_norm_stderr\": 0.03188578017686398\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.41982864137086906,\n\ \ \"mc1_stderr\": 0.01727703030177577,\n \"mc2\": 0.5759564361377313,\n\ \ \"mc2_stderr\": 0.015144335391267196\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.829518547750592,\n \"acc_stderr\": 0.010569021122825902\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.0,\n \"acc_stderr\"\ : 0.0\n }\n}\n```" repo_url: https://huggingface.co/TeeZee/NEBULA-23B-v1.0 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|arc:challenge|25_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-04-21T04-37-46.733123.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|gsm8k|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_04_21T04_37_46.733123 path: - 
'**/details_harness|hellaswag|10_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-21T04-37-46.733123.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-21T04-37-46.733123.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-21T04-37-46.733123.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-21T04-37-46.733123.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-management|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-04-21T04-37-46.733123.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-04-21T04-37-46.733123.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-21T04-37-46.733123.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-management|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-marketing|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-virology|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-04-21T04-37-46.733123.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|truthfulqa:mc|0_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-04-21T04-37-46.733123.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_04_21T04_37_46.733123 path: - '**/details_harness|winogrande|5_2024-04-21T04-37-46.733123.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-04-21T04-37-46.733123.parquet' - config_name: results data_files: - split: 
2024_04_21T04_37_46.733123 path: - results_2024-04-21T04-37-46.733123.parquet - split: latest path: - results_2024-04-21T04-37-46.733123.parquet ---

# Dataset Card for Evaluation run of TeeZee/NEBULA-23B-v1.0

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [TeeZee/NEBULA-23B-v1.0](https://huggingface.co/TeeZee/NEBULA-23B-v1.0) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_TeeZee__NEBULA-23B-v1.0",
    "harness_winogrande_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2024-04-21T04:37:46.733123](https://huggingface.co/datasets/open-llm-leaderboard/details_TeeZee__NEBULA-23B-v1.0/blob/main/results_2024-04-21T04-37-46.733123.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6462596022657905, "acc_stderr": 0.031691217871483365, "acc_norm": 0.6579155418095005, "acc_norm_stderr": 0.03255881834990093, "mc1": 0.41982864137086906, "mc1_stderr": 0.01727703030177577, "mc2": 0.5759564361377313, "mc2_stderr": 0.015144335391267196 }, "harness|arc:challenge|25": { "acc": 0.6305460750853242, "acc_stderr": 0.014104578366491895, "acc_norm": 0.6672354948805461, "acc_norm_stderr": 0.013769863046192304 }, "harness|hellaswag|10": { "acc": 0.6818362875921131, "acc_stderr": 0.0046481153223287735, "acc_norm": 0.8698466440948018, "acc_norm_stderr": 0.0033578442491239546 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.39, "acc_stderr": 0.04902071300001975, "acc_norm": 0.39, "acc_norm_stderr": 0.04902071300001975 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.5407407407407407, "acc_stderr": 0.04304979692464241, "acc_norm": 0.5407407407407407, "acc_norm_stderr": 0.04304979692464241 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.7828947368421053, "acc_stderr": 0.033550453048829226, "acc_norm": 0.7828947368421053, "acc_norm_stderr": 0.033550453048829226 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.73, "acc_stderr": 0.04461960433384741, "acc_norm": 0.73, "acc_norm_stderr": 0.04461960433384741 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6754716981132075, "acc_stderr": 0.028815615713432104, "acc_norm": 0.6754716981132075, "acc_norm_stderr": 0.028815615713432104 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7430555555555556, "acc_stderr": 0.03653946969442099, "acc_norm": 0.7430555555555556, "acc_norm_stderr": 0.03653946969442099 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.42, "acc_stderr": 0.04960449637488584, "acc_norm": 0.42, "acc_norm_stderr": 0.04960449637488584 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.56, "acc_stderr": 0.04988876515698589, "acc_norm": 0.56, 
"acc_norm_stderr": 0.04988876515698589 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.32, "acc_stderr": 0.046882617226215034, "acc_norm": 0.32, "acc_norm_stderr": 0.046882617226215034 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.630057803468208, "acc_stderr": 0.036812296333943194, "acc_norm": 0.630057803468208, "acc_norm_stderr": 0.036812296333943194 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.35294117647058826, "acc_stderr": 0.04755129616062947, "acc_norm": 0.35294117647058826, "acc_norm_stderr": 0.04755129616062947 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.73, "acc_stderr": 0.044619604333847394, "acc_norm": 0.73, "acc_norm_stderr": 0.044619604333847394 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.6127659574468085, "acc_stderr": 0.03184389265339525, "acc_norm": 0.6127659574468085, "acc_norm_stderr": 0.03184389265339525 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.5350877192982456, "acc_stderr": 0.046920083813689104, "acc_norm": 0.5350877192982456, "acc_norm_stderr": 0.046920083813689104 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.593103448275862, "acc_stderr": 0.04093793981266236, "acc_norm": 0.593103448275862, "acc_norm_stderr": 0.04093793981266236 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.47883597883597884, "acc_stderr": 0.025728230952130726, "acc_norm": 0.47883597883597884, "acc_norm_stderr": 0.025728230952130726 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.42063492063492064, "acc_stderr": 0.04415438226743744, "acc_norm": 0.42063492063492064, "acc_norm_stderr": 0.04415438226743744 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.38, "acc_stderr": 0.04878317312145632, "acc_norm": 0.38, "acc_norm_stderr": 0.04878317312145632 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7741935483870968, "acc_stderr": 0.023785577884181012, "acc_norm": 0.7741935483870968, "acc_norm_stderr": 0.023785577884181012 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5024630541871922, "acc_stderr": 0.03517945038691063, "acc_norm": 0.5024630541871922, "acc_norm_stderr": 0.03517945038691063 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.8181818181818182, "acc_stderr": 0.030117688929503575, "acc_norm": 0.8181818181818182, "acc_norm_stderr": 0.030117688929503575 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.8383838383838383, "acc_stderr": 0.026225919863629283, "acc_norm": 0.8383838383838383, "acc_norm_stderr": 0.026225919863629283 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.9119170984455959, "acc_stderr": 0.02045374660160103, "acc_norm": 0.9119170984455959, "acc_norm_stderr": 0.02045374660160103 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6564102564102564, "acc_stderr": 0.02407869658063547, "acc_norm": 0.6564102564102564, "acc_norm_stderr": 0.02407869658063547 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.36666666666666664, "acc_stderr": 0.029381620726465073, "acc_norm": 0.36666666666666664, "acc_norm_stderr": 0.029381620726465073 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6932773109243697, "acc_stderr": 0.02995382389188703, "acc_norm": 0.6932773109243697, "acc_norm_stderr": 0.02995382389188703 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3443708609271523, "acc_stderr": 0.038796870240733264, "acc_norm": 0.3443708609271523, "acc_norm_stderr": 0.038796870240733264 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8366972477064221, "acc_stderr": 0.015848255806501562, "acc_norm": 0.8366972477064221, "acc_norm_stderr": 0.015848255806501562 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.6018518518518519, "acc_stderr": 
0.033384734032074016, "acc_norm": 0.6018518518518519, "acc_norm_stderr": 0.033384734032074016 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8480392156862745, "acc_stderr": 0.025195658428931796, "acc_norm": 0.8480392156862745, "acc_norm_stderr": 0.025195658428931796 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.8227848101265823, "acc_stderr": 0.024856364184503214, "acc_norm": 0.8227848101265823, "acc_norm_stderr": 0.024856364184503214 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.695067264573991, "acc_stderr": 0.030898610882477515, "acc_norm": 0.695067264573991, "acc_norm_stderr": 0.030898610882477515 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.6717557251908397, "acc_stderr": 0.04118438565806298, "acc_norm": 0.6717557251908397, "acc_norm_stderr": 0.04118438565806298 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7603305785123967, "acc_stderr": 0.03896878985070417, "acc_norm": 0.7603305785123967, "acc_norm_stderr": 0.03896878985070417 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7962962962962963, "acc_stderr": 0.03893542518824847, "acc_norm": 0.7962962962962963, "acc_norm_stderr": 0.03893542518824847 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7607361963190185, "acc_stderr": 0.033519538795212696, "acc_norm": 0.7607361963190185, "acc_norm_stderr": 0.033519538795212696 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.45535714285714285, "acc_stderr": 0.04726835553719099, "acc_norm": 0.45535714285714285, "acc_norm_stderr": 0.04726835553719099 }, "harness|hendrycksTest-management|5": { "acc": 0.7864077669902912, "acc_stderr": 0.040580420156460344, "acc_norm": 0.7864077669902912, "acc_norm_stderr": 0.040580420156460344 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8803418803418803, "acc_stderr": 0.021262719400406957, "acc_norm": 0.8803418803418803, "acc_norm_stderr": 0.021262719400406957 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.75, "acc_stderr": 
0.04351941398892446, "acc_norm": 0.75, "acc_norm_stderr": 0.04351941398892446 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8122605363984674, "acc_stderr": 0.013964393769899133, "acc_norm": 0.8122605363984674, "acc_norm_stderr": 0.013964393769899133 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7312138728323699, "acc_stderr": 0.023868003262500107, "acc_norm": 0.7312138728323699, "acc_norm_stderr": 0.023868003262500107 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.4335195530726257, "acc_stderr": 0.01657402721951763, "acc_norm": 0.4335195530726257, "acc_norm_stderr": 0.01657402721951763 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7222222222222222, "acc_stderr": 0.0256468630971379, "acc_norm": 0.7222222222222222, "acc_norm_stderr": 0.0256468630971379 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7138263665594855, "acc_stderr": 0.025670259242188933, "acc_norm": 0.7138263665594855, "acc_norm_stderr": 0.025670259242188933 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7716049382716049, "acc_stderr": 0.023358211840626267, "acc_norm": 0.7716049382716049, "acc_norm_stderr": 0.023358211840626267 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.5106382978723404, "acc_stderr": 0.02982074719142244, "acc_norm": 0.5106382978723404, "acc_norm_stderr": 0.02982074719142244 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4921773142112125, "acc_stderr": 0.012768673076111906, "acc_norm": 0.4921773142112125, "acc_norm_stderr": 0.012768673076111906 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.7022058823529411, "acc_stderr": 0.027778298701545436, "acc_norm": 0.7022058823529411, "acc_norm_stderr": 0.027778298701545436 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6993464052287581, "acc_stderr": 0.018550634502952964, "acc_norm": 0.6993464052287581, "acc_norm_stderr": 0.018550634502952964 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.7272727272727273, "acc_stderr": 
0.04265792110940589, "acc_norm": 0.7272727272727273, "acc_norm_stderr": 0.04265792110940589 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7428571428571429, "acc_stderr": 0.02797982353874455, "acc_norm": 0.7428571428571429, "acc_norm_stderr": 0.02797982353874455 }, "harness|hendrycksTest-sociology|5": { "acc": 0.845771144278607, "acc_stderr": 0.025538433368578337, "acc_norm": 0.845771144278607, "acc_norm_stderr": 0.025538433368578337 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.9, "acc_stderr": 0.03015113445777634, "acc_norm": 0.9, "acc_norm_stderr": 0.03015113445777634 }, "harness|hendrycksTest-virology|5": { "acc": 0.5301204819277109, "acc_stderr": 0.03885425420866766, "acc_norm": 0.5301204819277109, "acc_norm_stderr": 0.03885425420866766 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.7777777777777778, "acc_stderr": 0.03188578017686398, "acc_norm": 0.7777777777777778, "acc_norm_stderr": 0.03188578017686398 }, "harness|truthfulqa:mc|0": { "mc1": 0.41982864137086906, "mc1_stderr": 0.01727703030177577, "mc2": 0.5759564361377313, "mc2_stderr": 0.015144335391267196 }, "harness|winogrande|5": { "acc": 0.829518547750592, "acc_stderr": 0.010569021122825902 }, "harness|gsm8k|5": { "acc": 0.0, "acc_stderr": 0.0 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
--> [More Information Needed] ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> [More Information Needed] ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> [More Information Needed] ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> [More Information Needed] ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> [More Information Needed] #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> [More Information Needed] ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> [More Information Needed] #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. 
--> [More Information Needed] #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> [More Information Needed] ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> [More Information Needed] ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]
Provided by:
open-llm-leaderboard
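Summary of the original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model TeeZee/NEBULA-23B-v1.0 for the Open LLM Leaderboard.

Dataset structure

  • Number of configurations: 63, each corresponding to one evaluated task.
  • Source: created from 1 run. Each run is stored as a separate split in each configuration, and the split is named after the run's timestamp.
  • Latest results: the "train" split always points to the latest results.
  • Aggregated results: an additional configuration, "results", stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.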
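The naming scheme described above can be sketched in a few lines. The helper names (`config_name`, `split_name`, `details_repo`) are hypothetical and are not part of any library; the mapping rules are inferred from the pairs visible in this card's YAML header (for example `harness|truthfulqa:mc|0` alongside `harness_truthfulqa_mc_0`), so treat them as an assumption rather than the leaderboard's actual code.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|winogrande|5' to a config name.

    Assumption: every character that is not a letter, digit, or underscore
    is replaced by '_', which matches the pairs shown in the YAML header.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to its split name.

    Assumption: '-' and ':' become '_', while the 'T' separator and the
    fractional-seconds dot are kept unchanged.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def details_repo(model_id: str) -> str:
    """Build the details repository id for a model (inferred pattern)."""
    return "open-llm-leaderboard/details_" + model_id.replace("/", "__")


if __name__ == "__main__":
    print(config_name("harness|hendrycksTest-college_physics|5"))
    print(split_name("2024-04-21T04:37:46.733123"))
    print(details_repo("TeeZee/NEBULA-23B-v1.0"))
```

These helpers only build strings, so they are convenient for iterating over the keys of the results listing and deriving the matching configuration for each task.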

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_TeeZee__NEBULA-23B-v1.0",
    "harness_winogrande_5",
    split="train",
)
```

最新结果

以下是 2024-04-21T04:37:46.733123 运行的最新结果:

{
    "all": { "acc": 0.6462596022657905, "acc_stderr": 0.031691217871483365, "acc_norm": 0.6579155418095005, "acc_norm_stderr": 0.03255881834990093, "mc1": 0.41982864137086906, "mc1_stderr": 0.01727703030177577, "mc2": 0.5759564361377313, "mc2_stderr": 0.015144335391267196 },
    "harness|arc:challenge|25": { "acc": 0.6305460750853242, "acc_stderr": 0.014104578366491895, "acc_norm": 0.6672354948805461, "acc_norm_stderr": 0.013769863046192304 },
    "harness|hellaswag|10": { "acc": 0.6818362875921131, "acc_stderr": 0.0046481153223287735, "acc_norm": 0.8698466440948018, "acc_norm_stderr": 0.0033578442491239546 },
    "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.39, "acc_stderr": 0.04902071300001975, "acc_norm": 0.39, "acc_norm_stderr": 0.04902071300001975 },
    "harness|hendrycksTest-anatomy|5": { "acc": 0.5407407407407407, "acc_stderr": 0.04304979692464241, "acc_norm": 0.5407407407407407, "acc_norm_stderr": 0.04304979692464241 },
    "harness|hendrycksTest-astronomy|5": { "acc": 0.7828947368421053, "acc_stderr": 0.033550453048829226, "acc_norm": 0.7828947368421053, "acc_norm_stderr": 0.033550453048829226 },
    "harness|hendrycksTest-business_ethics|5": { "acc": 0.73, "acc_stderr": 0.04461960433384741, "acc_norm": 0.73, "acc_norm_stderr": 0.04461960433384741 },
    "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6754716981132075, "acc_stderr": 0.028815615713432104, "acc_norm": 0.6754716981132075, "acc_norm_stderr": 0.028815615713432104 },
    "harness|hendrycksTest-college_biology|5": { "acc": 0.7430555555555556, "acc_stderr": 0.03653946969442099, "acc_norm": 0.7430555555555556, "acc_norm_stderr": 0.03653946969442099 },
    "harness|hendrycksTest-college_chemistry|5": { "acc": 0.42, "acc_stderr": 0.04960449637488584, "acc_norm": 0.42, "acc_norm_stderr": 0.04960449637488584 },
    "harness|hendrycksTest-college_computer_science|5": { "acc": 0.56, "acc_stderr": 0.04988876515698589, "acc_norm": 0.56, "acc_norm_stderr": 0.04988876515698589 },
    "harness|hendrycksTest-college_mathematics|5": { "acc": 0.32, "acc_stderr": 0.046882617226215034, "acc_norm": 0.32, "acc_norm_stderr": 0.046882617226215034 },
    "harness|hendrycksTest-college_medicine|5": { "acc": 0.630057803468208, "acc_stderr": 0.036812296333943194, "acc_norm": 0.630057803468208, "acc_norm_stderr": 0.036812296333943194 },
    "harness|hendrycksTest-college_physics|5": { "acc": 0.35294117647058826, "acc_stderr": 0.04755129616062947, "acc_norm": 0.35294117647058826, "acc_norm_stderr": 0.04755129616062947 },
    "harness|hendrycksTest-computer_security|5": { "acc": 0.73, "acc_stderr": 0.044619604333847394, "acc_norm": 0.73, "acc_norm_stderr": 0.044619604333847394 },
    "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.6127659574468085, "acc_stderr": 0.03184389265339525, "acc_norm": 0.6127659574468085, "acc_norm_stderr": 0.03184389265339525 },
    "harness|hendrycksTest-econometrics|5": { "acc": 0.5350877192982456, "acc_stderr": 0.046920083813689104, "acc_norm": 0.5350877192982456, "acc_norm_stderr": 0.046920083813689104 },
    "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.593103448275862, "acc_stderr": 0.04093793981266236, "acc_norm": 0.593103448275862, "acc_norm_stderr": 0.04093793981266236 },
    "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.47883597883597884, "acc_stderr": 0.025728230952130726, "acc_norm": 0.47883597883597884, "acc_norm_stderr": 0.025728230952130726 },
    "harness|hendrycksTest-formal_logic|5": { "acc": 0.42063492063492064, "acc_stderr": 0.04415438226743744, "acc_norm": 0.42063492063492064, "acc_norm_stderr": 0.04415438226743744 },
    "harness|hendrycksTest-global_facts|5": { "acc": 0.38, "acc_stderr": 0.04878317312145632, "acc_norm": 0.38, "acc_norm_stderr": 0.04878317312145632 },
    "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7741935483870968, "acc_stderr": 0.023785577884181012, "acc_norm": 0.7741935483870968, "acc_norm_stderr": 0.023785577884181012 },
    "harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5024630541871922, "acc_stderr": 0.03517945038691063, "acc_norm": 0.5024630541871922, "acc_norm_stderr": 0.03517945038691063 },
    "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 },
    "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.8181818181818182, "acc_stderr": 0.030117688929503575, "acc_norm": 0.8181818181818182, "acc_norm_stderr": 0.030117688929503575 },
    "harness|hendrycksTest-high_school_geography|5": { "acc": 0.8383838383838383, "acc_stderr": 0.026225919863629283, "acc_norm": 0.8383838383838383, "acc_norm_stderr": 0.026225919863629283 },
    "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.9119170984455959, "acc_stderr": 0.02045374660160103, "acc_norm": 0.9119170984455959, "acc_norm_stderr": 0.02045374660160103 },
    "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6564102564102564, "acc_stderr": 0.02407869658063547, "acc_norm": 0.6564102564102564, "acc_norm_stderr": 0.02407869658063547 },
    "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.36666666666666664, "acc_stderr": 0.029381620726465073, "
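The reported `acc_stderr` values can be sanity-checked against the accuracy they accompany. The sketch below is illustrative only: it assumes each question is scored 0 or 1 and that the standard error uses the sample (n - 1) variance, and it assumes the abstract_algebra task has 100 questions, which the card itself does not state. The helper names `binomial_stderr` and `confidence_interval` are hypothetical.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> tuple:
    """Normal-approximation interval around an accuracy (z = 1.96 is roughly 95%)."""
    half_width = z * stderr
    return acc - half_width, acc + half_width


# Values copied from the "harness|hendrycksTest-abstract_algebra|5" entry above.
abstract_algebra_acc = 0.39
reported_stderr = 0.04902071300001975

# ASSUMPTION: 100 questions in this task; the card does not give the sample size.
estimated_stderr = binomial_stderr(abstract_algebra_acc, n=100)
low, high = confidence_interval(abstract_algebra_acc, reported_stderr)

print(f"estimated stderr: {estimated_stderr:.6f} (reported: {reported_stderr:.6f})")
print(f"approximate 95% interval: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the estimate agrees with the reported value, which suggests how wide the uncertainty is on the small per-subject tasks: an interval of roughly 19 percentage points around 0.39.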

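Collected Summary
Dataset Introduction
Construction
In large language model evaluation, quantifying model performance depends on standardized evaluation frameworks. This dataset was generated automatically when the TeeZee/NEBULA-23B-v1.0 model was evaluated on the Open LLM Leaderboard. It contains 63 configurations, each corresponding to one evaluation task, including ARC Challenge, HellaSwag, GSM8K, Winogrande, TruthfulQA, and the MMLU suite covering 57 subjects. The data come from a single evaluation run; each run's results are stored as a split named after the run's timestamp, and the 'train' split always points to the latest results. The dataset also includes a separate configuration named 'results', which aggregates the metrics across all tasks and supports the computation and display of the aggregate scores on the leaderboard.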
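Features
The core feature of this dataset is its structured, multi-task evaluation record. The 63 configurations correspond to different evaluation dimensions, from commonsense reasoning and mathematical problem solving to multi-subject knowledge, and together show how the model performs across varied scenarios. Within each configuration, timestamp-named splits make earlier runs traceable, so changes in model performance can be followed over time. The 'results' configuration provides the aggregated metrics, including accuracy (acc), normalized accuracy (acc_norm), and their standard errors, which makes cross-task comparison and overall assessment possible. The data are stored in Parquet format, which suits large-scale processing and is fast to read.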
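As an illustration of how per-task metrics could be rolled up into one number, the sketch below takes an unweighted (macro) average over the tasks whose key starts with a given prefix. It uses only a few entries copied from the results shown earlier, and the function name `macro_average` is hypothetical. Whether the leaderboard computes its own aggregate in exactly this way is an assumption, not something the card states.

```python
from statistics import mean

# A small subset of the results shown earlier, copied verbatim (not the full set).
results = {
    "harness|arc:challenge|25": {"acc": 0.6305460750853242, "acc_norm": 0.6672354948805461},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.39, "acc_norm": 0.39},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.5407407407407407, "acc_norm": 0.5407407407407407},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.7828947368421053, "acc_norm": 0.7828947368421053},
}


def macro_average(results: dict, task_prefix: str, metric: str = "acc") -> float:
    """Unweighted mean of `metric` over every task whose key starts with `task_prefix`."""
    scores = [
        metrics[metric]
        for task, metrics in results.items()
        if task.startswith(task_prefix) and metric in metrics
    ]
    if not scores:
        raise ValueError(f"no task matches prefix {task_prefix!r} with metric {metric!r}")
    return mean(scores)


# Average accuracy over the three MMLU (hendrycksTest) subjects in this subset.
mmlu_subset_acc = macro_average(results, "harness|hendrycksTest-")
print(f"MMLU subset accuracy: {mmlu_subset_acc:.4f}")
```

A macro average gives every subject the same weight regardless of how many questions it has; a question-weighted average would need the per-task sample sizes, which the results shown here do not include.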
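Usage
Researchers can load the dataset with the Hugging Face datasets library. For example, calling load_dataset with a task configuration (such as 'harness_winogrande_5') and split="train" returns the latest evaluation results. To access a specific run, use its timestamp-named split (such as '2024_04_21T04_37_46.733123'). The 'results' configuration gives direct access to the aggregated metrics, which is convenient for producing a quick performance report or comparing several models.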
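A minimal sketch of this loading pattern follows. The repository ID, configuration name, and `load_dataset` call come from the dataset card; the helper names `run_timestamp_to_split` and `load_details` are hypothetical, and the timestamp-to-split mapping is inferred from the single run timestamp and split name quoted on this page rather than from any documented rule.

```python
REPO_ID = "open-llm-leaderboard/details_TeeZee__NEBULA-23B-v1.0"


def run_timestamp_to_split(timestamp: str) -> str:
    """Map a run timestamp to its split name.

    ASSUMPTION: hyphens and colons become underscores, as in the one example
    on this page ('2024-04-21T04:37:46.733123' -> '2024_04_21T04_37_46.733123').
    """
    return timestamp.replace("-", "_").replace(":", "_")


def load_details(config_name: str = "harness_winogrande_5", split: str = "train"):
    """Load one task configuration; 'train' points to the latest run."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)


print(run_timestamp_to_split("2024-04-21T04:37:46.733123"))

# Usage (requires network access and the `datasets` package):
#   latest = load_details("harness_winogrande_5")
#   one_run = load_details(
#       "harness_winogrande_5",
#       split=run_timestamp_to_split("2024-04-21T04:37:46.733123"),
#   )
```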
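Background and Challenges
Background Overview
As large language models (LLMs) have developed rapidly, systematic and standardized evaluation of their performance has become key to progress in the field. Against this background, the Hugging Face team launched the Open LLM Leaderboard in 2023 to give the community an open and transparent benchmark platform for model evaluation. This dataset was built automatically from the evaluation of TeeZee/NEBULA-23B-v1.0 on the Open LLM Leaderboard and records the full evaluation of this 23-billion-parameter model run on April 21, 2024. As a representative model of the NEBULA series, it was evaluated on 63 task configurations, ranging from commonsense reasoning (such as ARC and HellaSwag) to subject knowledge (the 57 MMLU subjects, including medicine, law, and mathematics) and mathematical reasoning (GSM8K). The results give a detailed basis for understanding how the model performs across different cognitive abilities and are a useful reference for comparing LLMs and improving models.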
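Current Challenges
The core challenges of this dataset fall into two areas. First, the domain problem it addresses is the standardization and breadth of LLM evaluation: existing evaluations are often limited to a few benchmarks and fail to reveal a model's real level across multiple abilities. The NEBULA-23B-v1.0 results (for example, a GSM8K mathematical reasoning accuracy of 0.0%) expose a marked weakness on complex reasoning tasks and underline the need for an evaluation system that is both broad and fine-grained. Second, building the dataset involves data integration and ongoing updates: detailed results for 63 heterogeneous tasks must be extracted automatically from a single evaluation run, and the version history across timestamps must be maintained. The evaluation settings of each task configuration (such as 25-shot for ARC and 10-shot for HellaSwag) must also stay consistent, and means and standard errors must be computed correctly when results are aggregated, all of which places strict demands on the robustness and reproducibility of the data pipeline.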
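The few-shot setting of each task is encoded in its result key, so a consistency check on evaluation settings can start by parsing those keys. The sketch below assumes the `suite|task|shots` layout seen in the keys on this page (for example `harness|arc:challenge|25`); the `TaskKey` type and `parse_task_key` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class TaskKey(NamedTuple):
    suite: str        # e.g. "harness"
    task: str         # e.g. "arc:challenge"
    num_fewshot: int  # e.g. 25


def parse_task_key(key: str) -> TaskKey:
    """Split a result key of the assumed form 'suite|task|shots' into its parts."""
    parts = key.split("|")
    if len(parts) != 3:
        raise ValueError(f"unexpected task key format: {key!r}")
    suite, task, shots = parts
    return TaskKey(suite, task, int(shots))


# Keys copied from the results shown on this page.
keys = [
    "harness|arc:challenge|25",
    "harness|hellaswag|10",
    "harness|hendrycksTest-anatomy|5",
]

for key in keys:
    parsed = parse_task_key(key)
    print(f"{parsed.task}: {parsed.num_fewshot}-shot")
```

Keys without exactly two `|` separators, such as the aggregate `all` entry, raise a `ValueError` here and would need to be skipped before parsing.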
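Common Scenarios
Typical Use Cases
This dataset is designed for evaluating large language models on a variety of reasoning tasks and is commonly used to measure a model's combined ability in commonsense reasoning, scientific knowledge, mathematical computation, and reading comprehension. Typical use involves running standardized evaluations on ARC Challenge, HellaSwag, WinoGrande, GSM8K, and the MMLU benchmark covering 57 subject areas, in order to show systematically how well the model generalizes and how robust it is in zero-shot or few-shot settings.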
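Practical Applications
In practice, the dataset is widely used to validate new model versions and to screen model capabilities before commercial deployment. Developers can use the multi-dimensional results to pinpoint weaknesses in specific areas such as mathematical reasoning, legal knowledge, or medical knowledge, and then plan targeted fine-tuning and data augmentation. Educational technology and question-answering systems also commonly use it as a benchmark to check that model output meets a reliable standard of factuality and logical consistency.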
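Derived and Related Work
Work derived from this dataset includes the continuous update mechanism of the Open LLM Leaderboard and many studies aimed at improving specific tasks (such as GSM8K mathematical reasoning). Researchers have further proposed evaluation schemes with dynamic difficulty adjustment and have explored how multi-task joint training affects evaluation scores. This work has broadened the methodology of model evaluation, has led to directions such as few-shot CoT prompt optimization, and has helped the open-source LLM ecosystem mature.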
The above content was collected and summarized by 遇见数据集.