
open-llm-leaderboard-old/details_Community-LM__llava-v1.5-13b-hf

Hugging Face | Updated 2023-10-10 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_Community-LM__llava-v1.5-13b-hf
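The dataset card on this page lists one configuration per evaluated task (for example `harness_arc_challenge_25`) and, for each configuration, a timestamped split plus a `latest` split. The sketch below shows one way to derive those names from a harness task key and a run timestamp, and then load the details. The helper names `config_name`, `split_name`, and `load_details` are hypothetical, and the repository id is an assumption taken from the download link above (the card's own snippet uses the `open-llm-leaderboard/` namespace instead). The naming rule is inferred from the names shown in the card, not from any official specification.

```python
import re

# Assumption: repository id taken from the download link on this page.
REPO = "open-llm-leaderboard-old/details_Community-LM__llava-v1.5-13b-hf"


def config_name(task: str) -> str:
    """Map a harness task key (e.g. 'harness|arc:challenge|25') to a config name.

    Inferred rule: every character that is not a letter, digit, or
    underscore is replaced by an underscore.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task)


def split_name(timestamp: str) -> str:
    """Map a run timestamp to the split name used in the card.

    Inferred rule: '-' and ':' become '_', while the '.' before the
    fractional seconds is kept.
    """
    return timestamp.replace("-", "_").replace(":", "_")


def load_details(task: str, split: str = "latest", repo: str = REPO):
    """Load per-sample details for one task (needs network access)."""
    # Imported lazily so the naming helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo, config_name(task), split=split)


if __name__ == "__main__":
    # Only prints the derived names; no download happens here.
    print(config_name("harness|hendrycksTest-abstract_algebra|5"))
    print(split_name("2023-10-10T14:01:34.065508"))
```

Calling `load_details("harness|truthfulqa:mc|0")` would request the `latest` split of the `harness_truthfulqa_mc_0` configuration. Whether that repository id still resolves is not something this page confirms.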
Resource description:
--- pretty_name: Evaluation run of Community-LM/llava-v1.5-13b-hf dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Community-LM/llava-v1.5-13b-hf](https://huggingface.co/Community-LM/llava-v1.5-13b-hf)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 61 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Community-LM__llava-v1.5-13b-hf\"\ ,\n\t\"harness_truthfulqa_mc_0\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\ \nThese are the [latest results from run 2023-10-10T14:01:34.065508](https://huggingface.co/datasets/open-llm-leaderboard/details_Community-LM__llava-v1.5-13b-hf/blob/main/results_2023-10-10T14-01-34.065508.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.5687974861474466,\n\ \ \"acc_stderr\": 0.034102420636387375,\n \"acc_norm\": 0.5727205361494934,\n\ \ \"acc_norm_stderr\": 0.034085436281331656,\n \"mc1\": 0.3011015911872705,\n\ \ \"mc1_stderr\": 0.016058999026100612,\n \"mc2\": 0.433460825483405,\n\ \ \"mc2_stderr\": 0.01517244922847158\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.5324232081911263,\n \"acc_stderr\": 0.01458063756999542,\n\ \ \"acc_norm\": 0.5614334470989761,\n \"acc_norm_stderr\": 0.014500682618212864\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6011750647281418,\n\ \ \"acc_stderr\": 0.004886559008754983,\n \"acc_norm\": 0.8036247759410476,\n\ \ \"acc_norm_stderr\": 0.003964437012249994\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.4962962962962963,\n\ \ \"acc_stderr\": 0.04319223625811331,\n \"acc_norm\": 0.4962962962962963,\n\ \ \"acc_norm_stderr\": 0.04319223625811331\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.5855263157894737,\n \"acc_stderr\": 0.04008973785779206,\n\ \ \"acc_norm\": 0.5855263157894737,\n \"acc_norm_stderr\": 0.04008973785779206\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.67,\n\ \ \"acc_stderr\": 0.047258156262526094,\n \"acc_norm\": 0.67,\n \ \ \"acc_norm_stderr\": 0.047258156262526094\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.6037735849056604,\n \"acc_stderr\": 0.030102793781791197,\n\ \ \"acc_norm\": 0.6037735849056604,\n \"acc_norm_stderr\": 0.030102793781791197\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.6041666666666666,\n\ \ \"acc_stderr\": 0.04089465449325582,\n \"acc_norm\": 0.6041666666666666,\n\ \ \"acc_norm_stderr\": 0.04089465449325582\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.39,\n \"acc_stderr\": 0.04902071300001975,\n \ \ \"acc_norm\": 0.39,\n \"acc_norm_stderr\": 0.04902071300001975\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.45,\n \"acc_stderr\": 0.049999999999999996,\n \"acc_norm\": 0.45,\n\ \ \"acc_norm_stderr\": 0.049999999999999996\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.36,\n \"acc_stderr\": 0.04824181513244218,\n \ \ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.04824181513244218\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.5375722543352601,\n\ \ \"acc_stderr\": 0.0380168510452446,\n \"acc_norm\": 0.5375722543352601,\n\ \ \"acc_norm_stderr\": 0.0380168510452446\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.2647058823529412,\n \"acc_stderr\": 0.043898699568087764,\n\ \ \"acc_norm\": 0.2647058823529412,\n \"acc_norm_stderr\": 0.043898699568087764\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.68,\n \"acc_stderr\": 0.04688261722621505,\n \"acc_norm\": 0.68,\n\ \ \"acc_norm_stderr\": 0.04688261722621505\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.44680851063829785,\n \"acc_stderr\": 0.0325005368436584,\n\ \ \"acc_norm\": 0.44680851063829785,\n \"acc_norm_stderr\": 0.0325005368436584\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.3333333333333333,\n\ \ \"acc_stderr\": 0.044346007015849245,\n \"acc_norm\": 0.3333333333333333,\n\ \ \"acc_norm_stderr\": 0.044346007015849245\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5103448275862069,\n \"acc_stderr\": 0.04165774775728763,\n\ \ \"acc_norm\": 0.5103448275862069,\n \"acc_norm_stderr\": 0.04165774775728763\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.328042328042328,\n \"acc_stderr\": 0.0241804971643769,\n \"acc_norm\"\ : 0.328042328042328,\n \"acc_norm_stderr\": 
0.0241804971643769\n },\n\ \ \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.35714285714285715,\n\ \ \"acc_stderr\": 0.04285714285714281,\n \"acc_norm\": 0.35714285714285715,\n\ \ \"acc_norm_stderr\": 0.04285714285714281\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7129032258064516,\n\ \ \"acc_stderr\": 0.025736542745594528,\n \"acc_norm\": 0.7129032258064516,\n\ \ \"acc_norm_stderr\": 0.025736542745594528\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.42857142857142855,\n \"acc_stderr\": 0.03481904844438803,\n\ \ \"acc_norm\": 0.42857142857142855,\n \"acc_norm_stderr\": 0.03481904844438803\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.58,\n \"acc_stderr\": 0.049604496374885836,\n \"acc_norm\"\ : 0.58,\n \"acc_norm_stderr\": 0.049604496374885836\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7151515151515152,\n \"acc_stderr\": 0.03524390844511781,\n\ \ \"acc_norm\": 0.7151515151515152,\n \"acc_norm_stderr\": 0.03524390844511781\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7424242424242424,\n \"acc_stderr\": 0.031156269519646836,\n \"\ acc_norm\": 0.7424242424242424,\n \"acc_norm_stderr\": 0.031156269519646836\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8393782383419689,\n \"acc_stderr\": 0.026499057701397433,\n\ \ \"acc_norm\": 0.8393782383419689,\n \"acc_norm_stderr\": 0.026499057701397433\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.5384615384615384,\n \"acc_stderr\": 0.025275892070240644,\n\ \ \"acc_norm\": 0.5384615384615384,\n \"acc_norm_stderr\": 0.025275892070240644\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.3148148148148148,\n \"acc_stderr\": 0.028317533496066475,\n \ \ \"acc_norm\": 0.3148148148148148,\n \"acc_norm_stderr\": 0.028317533496066475\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.5672268907563025,\n \"acc_stderr\": 0.032183581077426124,\n\ \ \"acc_norm\": 0.5672268907563025,\n \"acc_norm_stderr\": 0.032183581077426124\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.26490066225165565,\n \"acc_stderr\": 0.03603038545360384,\n \"\ acc_norm\": 0.26490066225165565,\n \"acc_norm_stderr\": 0.03603038545360384\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.7577981651376147,\n \"acc_stderr\": 0.018368176306598618,\n \"\ acc_norm\": 0.7577981651376147,\n \"acc_norm_stderr\": 0.018368176306598618\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.42592592592592593,\n \"acc_stderr\": 0.03372343271653063,\n \"\ acc_norm\": 0.42592592592592593,\n \"acc_norm_stderr\": 0.03372343271653063\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.7549019607843137,\n \"acc_stderr\": 0.030190282453501947,\n \"\ acc_norm\": 0.7549019607843137,\n \"acc_norm_stderr\": 0.030190282453501947\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7805907172995781,\n \"acc_stderr\": 0.026939106581553945,\n \ \ \"acc_norm\": 0.7805907172995781,\n \"acc_norm_stderr\": 0.026939106581553945\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6681614349775785,\n\ \ \"acc_stderr\": 0.03160295143776678,\n \"acc_norm\": 0.6681614349775785,\n\ \ \"acc_norm_stderr\": 0.03160295143776678\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.6412213740458015,\n \"acc_stderr\": 0.04206739313864908,\n\ \ \"acc_norm\": 0.6412213740458015,\n \"acc_norm_stderr\": 0.04206739313864908\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.71900826446281,\n \"acc_stderr\": 0.04103203830514512,\n \"acc_norm\"\ : 0.71900826446281,\n \"acc_norm_stderr\": 0.04103203830514512\n },\n\ \ \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7314814814814815,\n\ \ \"acc_stderr\": 0.042844679680521934,\n \"acc_norm\": 0.7314814814814815,\n\ \ \"acc_norm_stderr\": 0.042844679680521934\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.6319018404907976,\n \"acc_stderr\": 0.03789213935838396,\n\ \ \"acc_norm\": 0.6319018404907976,\n \"acc_norm_stderr\": 0.03789213935838396\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.41964285714285715,\n\ \ \"acc_stderr\": 0.04684099321077106,\n \"acc_norm\": 0.41964285714285715,\n\ \ \"acc_norm_stderr\": 0.04684099321077106\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7572815533980582,\n \"acc_stderr\": 0.04245022486384493,\n\ \ \"acc_norm\": 0.7572815533980582,\n \"acc_norm_stderr\": 0.04245022486384493\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8461538461538461,\n\ \ \"acc_stderr\": 0.02363687331748928,\n \"acc_norm\": 0.8461538461538461,\n\ \ \"acc_norm_stderr\": 0.02363687331748928\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.64,\n \"acc_stderr\": 0.048241815132442176,\n \ \ \"acc_norm\": 0.64,\n \"acc_norm_stderr\": 0.048241815132442176\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.7739463601532567,\n\ \ \"acc_stderr\": 0.014957458504335835,\n \"acc_norm\": 0.7739463601532567,\n\ \ \"acc_norm_stderr\": 0.014957458504335835\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.6271676300578035,\n \"acc_stderr\": 0.02603389061357628,\n\ \ \"acc_norm\": 0.6271676300578035,\n \"acc_norm_stderr\": 0.02603389061357628\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.3240223463687151,\n\ \ \"acc_stderr\": 0.015652542496421114,\n \"acc_norm\": 
0.3240223463687151,\n\ \ \"acc_norm_stderr\": 0.015652542496421114\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.6078431372549019,\n \"acc_stderr\": 0.027956046165424523,\n\ \ \"acc_norm\": 0.6078431372549019,\n \"acc_norm_stderr\": 0.027956046165424523\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.6237942122186495,\n\ \ \"acc_stderr\": 0.02751392568354943,\n \"acc_norm\": 0.6237942122186495,\n\ \ \"acc_norm_stderr\": 0.02751392568354943\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.6975308641975309,\n \"acc_stderr\": 0.025557653981868045,\n\ \ \"acc_norm\": 0.6975308641975309,\n \"acc_norm_stderr\": 0.025557653981868045\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.4078014184397163,\n \"acc_stderr\": 0.029316011776343555,\n \ \ \"acc_norm\": 0.4078014184397163,\n \"acc_norm_stderr\": 0.029316011776343555\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.41590612777053454,\n\ \ \"acc_stderr\": 0.012588323850313608,\n \"acc_norm\": 0.41590612777053454,\n\ \ \"acc_norm_stderr\": 0.012588323850313608\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.5477941176470589,\n \"acc_stderr\": 0.030233758551596445,\n\ \ \"acc_norm\": 0.5477941176470589,\n \"acc_norm_stderr\": 0.030233758551596445\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.5784313725490197,\n \"acc_stderr\": 0.019977422600227477,\n \ \ \"acc_norm\": 0.5784313725490197,\n \"acc_norm_stderr\": 0.019977422600227477\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6,\n\ \ \"acc_stderr\": 0.0469237132203465,\n \"acc_norm\": 0.6,\n \ \ \"acc_norm_stderr\": 0.0469237132203465\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.6530612244897959,\n \"acc_stderr\": 0.030472526026726496,\n\ \ \"acc_norm\": 0.6530612244897959,\n \"acc_norm_stderr\": 0.030472526026726496\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.7611940298507462,\n\ \ \"acc_stderr\": 0.03014777593540922,\n \"acc_norm\": 0.7611940298507462,\n\ \ \"acc_norm_stderr\": 0.03014777593540922\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.81,\n \"acc_stderr\": 0.03942772444036625,\n \ \ \"acc_norm\": 0.81,\n \"acc_norm_stderr\": 0.03942772444036625\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5060240963855421,\n\ \ \"acc_stderr\": 0.03892212195333045,\n \"acc_norm\": 0.5060240963855421,\n\ \ \"acc_norm_stderr\": 0.03892212195333045\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.7953216374269005,\n \"acc_stderr\": 0.03094445977853321,\n\ \ \"acc_norm\": 0.7953216374269005,\n \"acc_norm_stderr\": 0.03094445977853321\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3011015911872705,\n\ \ \"mc1_stderr\": 0.016058999026100612,\n \"mc2\": 0.433460825483405,\n\ \ \"mc2_stderr\": 0.01517244922847158\n }\n}\n```" repo_url: https://huggingface.co/Community-LM/llava-v1.5-13b-hf leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|arc:challenge|25_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hellaswag|10_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-10T14-01-34.065508.parquet' - 
'**/details_harness|hendrycksTest-astronomy|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-10T14-01-34.065508.parquet' - 
'**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-10T14-01-34.065508.parquet' - 
'**/details_harness|hendrycksTest-nutrition|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-10T14-01-34.065508.parquet' - 
'**/details_harness|hendrycksTest-college_mathematics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-10T14-01-34.065508.parquet' - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-10T14-01-34.065508.parquet' - 
'**/details_harness|hendrycksTest-professional_psychology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-10T14-01-34.065508.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - 
'**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-10T14-01-34.065508.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-management|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-virology|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-10T14-01-34.065508.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_10_10T14_01_34.065508 path: - '**/details_harness|truthfulqa:mc|0_2023-10-10T14-01-34.065508.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-10-10T14-01-34.065508.parquet' - config_name: results data_files: - split: 2023_10_10T14_01_34.065508 path: - results_2023-10-10T14-01-34.065508.parquet - split: latest path: - results_2023-10-10T14-01-34.065508.parquet --- # Dataset Card for Evaluation run of Community-LM/llava-v1.5-13b-hf ## Dataset Description - **Homepage:** - **Repository:** https://huggingface.co/Community-LM/llava-v1.5-13b-hf - **Paper:** - **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard - **Point of Contact:** clementine@hf.co ### Dataset Summary Dataset automatically created during the evaluation run of model 
[Community-LM/llava-v1.5-13b-hf](https://huggingface.co/Community-LM/llava-v1.5-13b-hf) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 61 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Community-LM__llava-v1.5-13b-hf",
    "harness_truthfulqa_mc_0",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-10-10T14:01:34.065508](https://huggingface.co/datasets/open-llm-leaderboard/details_Community-LM__llava-v1.5-13b-hf/blob/main/results_2023-10-10T14-01-34.065508.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.5687974861474466, "acc_stderr": 0.034102420636387375, "acc_norm": 0.5727205361494934, "acc_norm_stderr": 0.034085436281331656, "mc1": 0.3011015911872705, "mc1_stderr": 0.016058999026100612, "mc2": 0.433460825483405, "mc2_stderr": 0.01517244922847158 }, "harness|arc:challenge|25": { "acc": 0.5324232081911263, "acc_stderr": 0.01458063756999542, "acc_norm": 0.5614334470989761, "acc_norm_stderr": 0.014500682618212864 }, "harness|hellaswag|10": { "acc": 0.6011750647281418, "acc_stderr": 0.004886559008754983, "acc_norm": 0.8036247759410476, "acc_norm_stderr": 0.003964437012249994 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.4962962962962963, "acc_stderr": 0.04319223625811331, "acc_norm": 0.4962962962962963, "acc_norm_stderr": 0.04319223625811331 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.5855263157894737, "acc_stderr": 0.04008973785779206, "acc_norm": 0.5855263157894737, "acc_norm_stderr": 0.04008973785779206 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.67, "acc_stderr": 0.047258156262526094, "acc_norm": 0.67, "acc_norm_stderr": 0.047258156262526094 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6037735849056604, "acc_stderr": 0.030102793781791197, "acc_norm": 0.6037735849056604, "acc_norm_stderr": 0.030102793781791197 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.6041666666666666, "acc_stderr": 0.04089465449325582, "acc_norm": 0.6041666666666666, "acc_norm_stderr": 0.04089465449325582 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.39, "acc_stderr": 0.04902071300001975, "acc_norm": 0.39, "acc_norm_stderr": 0.04902071300001975 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.45, "acc_stderr": 0.049999999999999996, "acc_norm": 0.45, 
"acc_norm_stderr": 0.049999999999999996 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.36, "acc_stderr": 0.04824181513244218, "acc_norm": 0.36, "acc_norm_stderr": 0.04824181513244218 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.5375722543352601, "acc_stderr": 0.0380168510452446, "acc_norm": 0.5375722543352601, "acc_norm_stderr": 0.0380168510452446 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.2647058823529412, "acc_stderr": 0.043898699568087764, "acc_norm": 0.2647058823529412, "acc_norm_stderr": 0.043898699568087764 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.68, "acc_stderr": 0.04688261722621505, "acc_norm": 0.68, "acc_norm_stderr": 0.04688261722621505 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.44680851063829785, "acc_stderr": 0.0325005368436584, "acc_norm": 0.44680851063829785, "acc_norm_stderr": 0.0325005368436584 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.3333333333333333, "acc_stderr": 0.044346007015849245, "acc_norm": 0.3333333333333333, "acc_norm_stderr": 0.044346007015849245 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5103448275862069, "acc_stderr": 0.04165774775728763, "acc_norm": 0.5103448275862069, "acc_norm_stderr": 0.04165774775728763 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.328042328042328, "acc_stderr": 0.0241804971643769, "acc_norm": 0.328042328042328, "acc_norm_stderr": 0.0241804971643769 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.35714285714285715, "acc_stderr": 0.04285714285714281, "acc_norm": 0.35714285714285715, "acc_norm_stderr": 0.04285714285714281 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7129032258064516, "acc_stderr": 0.025736542745594528, "acc_norm": 0.7129032258064516, "acc_norm_stderr": 0.025736542745594528 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.42857142857142855, "acc_stderr": 0.03481904844438803, "acc_norm": 0.42857142857142855, "acc_norm_stderr": 0.03481904844438803 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.58, "acc_stderr": 0.049604496374885836, "acc_norm": 0.58, "acc_norm_stderr": 0.049604496374885836 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7151515151515152, "acc_stderr": 0.03524390844511781, "acc_norm": 0.7151515151515152, "acc_norm_stderr": 0.03524390844511781 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7424242424242424, "acc_stderr": 0.031156269519646836, "acc_norm": 0.7424242424242424, "acc_norm_stderr": 0.031156269519646836 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8393782383419689, "acc_stderr": 0.026499057701397433, "acc_norm": 0.8393782383419689, "acc_norm_stderr": 0.026499057701397433 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.5384615384615384, "acc_stderr": 0.025275892070240644, "acc_norm": 0.5384615384615384, "acc_norm_stderr": 0.025275892070240644 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3148148148148148, "acc_stderr": 0.028317533496066475, "acc_norm": 0.3148148148148148, "acc_norm_stderr": 0.028317533496066475 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.5672268907563025, "acc_stderr": 0.032183581077426124, "acc_norm": 0.5672268907563025, "acc_norm_stderr": 0.032183581077426124 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.26490066225165565, "acc_stderr": 0.03603038545360384, "acc_norm": 0.26490066225165565, "acc_norm_stderr": 0.03603038545360384 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.7577981651376147, "acc_stderr": 0.018368176306598618, "acc_norm": 0.7577981651376147, "acc_norm_stderr": 0.018368176306598618 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.42592592592592593, "acc_stderr": 
0.03372343271653063, "acc_norm": 0.42592592592592593, "acc_norm_stderr": 0.03372343271653063 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.7549019607843137, "acc_stderr": 0.030190282453501947, "acc_norm": 0.7549019607843137, "acc_norm_stderr": 0.030190282453501947 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7805907172995781, "acc_stderr": 0.026939106581553945, "acc_norm": 0.7805907172995781, "acc_norm_stderr": 0.026939106581553945 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6681614349775785, "acc_stderr": 0.03160295143776678, "acc_norm": 0.6681614349775785, "acc_norm_stderr": 0.03160295143776678 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.6412213740458015, "acc_stderr": 0.04206739313864908, "acc_norm": 0.6412213740458015, "acc_norm_stderr": 0.04206739313864908 }, "harness|hendrycksTest-international_law|5": { "acc": 0.71900826446281, "acc_stderr": 0.04103203830514512, "acc_norm": 0.71900826446281, "acc_norm_stderr": 0.04103203830514512 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7314814814814815, "acc_stderr": 0.042844679680521934, "acc_norm": 0.7314814814814815, "acc_norm_stderr": 0.042844679680521934 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.6319018404907976, "acc_stderr": 0.03789213935838396, "acc_norm": 0.6319018404907976, "acc_norm_stderr": 0.03789213935838396 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.41964285714285715, "acc_stderr": 0.04684099321077106, "acc_norm": 0.41964285714285715, "acc_norm_stderr": 0.04684099321077106 }, "harness|hendrycksTest-management|5": { "acc": 0.7572815533980582, "acc_stderr": 0.04245022486384493, "acc_norm": 0.7572815533980582, "acc_norm_stderr": 0.04245022486384493 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8461538461538461, "acc_stderr": 0.02363687331748928, "acc_norm": 0.8461538461538461, "acc_norm_stderr": 0.02363687331748928 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.64, "acc_stderr": 
0.048241815132442176, "acc_norm": 0.64, "acc_norm_stderr": 0.048241815132442176 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.7739463601532567, "acc_stderr": 0.014957458504335835, "acc_norm": 0.7739463601532567, "acc_norm_stderr": 0.014957458504335835 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.6271676300578035, "acc_stderr": 0.02603389061357628, "acc_norm": 0.6271676300578035, "acc_norm_stderr": 0.02603389061357628 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.3240223463687151, "acc_stderr": 0.015652542496421114, "acc_norm": 0.3240223463687151, "acc_norm_stderr": 0.015652542496421114 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.6078431372549019, "acc_stderr": 0.027956046165424523, "acc_norm": 0.6078431372549019, "acc_norm_stderr": 0.027956046165424523 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.6237942122186495, "acc_stderr": 0.02751392568354943, "acc_norm": 0.6237942122186495, "acc_norm_stderr": 0.02751392568354943 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.6975308641975309, "acc_stderr": 0.025557653981868045, "acc_norm": 0.6975308641975309, "acc_norm_stderr": 0.025557653981868045 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.4078014184397163, "acc_stderr": 0.029316011776343555, "acc_norm": 0.4078014184397163, "acc_norm_stderr": 0.029316011776343555 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.41590612777053454, "acc_stderr": 0.012588323850313608, "acc_norm": 0.41590612777053454, "acc_norm_stderr": 0.012588323850313608 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.5477941176470589, "acc_stderr": 0.030233758551596445, "acc_norm": 0.5477941176470589, "acc_norm_stderr": 0.030233758551596445 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.5784313725490197, "acc_stderr": 0.019977422600227477, "acc_norm": 0.5784313725490197, "acc_norm_stderr": 0.019977422600227477 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6, "acc_stderr": 
0.0469237132203465, "acc_norm": 0.6, "acc_norm_stderr": 0.0469237132203465 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.6530612244897959, "acc_stderr": 0.030472526026726496, "acc_norm": 0.6530612244897959, "acc_norm_stderr": 0.030472526026726496 }, "harness|hendrycksTest-sociology|5": { "acc": 0.7611940298507462, "acc_stderr": 0.03014777593540922, "acc_norm": 0.7611940298507462, "acc_norm_stderr": 0.03014777593540922 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.81, "acc_stderr": 0.03942772444036625, "acc_norm": 0.81, "acc_norm_stderr": 0.03942772444036625 }, "harness|hendrycksTest-virology|5": { "acc": 0.5060240963855421, "acc_stderr": 0.03892212195333045, "acc_norm": 0.5060240963855421, "acc_norm_stderr": 0.03892212195333045 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.7953216374269005, "acc_stderr": 0.03094445977853321, "acc_norm": 0.7953216374269005, "acc_norm_stderr": 0.03094445977853321 }, "harness|truthfulqa:mc|0": { "mc1": 0.3011015911872705, "mc1_stderr": 0.016058999026100612, "mc2": 0.433460825483405, "mc2_stderr": 0.01517244922847158 } } ``` ### Supported Tasks and Leaderboards [More Information Needed] ### Languages [More Information Needed] ## Dataset Structure ### Data Instances [More Information Needed] ### Data Fields [More Information Needed] ### Data Splits [More Information Needed] ## Dataset Creation ### Curation Rationale [More Information Needed] ### Source Data #### Initial Data Collection and Normalization [More Information Needed] #### Who are the source language producers? [More Information Needed] ### Annotations #### Annotation process [More Information Needed] #### Who are the annotators? 
[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
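Provider:
open-llm-leaderboard-old
Summary of original information

Dataset overview

This dataset was created automatically during the evaluation run of the model Community-LM/llava-v1.5-13b-hf for the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 61 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run can be found as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.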
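
The naming scheme behind these configurations and splits can be read off the YAML metadata of the dataset card. The sketch below reproduces it in plain Python; the three helper names are hypothetical, and the substitution rules are inferred from the entries visible in the card rather than taken from any official specification.

```python
import re


def config_name(task_key: str) -> str:
    """Map a harness task key to its configuration name.

    Inferred rule: every "|", ":" and "-" becomes "_".
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to its split name ("-" and ":" become "_")."""
    return re.sub(r"[-:]", "_", run_timestamp)


def details_filename(task_key: str, run_timestamp: str) -> str:
    """Build the parquet file name; only ":" in the timestamp changes, to "-"."""
    return f"details_{task_key}_{run_timestamp.replace(':', '-')}.parquet"


run = "2023-10-10T14:01:34.065508"
task = "harness|hendrycksTest-college_physics|5"
print(config_name(task))
print(split_name(run))
print(details_filename(task, run))
```

Keeping the three rules in separate functions makes it easy to adjust one of them if a repository turns out to follow a different convention.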

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Community-LM__llava-v1.5-13b-hf",
    "harness_truthfulqa_mc_0",
    split="latest",
)
```

Latest results

The latest results come from run 2023-10-10T14:01:34.065508. Aggregated over all tasks ("all"), the run reports acc 0.5688 (stderr 0.0341), acc_norm 0.5727 (stderr 0.0341), mc1 0.3011 (stderr 0.0161), and mc2 0.4335 (stderr 0.0152). The per-task values are listed in full in the "Latest results" block of the dataset card.

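Collected summary
Dataset introduction
Construction
The dataset was generated during the automated evaluation of the Community-LM/llava-v1.5-13b-hf model under the Open LLM Leaderboard evaluation framework. It contains 61 configurations, each corresponding to one evaluated task. The whole dataset derives from a single complete run. Each run's results are stored in every configuration as a separate split named after the run timestamp, while the 'latest' split always points to the most recent evaluation results. A dedicated 'results' configuration additionally stores the aggregated metrics for all tasks, which support the computation and display of the overall scores on the leaderboard.
Characteristics
The dataset's distinguishing features are its structure and its ability to be updated over time. Its 61 separate configurations systematically cover evaluation tasks ranging from commonsense reasoning and subject knowledge to ethical judgment, such as ARC, HellaSwag, TruthfulQA, and the MMLU benchmark spanning 57 subjects. Each configuration keeps the records of past runs, which makes it possible to trace how model performance changes over time, while the 'latest' split gives researchers convenient access to the most recent evaluation data. This design serves both longitudinal comparison and cross-task evaluation.
Usage
Researchers can load the dataset with the Hugging Face datasets library. For example, calling load_dataset with a target configuration name (such as 'harness_truthfulqa_mc_0') and a split (such as 'latest') returns the latest evaluation details for that task. To analyze historical results, request the split named after the run timestamp directly. The 'results' configuration provides the aggregated metrics for all tasks in one place, which is convenient for analyzing and comparing overall model performance.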
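
The three access patterns described above (latest details, a timestamped run, and the aggregated "results" configuration) can be sketched as one small wrapper. The names `load_request` and `load_details` are hypothetical. The repository id is copied from the dataset card's own example, and since the provider listed on this page is open-llm-leaderboard-old, the id may need adjusting. The code assumes the `datasets` package is installed and that network access is available when `load_details` is called.

```python
REPO_ID = "open-llm-leaderboard/details_Community-LM__llava-v1.5-13b-hf"


def load_request(config="results", run_timestamp=None):
    """Build keyword arguments for datasets.load_dataset.

    With no timestamp, the "latest" split is requested; otherwise the
    timestamp is converted to the split name used in the card's metadata.
    """
    if run_timestamp is None:
        split = "latest"
    else:
        split = run_timestamp.replace("-", "_").replace(":", "_")
    return {"path": REPO_ID, "name": config, "split": split}


def load_details(config="results", run_timestamp=None):
    """Download one configuration/split (requires network access)."""
    # Imported lazily so the helper above works without the package.
    from datasets import load_dataset

    return load_dataset(**load_request(config, run_timestamp))


# Example calls (not executed here because they download data):
#   load_details()                               # aggregated results, latest run
#   load_details("harness_truthfulqa_mc_0")      # latest TruthfulQA details
#   load_details("harness_truthfulqa_mc_0", "2023-10-10T14:01:34.065508")
print(load_request("harness_truthfulqa_mc_0", "2023-10-10T14:01:34.065508"))
```

Separating the request construction from the download keeps the naming logic checkable offline.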
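Background and challenges
Background overview
With the rapid development of large language models in natural language processing, systematically evaluating their capabilities along multiple dimensions has become a shared concern of academia and industry. Against this background, the Hugging Face team launched the Open LLM Leaderboard project in 2023 to build a standardized, reproducible benchmark for model evaluation. This dataset is the complete record of one evaluation, under that project, of the multimodal model llava-v1.5-13b-hf developed by the Community-LM team. It was created on 10 October 2023, in work led by Clementine Fourrier and other researchers. The dataset covers 61 evaluation task configurations, including the ARC Challenge, HellaSwag commonsense reasoning, and the MMLU benchmark spanning 57 subject areas, and systematically examines the model's overall performance in reasoning, knowledge understanding, and factual judgment. These detailed results give model developers a precise performance profile, provide a solid data foundation for later model optimization and cross-model comparison, and have played a key role in advancing the standardization of open language model evaluation.
Current challenges
The core challenge this dataset reflects is the dimensional complexity of evaluating multimodal language models and the difficulty of standardizing that evaluation. First, at the level of the problem domain, llava-v1.5-13b-hf is a vision-language model, so its evaluation needs to cover both textual reasoning and visual understanding. Existing benchmarks such as MMLU span a wide range of subjects but offer no effective measure of visual semantic understanding, so the model's real performance on image-text tasks is hard to characterize fully. Second, building the dataset involves a twofold challenge of metric selection and result comparability: different tasks use different metrics, such as accuracy, normalized accuracy, and multiple-choice matching, and integrating these heterogeneous metrics into a unified representation of performance, while ensuring that results remain statistically significant and reproducible across models and evaluation rounds, is a major technical difficulty. In addition, normalizing data formats in the automated evaluation pipeline, managing run timestamps precisely, and storing and retrieving large volumes of evaluation results place strict demands on the robustness and scalability of the infrastructure.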
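Common scenarios
Classic usage scenario
This dataset is the detailed record of results generated when the Open LLM Leaderboard automatically evaluated the Community-LM/llava-v1.5-13b-hf model. It covers 61 evaluation task configurations, each corresponding to a standardized language understanding and reasoning benchmark. Its classic use is to give researchers of large language models a fine-grained performance breakdown: by loading the evaluation data for a specific task (such as the ARC Challenge, HellaSwag, or MMLU), they can analyze in depth how the model's performance differs across knowledge domains and levels of reasoning difficulty.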
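
A per-subject comparison of this kind can be sketched directly on the results JSON, without downloading anything. The example below uses a small excerpt of the "Latest results" block from the dataset card (values copied verbatim). The helper names `mmlu_accuracies` and `rank_subjects` are hypothetical.

```python
# Excerpt of the "Latest results" JSON from the dataset card.
RESULTS = {
    "all": {"acc": 0.5687974861474466, "acc_norm": 0.5727205361494934},
    "harness|hellaswag|10": {"acc": 0.6011750647281418, "acc_norm": 0.8036247759410476},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.3},
    "harness|hendrycksTest-college_physics|5": {"acc": 0.2647058823529412},
    "harness|hendrycksTest-high_school_physics|5": {"acc": 0.26490066225165565},
    "harness|hendrycksTest-high_school_government_and_politics|5": {"acc": 0.8393782383419689},
    "harness|hendrycksTest-marketing|5": {"acc": 0.8461538461538461},
    "harness|hendrycksTest-us_foreign_policy|5": {"acc": 0.81},
    "harness|truthfulqa:mc|0": {"mc1": 0.3011015911872705, "mc2": 0.433460825483405},
}

MMLU_PREFIX = "hendrycksTest-"


def mmlu_accuracies(results):
    """Return {subject: acc} for the MMLU (hendrycksTest) entries only."""
    scores = {}
    for key, metrics in results.items():
        parts = key.split("|")  # task keys look like "harness|<task>|<n_shots>"
        if len(parts) != 3:
            continue  # skips the aggregate "all" entry
        task = parts[1]
        if task.startswith(MMLU_PREFIX) and "acc" in metrics:
            scores[task[len(MMLU_PREFIX):]] = metrics["acc"]
    return scores


def rank_subjects(results):
    """Sort MMLU subjects from weakest to strongest accuracy."""
    return sorted(mmlu_accuracies(results).items(), key=lambda item: item[1])


ranked = rank_subjects(RESULTS)
print("weakest:", ranked[0])
print("strongest:", ranked[-1])
```

On the full JSON the same function would cover all 57 subjects; the excerpt only keeps the example short.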
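Academic problems addressed
This dataset addresses the academic problem of insufficient reproducibility and transparency in large language model evaluation results. Traditionally, model performance is presented only as aggregate metrics, with no record of performance on individual tasks or samples. By storing the complete results of each evaluation in structured form (including statistics such as accuracy and standard error), the dataset lets researchers trace the model's specific performance across 57 subjects (from abstract algebra to virology), providing a standardized data basis for identifying gaps in the model's knowledge and analyzing biases in its abilities.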
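
The stored standard errors make it possible to attach uncertainty to each accuracy. As a sketch, and as an assumption not stated in the source, the reported values appear consistent with the standard error of a mean of 0/1 scores with Bessel's correction, `sqrt(p * (1 - p) / (n - 1))`. Under that assumption the number of questions behind a score can be recovered, and a normal-approximation 95% interval can be formed. Both function names are hypothetical.

```python
def implied_sample_size(acc, stderr):
    """Solve stderr = sqrt(acc * (1 - acc) / (n - 1)) for n.

    Assumes the standard error was computed with Bessel's correction.
    """
    return 1 + acc * (1 - acc) / stderr ** 2


def confidence_interval(acc, stderr, z=1.96):
    """Normal-approximation interval: acc +/- z * stderr (z=1.96 for ~95%)."""
    return acc - z * stderr, acc + z * stderr


# Values copied from the card's entry for hendrycksTest-abstract_algebra.
acc, stderr = 0.3, 0.046056618647183814
n = implied_sample_size(acc, stderr)
low, high = confidence_interval(acc, stderr)
print(f"implied n = {n:.1f}, 95% CI = ({low:.3f}, {high:.3f})")
```

An interval this wide on a single subject is a reminder that small differences between per-subject scores may not be meaningful.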
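Derived related work
The dataset has given rise to several lines of work. First, it advanced the standardization of the Open LLM Leaderboard evaluation framework, whose configuration structure has been widely adopted in later model evaluations. Second, it spurred fine-grained analysis of multimodal models (such as the LLaVA series): by parsing the task-level scores in this dataset, researchers have revealed the limits of vision-language models on pure-text reasoning tasks. Third, it encouraged the development of tools for visualizing evaluation results, such as radar chart generators based on this data format, which show a model's ability distribution across the 57 subjects at a glance.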
The content above was collected and summarized by 遇见数据集.