
open-llm-leaderboard-old/details_jan-hq__stealth-v1.3

Hugging Face | Updated 2024-03-01 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_jan-hq__stealth-v1.3
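The configuration names, split names, and parquet paths listed in the resource description appear to follow a regular pattern derived from the task key (for example `harness|arc:challenge|25`) and the run timestamp. The sketch below reproduces that pattern as inferred from the entries on this page; the function names are hypothetical and the rules are an assumption based on the listed values, not taken from the leaderboard's own code.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to its config name."""
    # Assumption: every character other than a letter, digit, or underscore becomes "_".
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp such as '2024-03-01T13:33:32.733968' to its split name."""
    # Dashes and colons become underscores; the fractional-second dot is kept.
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_glob(task_key: str, run_timestamp: str) -> str:
    """Build the glob listed under `path:` for one task and one run."""
    # In file names only the colons are replaced, and they become dashes.
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


if __name__ == "__main__":
    run = "2024-03-01T13:33:32.733968"
    print(config_name("harness|arc:challenge|25"))
    print(split_name(run))
    print(parquet_glob("harness|arc:challenge|25", run))
```

These helpers are only meant for checking a config or split name against the listing before passing it to a loader; they do not contact the Hub.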
Resource description:
--- pretty_name: Evaluation run of jan-hq/stealth-v1.3 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [jan-hq/stealth-v1.3](https://huggingface.co/jan-hq/stealth-v1.3) on the [Open\ \ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_jan-hq__stealth-v1.3\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-03-01T13:33:32.733968](https://huggingface.co/datasets/open-llm-leaderboard/details_jan-hq__stealth-v1.3/blob/main/results_2024-03-01T13-33-32.733968.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6489306644624384,\n\ \ \"acc_stderr\": 0.032117814539989575,\n \"acc_norm\": 0.6488111440199534,\n\ \ \"acc_norm_stderr\": 0.03278124580734838,\n \"mc1\": 0.386780905752754,\n\ \ \"mc1_stderr\": 0.017048857010515107,\n \"mc2\": 0.5571199691389221,\n\ \ \"mc2_stderr\": 0.015289284314943528\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6416382252559727,\n \"acc_stderr\": 0.014012883334859859,\n\ \ \"acc_norm\": 0.6749146757679181,\n \"acc_norm_stderr\": 0.013688147309729122\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6824337781318462,\n\ \ \"acc_stderr\": 0.00464578304800458,\n \"acc_norm\": 0.8673571001792472,\n\ \ \"acc_norm_stderr\": 0.003384951803213478\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.29,\n \"acc_stderr\": 0.045604802157206845,\n \ \ \"acc_norm\": 0.29,\n \"acc_norm_stderr\": 0.045604802157206845\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6222222222222222,\n\ \ \"acc_stderr\": 0.04188307537595853,\n \"acc_norm\": 0.6222222222222222,\n\ \ \"acc_norm_stderr\": 0.04188307537595853\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7039473684210527,\n \"acc_stderr\": 0.03715062154998905,\n\ \ \"acc_norm\": 0.7039473684210527,\n \"acc_norm_stderr\": 0.03715062154998905\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.63,\n\ \ \"acc_stderr\": 0.04852365870939099,\n \"acc_norm\": 0.63,\n \ \ \"acc_norm_stderr\": 0.04852365870939099\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7056603773584905,\n \"acc_stderr\": 0.02804918631569525,\n\ \ \"acc_norm\": 0.7056603773584905,\n \"acc_norm_stderr\": 0.02804918631569525\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7708333333333334,\n\ \ \"acc_stderr\": 0.03514697467862388,\n \"acc_norm\": 0.7708333333333334,\n\ \ \"acc_norm_stderr\": 0.03514697467862388\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.5,\n \"acc_stderr\": 0.050251890762960605,\n \ \ \"acc_norm\": 0.5,\n \"acc_norm_stderr\": 0.050251890762960605\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.57,\n \"acc_stderr\": 0.04975698519562428,\n \"acc_norm\": 0.57,\n\ \ \"acc_norm_stderr\": 0.04975698519562428\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.04760952285695235,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.04760952285695235\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6589595375722543,\n\ \ \"acc_stderr\": 0.036146654241808254,\n \"acc_norm\": 0.6589595375722543,\n\ \ \"acc_norm_stderr\": 0.036146654241808254\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.4215686274509804,\n \"acc_stderr\": 0.04913595201274498,\n\ \ \"acc_norm\": 0.4215686274509804,\n \"acc_norm_stderr\": 0.04913595201274498\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.81,\n \"acc_stderr\": 0.03942772444036624,\n \"acc_norm\": 0.81,\n\ \ \"acc_norm_stderr\": 0.03942772444036624\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5659574468085107,\n \"acc_stderr\": 0.03240038086792747,\n\ \ \"acc_norm\": 0.5659574468085107,\n \"acc_norm_stderr\": 0.03240038086792747\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5,\n\ \ \"acc_stderr\": 0.047036043419179864,\n \"acc_norm\": 0.5,\n \ \ \"acc_norm_stderr\": 0.047036043419179864\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5172413793103449,\n \"acc_stderr\": 0.04164188720169375,\n\ \ \"acc_norm\": 0.5172413793103449,\n \"acc_norm_stderr\": 0.04164188720169375\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.42328042328042326,\n \"acc_stderr\": 0.025446365634406793,\n \"\ acc_norm\": 0.42328042328042326,\n \"acc_norm_stderr\": 0.025446365634406793\n\ \ 
},\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4444444444444444,\n\ \ \"acc_stderr\": 0.044444444444444495,\n \"acc_norm\": 0.4444444444444444,\n\ \ \"acc_norm_stderr\": 0.044444444444444495\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.33,\n \"acc_stderr\": 0.04725815626252604,\n \ \ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.04725815626252604\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7677419354838709,\n\ \ \"acc_stderr\": 0.024022256130308235,\n \"acc_norm\": 0.7677419354838709,\n\ \ \"acc_norm_stderr\": 0.024022256130308235\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.5123152709359606,\n \"acc_stderr\": 0.035169204442208966,\n\ \ \"acc_norm\": 0.5123152709359606,\n \"acc_norm_stderr\": 0.035169204442208966\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \"acc_norm\"\ : 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7636363636363637,\n \"acc_stderr\": 0.03317505930009181,\n\ \ \"acc_norm\": 0.7636363636363637,\n \"acc_norm_stderr\": 0.03317505930009181\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7828282828282829,\n \"acc_stderr\": 0.029376616484945633,\n \"\ acc_norm\": 0.7828282828282829,\n \"acc_norm_stderr\": 0.029376616484945633\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8911917098445595,\n \"acc_stderr\": 0.022473253332768776,\n\ \ \"acc_norm\": 0.8911917098445595,\n \"acc_norm_stderr\": 0.022473253332768776\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6512820512820513,\n \"acc_stderr\": 0.02416278028401772,\n \ \ \"acc_norm\": 0.6512820512820513,\n \"acc_norm_stderr\": 0.02416278028401772\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 
0.34444444444444444,\n \"acc_stderr\": 0.02897264888484427,\n \ \ \"acc_norm\": 0.34444444444444444,\n \"acc_norm_stderr\": 0.02897264888484427\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6428571428571429,\n \"acc_stderr\": 0.031124619309328177,\n\ \ \"acc_norm\": 0.6428571428571429,\n \"acc_norm_stderr\": 0.031124619309328177\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.32450331125827814,\n \"acc_stderr\": 0.038227469376587525,\n \"\ acc_norm\": 0.32450331125827814,\n \"acc_norm_stderr\": 0.038227469376587525\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8330275229357799,\n \"acc_stderr\": 0.01599015488507337,\n \"\ acc_norm\": 0.8330275229357799,\n \"acc_norm_stderr\": 0.01599015488507337\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5046296296296297,\n \"acc_stderr\": 0.03409825519163572,\n \"\ acc_norm\": 0.5046296296296297,\n \"acc_norm_stderr\": 0.03409825519163572\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8137254901960784,\n \"acc_stderr\": 0.027325470966716312,\n \"\ acc_norm\": 0.8137254901960784,\n \"acc_norm_stderr\": 0.027325470966716312\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7932489451476793,\n \"acc_stderr\": 0.026361651668389094,\n \ \ \"acc_norm\": 0.7932489451476793,\n \"acc_norm_stderr\": 0.026361651668389094\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6905829596412556,\n\ \ \"acc_stderr\": 0.031024411740572213,\n \"acc_norm\": 0.6905829596412556,\n\ \ \"acc_norm_stderr\": 0.031024411740572213\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7480916030534351,\n \"acc_stderr\": 0.03807387116306086,\n\ \ \"acc_norm\": 0.7480916030534351,\n \"acc_norm_stderr\": 0.03807387116306086\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.8016528925619835,\n \"acc_stderr\": 
0.03640118271990947,\n \"\ acc_norm\": 0.8016528925619835,\n \"acc_norm_stderr\": 0.03640118271990947\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7962962962962963,\n\ \ \"acc_stderr\": 0.03893542518824847,\n \"acc_norm\": 0.7962962962962963,\n\ \ \"acc_norm_stderr\": 0.03893542518824847\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7852760736196319,\n \"acc_stderr\": 0.032262193772867744,\n\ \ \"acc_norm\": 0.7852760736196319,\n \"acc_norm_stderr\": 0.032262193772867744\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5089285714285714,\n\ \ \"acc_stderr\": 0.04745033255489123,\n \"acc_norm\": 0.5089285714285714,\n\ \ \"acc_norm_stderr\": 0.04745033255489123\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7864077669902912,\n \"acc_stderr\": 0.040580420156460344,\n\ \ \"acc_norm\": 0.7864077669902912,\n \"acc_norm_stderr\": 0.040580420156460344\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8717948717948718,\n\ \ \"acc_stderr\": 0.021901905115073325,\n \"acc_norm\": 0.8717948717948718,\n\ \ \"acc_norm_stderr\": 0.021901905115073325\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.71,\n \"acc_stderr\": 0.045604802157206845,\n \ \ \"acc_norm\": 0.71,\n \"acc_norm_stderr\": 0.045604802157206845\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8288633461047255,\n\ \ \"acc_stderr\": 0.0134682016140663,\n \"acc_norm\": 0.8288633461047255,\n\ \ \"acc_norm_stderr\": 0.0134682016140663\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7369942196531792,\n \"acc_stderr\": 0.02370309952525817,\n\ \ \"acc_norm\": 0.7369942196531792,\n \"acc_norm_stderr\": 0.02370309952525817\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.4223463687150838,\n\ \ \"acc_stderr\": 0.01651959427529712,\n \"acc_norm\": 0.4223463687150838,\n\ \ \"acc_norm_stderr\": 0.01651959427529712\n },\n 
\"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7156862745098039,\n \"acc_stderr\": 0.025829163272757482,\n\ \ \"acc_norm\": 0.7156862745098039,\n \"acc_norm_stderr\": 0.025829163272757482\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7041800643086816,\n\ \ \"acc_stderr\": 0.025922371788818763,\n \"acc_norm\": 0.7041800643086816,\n\ \ \"acc_norm_stderr\": 0.025922371788818763\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7530864197530864,\n \"acc_stderr\": 0.02399350170904211,\n\ \ \"acc_norm\": 0.7530864197530864,\n \"acc_norm_stderr\": 0.02399350170904211\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.48226950354609927,\n \"acc_stderr\": 0.02980873964223777,\n \ \ \"acc_norm\": 0.48226950354609927,\n \"acc_norm_stderr\": 0.02980873964223777\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4654498044328553,\n\ \ \"acc_stderr\": 0.012739711554045704,\n \"acc_norm\": 0.4654498044328553,\n\ \ \"acc_norm_stderr\": 0.012739711554045704\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6838235294117647,\n \"acc_stderr\": 0.028245687391462927,\n\ \ \"acc_norm\": 0.6838235294117647,\n \"acc_norm_stderr\": 0.028245687391462927\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6813725490196079,\n \"acc_stderr\": 0.01885008469646872,\n \ \ \"acc_norm\": 0.6813725490196079,\n \"acc_norm_stderr\": 0.01885008469646872\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6545454545454545,\n\ \ \"acc_stderr\": 0.04554619617541054,\n \"acc_norm\": 0.6545454545454545,\n\ \ \"acc_norm_stderr\": 0.04554619617541054\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7183673469387755,\n \"acc_stderr\": 0.02879518557429129,\n\ \ \"acc_norm\": 0.7183673469387755,\n \"acc_norm_stderr\": 0.02879518557429129\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.835820895522388,\n\ \ 
\"acc_stderr\": 0.026193923544454115,\n \"acc_norm\": 0.835820895522388,\n\ \ \"acc_norm_stderr\": 0.026193923544454115\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.86,\n \"acc_stderr\": 0.03487350880197769,\n \ \ \"acc_norm\": 0.86,\n \"acc_norm_stderr\": 0.03487350880197769\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5240963855421686,\n\ \ \"acc_stderr\": 0.03887971849597264,\n \"acc_norm\": 0.5240963855421686,\n\ \ \"acc_norm_stderr\": 0.03887971849597264\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8421052631578947,\n \"acc_stderr\": 0.027966785859160893,\n\ \ \"acc_norm\": 0.8421052631578947,\n \"acc_norm_stderr\": 0.027966785859160893\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.386780905752754,\n\ \ \"mc1_stderr\": 0.017048857010515107,\n \"mc2\": 0.5571199691389221,\n\ \ \"mc2_stderr\": 0.015289284314943528\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.8074191002367798,\n \"acc_stderr\": 0.011082538847491906\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.7156937073540561,\n \ \ \"acc_stderr\": 0.01242507818839599\n }\n}\n```" repo_url: https://huggingface.co/jan-hq/stealth-v1.3 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|arc:challenge|25_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|arc:challenge|25_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-03-01T13-33-32.733968.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|gsm8k|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|gsm8k|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - 
'**/details_harness|gsm8k|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hellaswag|10_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hellaswag|10_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T07-33-07.818995.parquet' - 
'**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T07-33-07.818995.parquet' - 
'**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-14T07-33-07.818995.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - 
'**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-01T13-33-32.733968.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-01T13-33-32.733968.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-01T13-33-32.733968.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-01T13-33-32.733968.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-01T13-33-32.733968.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-01T13-33-32.733968.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-01T13-33-32.733968.parquet' 
- config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - 
'**/details_harness|hendrycksTest-college_computer_science|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - 
'**/details_harness|hendrycksTest-computer_security|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_14T07_33_07.818995 
path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - 
'**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-14T07-33-07.818995.parquet' 
- split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - 
'**/details_harness|hendrycksTest-human_aging|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-01T13-33-32.733968.parquet' - config_name: 
harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-management|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-management|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-marketing|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - 
'**/details_harness|hendrycksTest-professional_psychology|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-security_studies|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-01T13-33-32.733968.parquet' - config_name: 
harness_hendrycksTest_virology_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-virology|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-01T13-33-32.733968.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|truthfulqa:mc|0_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|truthfulqa:mc|0_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-03-01T13-33-32.733968.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_14T07_33_07.818995 path: - '**/details_harness|winogrande|5_2024-01-14T07-33-07.818995.parquet' - split: 2024_03_01T13_33_32.733968 path: - '**/details_harness|winogrande|5_2024-03-01T13-33-32.733968.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-03-01T13-33-32.733968.parquet' - config_name: results data_files: - split: 2024_01_14T07_33_07.818995 path: - results_2024-01-14T07-33-07.818995.parquet - split: 2024_03_01T13_33_32.733968 path: - results_2024-03-01T13-33-32.733968.parquet - split: latest path: - results_2024-03-01T13-33-32.733968.parquet --- # Dataset Card for Evaluation run of jan-hq/stealth-v1.3 <!-- Provide a quick summary of the dataset. 
-->

Dataset automatically created during the evaluation run of model [jan-hq/stealth-v1.3](https://huggingface.co/jan-hq/stealth-v1.3) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# Splits defined for each configuration are the two run timestamps and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_jan-hq__stealth-v1.3",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-03-01T13:33:32.733968](https://huggingface.co/datasets/open-llm-leaderboard/details_jan-hq__stealth-v1.3/blob/main/results_2024-03-01T13-33-32.733968.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You can find each in the "results" configuration and in the "latest" split for each eval):

```python
{ "all": { "acc": 0.6489306644624384, "acc_stderr": 0.032117814539989575, "acc_norm": 0.6488111440199534, "acc_norm_stderr": 0.03278124580734838, "mc1": 0.386780905752754, "mc1_stderr": 0.017048857010515107, "mc2": 0.5571199691389221, "mc2_stderr": 0.015289284314943528 }, "harness|arc:challenge|25": { "acc": 0.6416382252559727, "acc_stderr": 0.014012883334859859, "acc_norm": 0.6749146757679181, "acc_norm_stderr": 0.013688147309729122 }, "harness|hellaswag|10": { "acc": 0.6824337781318462, "acc_stderr": 0.00464578304800458, "acc_norm": 0.8673571001792472, "acc_norm_stderr": 0.003384951803213478 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.29, "acc_stderr": 0.045604802157206845, "acc_norm": 0.29, "acc_norm_stderr": 0.045604802157206845 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6222222222222222, "acc_stderr": 0.04188307537595853, "acc_norm": 0.6222222222222222, "acc_norm_stderr": 0.04188307537595853 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.7039473684210527, "acc_stderr": 0.03715062154998905, "acc_norm": 0.7039473684210527, "acc_norm_stderr": 0.03715062154998905 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.63, "acc_stderr": 0.04852365870939099, "acc_norm": 0.63, "acc_norm_stderr": 0.04852365870939099 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7056603773584905, "acc_stderr": 0.02804918631569525, "acc_norm": 0.7056603773584905, "acc_norm_stderr": 0.02804918631569525 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7708333333333334, "acc_stderr": 0.03514697467862388, "acc_norm": 0.7708333333333334, "acc_norm_stderr": 0.03514697467862388 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.5, "acc_stderr": 0.050251890762960605, "acc_norm": 0.5, "acc_norm_stderr": 0.050251890762960605 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.57, "acc_stderr": 0.04975698519562428, "acc_norm": 0.57, "acc_norm_stderr":
0.04975698519562428 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.34, "acc_stderr": 0.04760952285695235, "acc_norm": 0.34, "acc_norm_stderr": 0.04760952285695235 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6589595375722543, "acc_stderr": 0.036146654241808254, "acc_norm": 0.6589595375722543, "acc_norm_stderr": 0.036146654241808254 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.4215686274509804, "acc_stderr": 0.04913595201274498, "acc_norm": 0.4215686274509804, "acc_norm_stderr": 0.04913595201274498 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.81, "acc_stderr": 0.03942772444036624, "acc_norm": 0.81, "acc_norm_stderr": 0.03942772444036624 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5659574468085107, "acc_stderr": 0.03240038086792747, "acc_norm": 0.5659574468085107, "acc_norm_stderr": 0.03240038086792747 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.5, "acc_stderr": 0.047036043419179864, "acc_norm": 0.5, "acc_norm_stderr": 0.047036043419179864 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5172413793103449, "acc_stderr": 0.04164188720169375, "acc_norm": 0.5172413793103449, "acc_norm_stderr": 0.04164188720169375 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.42328042328042326, "acc_stderr": 0.025446365634406793, "acc_norm": 0.42328042328042326, "acc_norm_stderr": 0.025446365634406793 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4444444444444444, "acc_stderr": 0.044444444444444495, "acc_norm": 0.4444444444444444, "acc_norm_stderr": 0.044444444444444495 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.33, "acc_stderr": 0.04725815626252604, "acc_norm": 0.33, "acc_norm_stderr": 0.04725815626252604 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7677419354838709, "acc_stderr": 0.024022256130308235, "acc_norm": 0.7677419354838709, "acc_norm_stderr": 0.024022256130308235 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.5123152709359606, "acc_stderr": 0.035169204442208966, "acc_norm": 0.5123152709359606, "acc_norm_stderr": 0.035169204442208966 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7636363636363637, "acc_stderr": 0.03317505930009181, "acc_norm": 0.7636363636363637, "acc_norm_stderr": 0.03317505930009181 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7828282828282829, "acc_stderr": 0.029376616484945633, "acc_norm": 0.7828282828282829, "acc_norm_stderr": 0.029376616484945633 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8911917098445595, "acc_stderr": 0.022473253332768776, "acc_norm": 0.8911917098445595, "acc_norm_stderr": 0.022473253332768776 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6512820512820513, "acc_stderr": 0.02416278028401772, "acc_norm": 0.6512820512820513, "acc_norm_stderr": 0.02416278028401772 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.34444444444444444, "acc_stderr": 0.02897264888484427, "acc_norm": 0.34444444444444444, "acc_norm_stderr": 0.02897264888484427 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6428571428571429, "acc_stderr": 0.031124619309328177, "acc_norm": 0.6428571428571429, "acc_norm_stderr": 0.031124619309328177 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.32450331125827814, "acc_stderr": 0.038227469376587525, "acc_norm": 0.32450331125827814, "acc_norm_stderr": 0.038227469376587525 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8330275229357799, "acc_stderr": 0.01599015488507337, "acc_norm": 0.8330275229357799, "acc_norm_stderr": 0.01599015488507337 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5046296296296297, "acc_stderr": 0.03409825519163572, "acc_norm": 0.5046296296296297, 
"acc_norm_stderr": 0.03409825519163572 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8137254901960784, "acc_stderr": 0.027325470966716312, "acc_norm": 0.8137254901960784, "acc_norm_stderr": 0.027325470966716312 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7932489451476793, "acc_stderr": 0.026361651668389094, "acc_norm": 0.7932489451476793, "acc_norm_stderr": 0.026361651668389094 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6905829596412556, "acc_stderr": 0.031024411740572213, "acc_norm": 0.6905829596412556, "acc_norm_stderr": 0.031024411740572213 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7480916030534351, "acc_stderr": 0.03807387116306086, "acc_norm": 0.7480916030534351, "acc_norm_stderr": 0.03807387116306086 }, "harness|hendrycksTest-international_law|5": { "acc": 0.8016528925619835, "acc_stderr": 0.03640118271990947, "acc_norm": 0.8016528925619835, "acc_norm_stderr": 0.03640118271990947 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7962962962962963, "acc_stderr": 0.03893542518824847, "acc_norm": 0.7962962962962963, "acc_norm_stderr": 0.03893542518824847 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7852760736196319, "acc_stderr": 0.032262193772867744, "acc_norm": 0.7852760736196319, "acc_norm_stderr": 0.032262193772867744 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.5089285714285714, "acc_stderr": 0.04745033255489123, "acc_norm": 0.5089285714285714, "acc_norm_stderr": 0.04745033255489123 }, "harness|hendrycksTest-management|5": { "acc": 0.7864077669902912, "acc_stderr": 0.040580420156460344, "acc_norm": 0.7864077669902912, "acc_norm_stderr": 0.040580420156460344 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8717948717948718, "acc_stderr": 0.021901905115073325, "acc_norm": 0.8717948717948718, "acc_norm_stderr": 0.021901905115073325 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.71, "acc_stderr": 0.045604802157206845, "acc_norm": 0.71, 
"acc_norm_stderr": 0.045604802157206845 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8288633461047255, "acc_stderr": 0.0134682016140663, "acc_norm": 0.8288633461047255, "acc_norm_stderr": 0.0134682016140663 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7369942196531792, "acc_stderr": 0.02370309952525817, "acc_norm": 0.7369942196531792, "acc_norm_stderr": 0.02370309952525817 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.4223463687150838, "acc_stderr": 0.01651959427529712, "acc_norm": 0.4223463687150838, "acc_norm_stderr": 0.01651959427529712 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7156862745098039, "acc_stderr": 0.025829163272757482, "acc_norm": 0.7156862745098039, "acc_norm_stderr": 0.025829163272757482 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7041800643086816, "acc_stderr": 0.025922371788818763, "acc_norm": 0.7041800643086816, "acc_norm_stderr": 0.025922371788818763 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7530864197530864, "acc_stderr": 0.02399350170904211, "acc_norm": 0.7530864197530864, "acc_norm_stderr": 0.02399350170904211 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.48226950354609927, "acc_stderr": 0.02980873964223777, "acc_norm": 0.48226950354609927, "acc_norm_stderr": 0.02980873964223777 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4654498044328553, "acc_stderr": 0.012739711554045704, "acc_norm": 0.4654498044328553, "acc_norm_stderr": 0.012739711554045704 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6838235294117647, "acc_stderr": 0.028245687391462927, "acc_norm": 0.6838235294117647, "acc_norm_stderr": 0.028245687391462927 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6813725490196079, "acc_stderr": 0.01885008469646872, "acc_norm": 0.6813725490196079, "acc_norm_stderr": 0.01885008469646872 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6545454545454545, "acc_stderr": 0.04554619617541054, "acc_norm": 
0.6545454545454545, "acc_norm_stderr": 0.04554619617541054 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7183673469387755, "acc_stderr": 0.02879518557429129, "acc_norm": 0.7183673469387755, "acc_norm_stderr": 0.02879518557429129 }, "harness|hendrycksTest-sociology|5": { "acc": 0.835820895522388, "acc_stderr": 0.026193923544454115, "acc_norm": 0.835820895522388, "acc_norm_stderr": 0.026193923544454115 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.86, "acc_stderr": 0.03487350880197769, "acc_norm": 0.86, "acc_norm_stderr": 0.03487350880197769 }, "harness|hendrycksTest-virology|5": { "acc": 0.5240963855421686, "acc_stderr": 0.03887971849597264, "acc_norm": 0.5240963855421686, "acc_norm_stderr": 0.03887971849597264 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8421052631578947, "acc_stderr": 0.027966785859160893, "acc_norm": 0.8421052631578947, "acc_norm_stderr": 0.027966785859160893 }, "harness|truthfulqa:mc|0": { "mc1": 0.386780905752754, "mc1_stderr": 0.017048857010515107, "mc2": 0.5571199691389221, "mc2_stderr": 0.015289284314943528 }, "harness|winogrande|5": { "acc": 0.8074191002367798, "acc_stderr": 0.011082538847491906 }, "harness|gsm8k|5": { "acc": 0.7156937073540561, "acc_stderr": 0.01242507818839599 } }
```

## Dataset Details

### Dataset Description

<!-- Provide a longer summary of what this dataset is. -->

- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]

### Dataset Sources [optional]

<!-- Provide the basic links for the dataset. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the dataset is intended to be used. -->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
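Provider:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model jan-hq/stealth-v1.3 on the Open LLM Leaderboard.

Dataset structure

  • The dataset contains 63 configurations, each corresponding to one evaluation task.
  • The dataset was created from 2 runs. Each run appears as a separate split in every configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.

Data loading example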

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_jan-hq__stealth-v1.3",
    "harness_winogrande_5",
    split="train",
)
```
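The configuration name passed to `load_dataset` differs from the task key used in the results JSON (`harness_winogrande_5` versus `harness|winogrande|5`). The sketch below shows one way to convert between them. Only the winogrande pairing appears in the card; applying the same substitution to `:` and `-` in other keys is an assumption, and `config_name` is a hypothetical helper, not part of any library.

```python
import re


def config_name(results_key: str) -> str:
    """Convert a results key such as 'harness|winogrande|5' to a config name.

    Assumption: '|', ':' and '-' are all replaced by '_'. Only the
    winogrande example is confirmed by the dataset card.
    """
    return re.sub(r"[|:\-]", "_", results_key)


if __name__ == "__main__":
    for key in ("harness|winogrande|5", "harness|arc:challenge|25"):
        print(key, "->", config_name(key))
```

If a derived name is rejected, `datasets.get_dataset_config_names` can list the configurations that actually exist in the repository.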

Latest results

The latest results come from run 2024-03-01T13:33:32.733968. The aggregate ("all") entry reports acc 0.6489, acc_norm 0.6488, mc1 0.3868, and mc2 0.5571 (rounded to four decimal places). The full per-task JSON is given in the dataset card above.

Collected summary
Dataset introduction
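Construction method
In large language model evaluation, precise measurement of model performance is key to technical progress. This dataset comes from the Open LLM Leaderboard's evaluation of the jan-hq/stealth-v1.3 model and is generated by an automated pipeline. It comprises 63 configurations, each corresponding to one evaluation task, such as the ARC Challenge, HellaSwag, and GSM8K. The data come from two independent runs. Each run's results are stored as a split named after the run's timestamp, while the "train" split always points to the latest results. In addition, a configuration named "results" aggregates the metrics of all runs and provides the basis for the Leaderboard's ranking computation. The dataset files are stored in Parquet format, which supports efficient reading and processing.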
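Usage
Researchers can load the data for a specific task with the Hugging Face Datasets library. For example, calling `load_dataset` with a configuration name such as "harness_winogrande_5" and the split "train" returns the latest evaluation results. To access an earlier run, use the split named after that run's timestamp. The "results" configuration provides the aggregated overall metrics, while the detailed records under each task configuration allow fine-grained error analysis. This flexible access makes the dataset a useful tool for benchmarking model performance and diagnosing model capabilities.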
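As a minimal sketch of the access pattern described above, the helper below selects either the "train" split or a timestamp-named split. The split naming used here (hyphens and colons replaced by underscores) is an assumption that is not confirmed by this page, and `split_name_for_run` and `load_run` are hypothetical helpers. The default repository id is copied from the card's own loading example, although this page lists the dataset under open-llm-leaderboard-old.

```python
from typing import Optional

# Repository id as written in the card's loading example.
DEFAULT_REPO = "open-llm-leaderboard/details_jan-hq__stealth-v1.3"


def split_name_for_run(timestamp: str) -> str:
    """Derive a split name from a run timestamp.

    Assumption: '-' and ':' become '_', e.g.
    '2024-03-01T13:33:32.733968' -> '2024_03_01T13_33_32.733968'.
    """
    return timestamp.replace("-", "_").replace(":", "_")


def load_run(config: str, timestamp: Optional[str] = None, repo: str = DEFAULT_REPO):
    """Load one configuration; 'train' (latest) unless a timestamp is given."""
    # Imported lazily so the naming helper works without the library installed.
    from datasets import load_dataset

    split = "train" if timestamp is None else split_name_for_run(timestamp)
    return load_dataset(repo, config, split=split)


if __name__ == "__main__":
    print(split_name_for_run("2024-03-01T13:33:32.733968"))
```

Before relying on a derived split name, it is worth checking it against the list returned by `datasets.get_dataset_split_names(repo, config)`.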
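Background and challenges
Background overview
This dataset comes from the Open LLM Leaderboard evaluation project started by the Hugging Face community. It was created in 2024 by the Hugging Face team (contact: clementine@hf.co) to systematically evaluate how the jan-hq/stealth-v1.3 model performs on a range of natural language understanding and reasoning tasks. It covers the ARC Challenge, HellaSwag, GSM8K, WinoGrande, and HendrycksTest, which spans 57 subject areas, for a total of 63 configurations, each corresponding to one evaluation task. Across multiple runs (for example, the two evaluations in January and March 2024), the dataset stores detailed per-task metrics such as accuracy and standard error, giving the community transparent and reproducible results for comparing model capabilities. The dataset supports standardized evaluation of large language models and fair comparison between models, and in particular gives researchers quantitative evidence on core capabilities such as commonsense reasoning, mathematical reasoning, and knowledge question answering.
Current challenges
The main challenges in building this dataset are: 1) At the domain level, evaluating large language models must cover several capabilities (such as logical reasoning, mathematical calculation, and commonsense understanding), but existing benchmarks (such as HendrycksTest) struggle to balance subject depth against breadth. Accuracy on some tasks (such as high school physics and college mathematics) is only 30%-50%, which points to systematic weaknesses in specialized domain knowledge. 2) During construction, data consistency must be maintained across evaluation runs. Results from different timestamps (2024-01-14 and 2024-03-01) have to be managed through separate configurations and clear split identifiers (for example, "latest" pointing to the most recent results) to avoid version confusion. 3) The computation and normalization of metrics (such as acc_norm and mc2) must be unified to handle tasks that define accuracy differently (such as the multiple-choice metrics of TruthfulQA), and the effect of hyperparameters such as random seeds and few-shot settings on result stability must also be addressed.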
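The low-accuracy tasks mentioned in the first challenge can be picked out of a results dictionary with a simple filter. The sketch below uses a five-entry excerpt copied from the results JSON in the dataset card; `tasks_below` and the 0.5 threshold are illustrative choices, not part of the leaderboard tooling.

```python
# Excerpt of the per-task results reported in the dataset card.
RESULTS_EXCERPT = {
    "harness|hendrycksTest-college_mathematics|5": {"acc": 0.34},
    "harness|hendrycksTest-high_school_physics|5": {"acc": 0.32450331125827814},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.29},
    "harness|hendrycksTest-computer_security|5": {"acc": 0.81},
    "harness|gsm8k|5": {"acc": 0.7156937073540561},
}


def tasks_below(results: dict, threshold: float = 0.5) -> list:
    """Return (task, acc) pairs with acc below the threshold, lowest first."""
    rows = [
        (task, metrics["acc"])
        for task, metrics in results.items()
        if "acc" in metrics and metrics["acc"] < threshold
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for task, acc in tasks_below(RESULTS_EXCERPT):
        print(f"{acc:.3f}  {task}")
```

Entries without an `acc` field, such as the TruthfulQA entry that reports only `mc1` and `mc2`, are skipped by the `"acc" in metrics` check.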
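Common scenarios
Classic use cases
In academic work on evaluating large language models, this dataset serves as a standardized record of Open LLM Leaderboard evaluations and is widely used to measure a model's overall ability across tasks. Classic use cases cover the ARC Challenge, HellaSwag commonsense reasoning, GSM8K math problem solving, and the MMLU benchmark spanning 57 subjects. By loading the Parquet files of a specific configuration, researchers can systematically reproduce the model's zero-shot or few-shot performance and obtain a quantitative reference for model iteration.
Academic problems addressed
This dataset addresses the lack of unified, fine-grained result records in large language model evaluation. Traditional evaluations often report only top-level metrics, whereas this dataset stores detailed scores for each task configuration (accuracy, normalized accuracy, and their standard errors), so researchers can analyze a model's strengths and weaknesses in subareas such as reasoning, knowledge, and mathematics. This supports robustness analysis and weakness diagnosis, provides data-driven evidence for later optimization, and makes evaluation practice more rigorous and transparent.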
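The reported standard errors can be related to accuracy and sample size with a short calculation. The sketch below assumes each task's `acc_stderr` is the sample standard error of a mean of 0/1 scores, $\sqrt{p(1-p)/(n-1)}$. The source does not state this formula; it is an assumption that happens to be consistent with the figures in the card. Both function names are hypothetical.

```python
import math


def stderr_of_accuracy(acc: float, n: int) -> float:
    """Sample standard error of a mean of n binary scores (n - 1 denominator)."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def implied_sample_size(acc: float, stderr: float) -> float:
    """Invert the formula above to estimate how many items were scored."""
    return acc * (1.0 - acc) / stderr**2 + 1.0


if __name__ == "__main__":
    # Values copied from the abstract_algebra entry in the results JSON.
    n = implied_sample_size(0.29, 0.045604802157206845)
    print(f"implied number of scored items: {n:.2f}")
```

Under this assumption, the abstract_algebra and college_mathematics entries are both consistent with about 100 scored items, which explains why tasks with few questions carry standard errors of four to five percentage points.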
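Practical applications
In practice, model developers use this dataset for version iteration and competitive benchmarking. By comparing results across run timestamps, developers can track how model performance changes over time. For example, before deployment, the GSM8K and TruthfulQA results in this dataset allow a quick check of the model's mathematical reasoning and factual consistency, giving a reliable baseline for model selection in industrial settings such as customer service assistants and educational tools.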
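Recent research on the dataset
Latest research directions
This dataset focuses on standardized evaluation of large language models on the open leaderboard, covering the ARC Challenge, HellaSwag, MMLU, TruthfulQA, WinoGrande, and GSM8K. Current research centers on model performance in complex reasoning and knowledge integration, especially accuracy gains in mathematical reasoning (GSM8K) and commonsense reasoning (HellaSwag). A related topic is the community's attention to model safety and factual consistency, for example the truthfulness challenges revealed by the mc1 and mc2 scores in the TruthfulQA evaluation. By providing fine-grained evaluation results, the dataset serves as a key reference for model iteration and benchmark comparison, and supports transparent competition and reproducible research among open-source LLMs across diverse tasks.
The content above was collected and summarized by 遇见数据集.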