open-llm-leaderboard/details_AlanRobotics__nanit_v3.2
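Dataset overview
Dataset name: Evaluation run of AlanRobotics/nanit_v3.2
Dataset description: This dataset was created automatically during the evaluation run of the model AlanRobotics/nanit_v3.2 on the Open LLM Leaderboard.
Dataset composition:
- Number of configurations: 63
- Data source: a single run
- Splits: each configuration corresponds to one evaluation task and contains a split named after the run's timestamp. The "train" split points to the latest results.
- Additional configuration: "results" stores all the aggregated results of the run and is used to compute and display the aggregated metrics.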
Dataset loading example
    from datasets import load_dataset
    data = load_dataset("open-llm-leaderboard/details_AlanRobotics__nanit_v3.2", "harness_winogrande_5", split="train")
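The configuration name in the loading call ("harness_winogrande_5") looks like a results key of the form "harness|winogrande|5" with the separators replaced by underscores. The sketch below encodes that mapping. The replacement rule, and the helper names `task_key_to_config` and `load_latest_details`, are assumptions made for illustration and are not confirmed by the dataset card; check any derived name against the dataset's actual configuration list before relying on it.

```python
import re

REPO = "open-llm-leaderboard/details_AlanRobotics__nanit_v3.2"


def task_key_to_config(task_key: str) -> str:
    """Map a results key such as 'harness|winogrande|5' to a config name.

    Assumption: every character that is not a letter, digit, or underscore
    is replaced by an underscore.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def load_latest_details(task_key: str):
    """Load the "train" split (latest results) for one task. Needs network access."""
    # Imported lazily so that task_key_to_config can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO, task_key_to_config(task_key), split="train")


if __name__ == "__main__":
    print(task_key_to_config("harness|winogrande|5"))
```

Keeping the name mapping separate from the download makes it possible to check a derived name offline before issuing a network request.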
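Latest results summary
The dataset contains evaluation results for multiple tasks. The accuracy (acc) and standard error (acc_stderr) for some of these tasks are listed below: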
- harness|arc:challenge|25:
  - acc: 0.5691126279863481
  - acc_stderr: 0.014471133392642473
- harness|hellaswag|10:
  - acc: 0.5640310695080661
  - acc_stderr: 0.004948696280312426
- harness|hendrycksTest-abstract_algebra|5:
  - acc: 0.29
  - acc_stderr: 0.045604802157206845
- harness|hendrycksTest-anatomy|5:
  - acc: 0.4740740740740741
  - acc_stderr: 0.04313531696750575
- harness|hendrycksTest-astronomy|5:
  - acc: 0.5657894736842105
  - acc_stderr: 0.0403356566784832
- harness|hendrycksTest-business_ethics|5:
  - acc: 0.58
  - acc_stderr: 0.049604496374885836
- harness|hendrycksTest-clinical_knowledge|5:
  - acc: 0.6075471698113207
  - acc_stderr: 0.03005258057955785
- harness|hendrycksTest-college_biology|5:
  - acc: 0.6388888888888888
  - acc_stderr: 0.04016660030451233
- harness|hendrycksTest-college_chemistry|5:
  - acc: 0.43
  - acc_stderr: 0.04975698519562428
- harness|hendrycksTest-college_computer_science|5:
  - acc: 0.46
  - acc_stderr: 0.05009082659620332
- harness|hendrycksTest-college_mathematics|5:
  - acc: 0.38
  - acc_stderr: 0.048783173121456344
- harness|hendrycksTest-college_medicine|5:
  - acc: 0.6184971098265896
  - acc_stderr: 0.03703851193099521
- harness|hendrycksTest-college_physics|5:
  - acc: 0.3333333333333333
  - acc_stderr: 0.04690650298201943
- harness|hendrycksTest-computer_security|5:
  - acc: 0.74
  - acc_stderr: 0.04408440022768078
- harness|hendrycksTest-conceptual_physics|5:
  - acc: 0.5106382978723404
  - acc_stderr: 0.03267862331014063
- harness|hendrycksTest-econometrics|5:
  - acc: 0.37719298245614036
  - acc_stderr: 0.04559522141958216
- harness|hendrycksTest-electrical_engineering|5:
  - acc: 0.5724137931034483
  - acc_stderr: 0.04122737111370333
- harness|hendrycksTest-elementary_mathematics|5:
  - acc: 0.42857142857142855
  - acc_stderr: 0.025487187147859372
- harness|hendrycksTest-formal_logic|5:
  - acc: 0.3888888888888889
  - acc_stderr: 0.04360314860077459
- harness|hendrycksTest-global_facts|5:
  - acc: 0.35
  - acc_stderr: 0.047937248544110196
- harness|hendrycksTest-high_school_biology|5:
  - acc: 0.7129032258064516
  - acc_stderr: 0.025736542745594528
- harness|hendrycksTest-high_school_chemistry|5:
  - acc: 0.46798029556650245
  - acc_stderr: 0.035107665979592154
- harness|hendrycksTest-high_school_computer_science|5:
  - acc: 0.59
  - acc_stderr: 0.04943110704237102
- harness|hendrycksTest-high_school_european_history|5:
  - acc: 0.6606060606060606
  - acc_stderr: 0.036974422050315967
- harness|hendrycksTest-high_school_geography|5:
  - acc: 0.7626262626262627
  - acc_stderr: 0.0303137105381989
- harness|hendrycksTest-high_school_government_and_politics|5:
  - acc: 0.7772020725388601
  - acc_stderr: 0.03003114797764154
- harness|hendrycksTest-high_school_macroeconomics|5:
  - acc: 0.6076923076923076
  - acc_stderr: 0.02475600038213095
- harness|hendrycksTest-high_school_mathematics|5:
  - acc: 0.32222222222222224
  - acc_stderr: 0.028493465091028604
- harness|hendrycksTest-high_school_microeconomics|5:
  - acc: 0.6260504201680672
  - acc_stderr: 0.031429466378837076
- harness|hendrycksTest-high_school_physics|5:
  - acc: 0.3841059602649007
  - acc_stderr: 0.03971301814719197
- harness|hendrycksTest-high_school_psychology|5:
  - acc: 0.7981651376146789
  - acc_stderr: 0.017208579357787586
- harness|hendrycksTest-high_school_statistics|5:
  - acc: 0.4861111111111111
  - acc_stderr: 0.03408655867977748
- harness|hendrycksTest-high_school_us_history|5:
  - acc: 0.6715686274509803
  - acc_stderr: 0.032962451101722294
- harness|hendrycksTest-high_school_world_history|5:
  - acc: 0.7426160337552743
  - acc_stderr: 0.028458820991460288
- harness|hendrycksTest-human_aging|5:
  - acc: 0.6278026905829597
  - acc_stderr: 0.03244305283008731
- harness|hendrycksTest-human_sexuality|5:
  - acc: 0.6717557251908397
  - acc_stderr: 0.04118438565806298
- harness|hendrycksTest-international_law|5:
  - acc: 0.71900826446281
  - acc_stderr: 0.04103203830514512
- harness|hendrycksTest-jurisprudence|5:
  - acc: 0.7222222222222222
  - acc_stderr: 0.043300437496507437
- harness|hendrycksTest-logical_fallacies|5:
  - acc: 0.7484662576687117
  - acc_stderr: 0.03408997886857529
- harness|hendrycksTest-machine_learning|5:
  - acc: 0.5089285714285714
  - acc_stderr: 0.04745033255489123
- harness|hendrycksTest-management|5:
  - acc: 0.7475728155339806
  - acc_stderr: 0.04301250399690878
- harness|hendrycksTest-marketing|5:
  - acc: 0.8247863247863247
  - acc_stderr: 0.024904439098918242
- harness|hendrycksTest-medical_genetics|5:
  - acc: 0.61
  - acc_stderr: 0.04902071300001975
- harness|hendrycksTest-miscellaneous|5:
  - acc: 0.7100893997445722
  - acc_stderr: 0.016225017944770964
- harness|hendrycksTest-moral_disputes|5:
  - acc: 0.6445086705202312
  - acc_stderr: 0.025770292082977254
- harness|hendrycksTest-moral_scenarios|5:
  - acc: 0.264804469273743
  - acc_stderr: 0.014756906483260664
- harness|hendrycksTest-nutrition|5:
  - acc: 0.6503267973856209
  - acc_stderr: 0.027305308076274695
- harness|hendrycksTest-philosophy|5:
  - acc: 0.639871382636656
  - acc_stderr: 0.027264297599804015
- harness|hendrycksTest-prehistory|5:
  - acc: 0.6172839506172839
  - acc_stderr: 0.027044538138402612
- harness|hendrycksTest-professional_accounting|5:
  - acc: 0.44680851063829785
  - acc_stderr: 0.029658235097666904
- harness|hendrycksTest-professional_law|5:
  - acc: 0.423728813559322
  - acc_stderr: 0.012620785155886001
- harness|hendrycksTest-professional_medicine|5:
  - acc: 0.5036764705882353
  - acc_stderr: 0.030372015885428195
- harness|hendrycksTest-professional_psychology|5:
  - acc: 0.5571895424836601
  - acc_stderr: 0.020095083154577347
- harness|hendrycksTest-public_relations|5:
  - acc: 0.6727272727272727
  - acc_stderr: 0.0449429086625209
- harness|hendrycksTest-security_studies|5:
  - acc: 0.7142857142857143
  - acc_stderr: 0.028920583220675592
- harness|hendrycksTest-sociology|5:
  - acc: 0.7761194029850746
  - acc_
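
The acc_stderr values listed above appear consistent with the standard error of the mean of per-question 0/1 scores, computed with the sample variance (n - 1 in the denominator). That formula is an assumption and is not stated in this dataset card. The function names and the sample size of 100 used in the check are also illustrative assumptions. A minimal sketch of the arithmetic, using the listed abstract_algebra values (acc 0.29, acc_stderr 0.045604802157206845):

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores whose mean is acc.

    Uses the sample variance (n - 1 denominator), which reduces to
    sqrt(acc * (1 - acc) / (n - 1)).
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def implied_sample_size(acc: float, stderr: float) -> float:
    """Invert binomial_stderr: the n implied by a reported acc and stderr."""
    return acc * (1.0 - acc) / stderr**2 + 1.0


if __name__ == "__main__":
    acc, stderr = 0.29, 0.045604802157206845
    # Under the assumed formula, these two reported values imply about 100 questions.
    print(round(implied_sample_size(acc, stderr)))
    print(binomial_stderr(acc, 100))
```

Under this assumption, a larger acc_stderr mainly reflects a smaller question set, so per-task accuracies with a standard error near 0.05 should be compared with caution.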



