
open-llm-leaderboard/details_AlanRobotics__nanit_v3.2

Hugging Face | Updated 2024-04-23 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_AlanRobotics__nanit_v3.2
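Resource summary:
This dataset was created automatically during the evaluation of the model AlanRobotics/nanit_v3.2 on the Open LLM Leaderboard. It consists of 63 configurations, each corresponding to one evaluation task. It contains the results of 1 run, and each run is stored as a specific split in each configuration. The train split always points to the latest results. An additional results configuration stores the aggregated results of all runs, which are used to compute and display the aggregated metrics on the Open LLM Leaderboard. The README also provides an example of loading the details of a run with the `load_dataset` function from the `datasets` library.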
Provider:
open-llm-leaderboard
Summary of original information

Dataset overview

Dataset name: Evaluation run of AlanRobotics/nanit_v3.2

Dataset description: This dataset was created automatically during the evaluation run of the model AlanRobotics/nanit_v3.2 for the Open LLM Leaderboard.

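Dataset composition:

  • Number of configurations: 63
  • Data source: a single run
  • Data splits: each configuration corresponds to one evaluation task and contains a split named after the run's timestamp. The "train" split points to the latest results.
  • Additional configuration: "results" stores the aggregated results of all runs, which are used to compute and display the aggregated metrics.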
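The layout described above (per-task configurations plus one aggregated "results" configuration) can be explored with a short sketch. The helper names `split_configs`, `fetch_config_names`, and `load_aggregated_results` are hypothetical and introduced only for illustration; `get_dataset_config_names` and `load_dataset` are functions of the `datasets` library, and the "results" configuration and "train" split are the ones named in the text. The two network helpers are defined but not called here, since they need Hub access.

```python
REPO_ID = "open-llm-leaderboard/details_AlanRobotics__nanit_v3.2"


def split_configs(config_names):
    """Separate per-task configs from the aggregated "results" config."""
    task_configs = [name for name in config_names if name != "results"]
    has_results = "results" in config_names
    return task_configs, has_results


def fetch_config_names(repo_id=REPO_ID):
    """List the configs on the Hub (needs network and the `datasets` package)."""
    from datasets import get_dataset_config_names  # imported lazily on purpose

    return get_dataset_config_names(repo_id)


def load_aggregated_results(repo_id=REPO_ID):
    """Load the aggregated results; "train" is described as the latest run."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(repo_id, "results", split="train")


# Offline illustration with a hand-written list (not fetched from the Hub).
tasks, has_results = split_configs(["harness_winogrande_5", "results"])
print(tasks, has_results)
```

With network access, `split_configs(fetch_config_names())` would show how the configurations on the Hub divide into task configurations and the aggregated one.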

Dataset loading example

    from datasets import load_dataset

    data = load_dataset(
        "open-llm-leaderboard/details_AlanRobotics__nanit_v3.2",
        "harness_winogrande_5",
        split="train",
    )
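The loading example uses the configuration name "harness_winogrande_5", while the results summary identifies tasks with keys such as "harness|arc:challenge|25". A minimal sketch of converting one form to the other follows. The rule used here (replace every character other than a letter, digit, or underscore with an underscore) is an assumption inferred from that single configuration name, and `task_key_to_config` is a hypothetical helper, not part of any library. Check the result against the dataset's actual configuration list before relying on it.

```python
import re


def task_key_to_config(task_key: str) -> str:
    """Map a task key to a candidate config name (ASSUMED naming rule)."""
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


for key in ("harness|arc:challenge|25", "harness|hendrycksTest-anatomy|5"):
    print(key, "->", task_key_to_config(key))
```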

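Latest results summary

The dataset contains evaluation results for multiple tasks. The accuracy (acc) and standard error (acc_stderr) of some of the tasks are shown below as examples: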

  • harness|arc:challenge|25:

    • acc: 0.5691126279863481
    • acc_stderr: 0.014471133392642473
  • harness|hellaswag|10:

    • acc: 0.5640310695080661
    • acc_stderr: 0.004948696280312426
  • harness|hendrycksTest-abstract_algebra|5:

    • acc: 0.29
    • acc_stderr: 0.045604802157206845
  • harness|hendrycksTest-anatomy|5:

    • acc: 0.4740740740740741
    • acc_stderr: 0.04313531696750575
  • harness|hendrycksTest-astronomy|5:

    • acc: 0.5657894736842105
    • acc_stderr: 0.0403356566784832
  • harness|hendrycksTest-business_ethics|5:

    • acc: 0.58
    • acc_stderr: 0.049604496374885836
  • harness|hendrycksTest-clinical_knowledge|5:

    • acc: 0.6075471698113207
    • acc_stderr: 0.03005258057955785
  • harness|hendrycksTest-college_biology|5:

    • acc: 0.6388888888888888
    • acc_stderr: 0.04016660030451233
  • harness|hendrycksTest-college_chemistry|5:

    • acc: 0.43
    • acc_stderr: 0.04975698519562428
  • harness|hendrycksTest-college_computer_science|5:

    • acc: 0.46
    • acc_stderr: 0.05009082659620332
  • harness|hendrycksTest-college_mathematics|5:

    • acc: 0.38
    • acc_stderr: 0.048783173121456344
  • harness|hendrycksTest-college_medicine|5:

    • acc: 0.6184971098265896
    • acc_stderr: 0.03703851193099521
  • harness|hendrycksTest-college_physics|5:

    • acc: 0.3333333333333333
    • acc_stderr: 0.04690650298201943
  • harness|hendrycksTest-computer_security|5:

    • acc: 0.74
    • acc_stderr: 0.04408440022768078
  • harness|hendrycksTest-conceptual_physics|5:

    • acc: 0.5106382978723404
    • acc_stderr: 0.03267862331014063
  • harness|hendrycksTest-econometrics|5:

    • acc: 0.37719298245614036
    • acc_stderr: 0.04559522141958216
  • harness|hendrycksTest-electrical_engineering|5:

    • acc: 0.5724137931034483
    • acc_stderr: 0.04122737111370333
  • harness|hendrycksTest-elementary_mathematics|5:

    • acc: 0.42857142857142855
    • acc_stderr: 0.025487187147859372
  • harness|hendrycksTest-formal_logic|5:

    • acc: 0.3888888888888889
    • acc_stderr: 0.04360314860077459
  • harness|hendrycksTest-global_facts|5:

    • acc: 0.35
    • acc_stderr: 0.047937248544110196
  • harness|hendrycksTest-high_school_biology|5:

    • acc: 0.7129032258064516
    • acc_stderr: 0.025736542745594528
  • harness|hendrycksTest-high_school_chemistry|5:

    • acc: 0.46798029556650245
    • acc_stderr: 0.035107665979592154
  • harness|hendrycksTest-high_school_computer_science|5:

    • acc: 0.59
    • acc_stderr: 0.04943110704237102
  • harness|hendrycksTest-high_school_european_history|5:

    • acc: 0.6606060606060606
    • acc_stderr: 0.036974422050315967
  • harness|hendrycksTest-high_school_geography|5:

    • acc: 0.7626262626262627
    • acc_stderr: 0.0303137105381989
  • harness|hendrycksTest-high_school_government_and_politics|5:

    • acc: 0.7772020725388601
    • acc_stderr: 0.03003114797764154
  • harness|hendrycksTest-high_school_macroeconomics|5:

    • acc: 0.6076923076923076
    • acc_stderr: 0.02475600038213095
  • harness|hendrycksTest-high_school_mathematics|5:

    • acc: 0.32222222222222224
    • acc_stderr: 0.028493465091028604
  • harness|hendrycksTest-high_school_microeconomics|5:

    • acc: 0.6260504201680672
    • acc_stderr: 0.031429466378837076
  • harness|hendrycksTest-high_school_physics|5:

    • acc: 0.3841059602649007
    • acc_stderr: 0.03971301814719197
  • harness|hendrycksTest-high_school_psychology|5:

    • acc: 0.7981651376146789
    • acc_stderr: 0.017208579357787586
  • harness|hendrycksTest-high_school_statistics|5:

    • acc: 0.4861111111111111
    • acc_stderr: 0.03408655867977748
  • harness|hendrycksTest-high_school_us_history|5:

    • acc: 0.6715686274509803
    • acc_stderr: 0.032962451101722294
  • harness|hendrycksTest-high_school_world_history|5:

    • acc: 0.7426160337552743
    • acc_stderr: 0.028458820991460288
  • harness|hendrycksTest-human_aging|5:

    • acc: 0.6278026905829597
    • acc_stderr: 0.03244305283008731
  • harness|hendrycksTest-human_sexuality|5:

    • acc: 0.6717557251908397
    • acc_stderr: 0.04118438565806298
  • harness|hendrycksTest-international_law|5:

    • acc: 0.71900826446281
    • acc_stderr: 0.04103203830514512
  • harness|hendrycksTest-jurisprudence|5:

    • acc: 0.7222222222222222
    • acc_stderr: 0.043300437496507437
  • harness|hendrycksTest-logical_fallacies|5:

    • acc: 0.7484662576687117
    • acc_stderr: 0.03408997886857529
  • harness|hendrycksTest-machine_learning|5:

    • acc: 0.5089285714285714
    • acc_stderr: 0.04745033255489123
  • harness|hendrycksTest-management|5:

    • acc: 0.7475728155339806
    • acc_stderr: 0.04301250399690878
  • harness|hendrycksTest-marketing|5:

    • acc: 0.8247863247863247
    • acc_stderr: 0.024904439098918242
  • harness|hendrycksTest-medical_genetics|5:

    • acc: 0.61
    • acc_stderr: 0.04902071300001975
  • harness|hendrycksTest-miscellaneous|5:

    • acc: 0.7100893997445722
    • acc_stderr: 0.016225017944770964
  • harness|hendrycksTest-moral_disputes|5:

    • acc: 0.6445086705202312
    • acc_stderr: 0.025770292082977254
  • harness|hendrycksTest-moral_scenarios|5:

    • acc: 0.264804469273743
    • acc_stderr: 0.014756906483260664
  • harness|hendrycksTest-nutrition|5:

    • acc: 0.6503267973856209
    • acc_stderr: 0.027305308076274695
  • harness|hendrycksTest-philosophy|5:

    • acc: 0.639871382636656
    • acc_stderr: 0.027264297599804015
  • harness|hendrycksTest-prehistory|5:

    • acc: 0.6172839506172839
    • acc_stderr: 0.027044538138402612
  • harness|hendrycksTest-professional_accounting|5:

    • acc: 0.44680851063829785
    • acc_stderr: 0.029658235097666904
  • harness|hendrycksTest-professional_law|5:

    • acc: 0.423728813559322
    • acc_stderr: 0.012620785155886001
  • harness|hendrycksTest-professional_medicine|5:

    • acc: 0.5036764705882353
    • acc_stderr: 0.030372015885428195
  • harness|hendrycksTest-professional_psychology|5:

    • acc: 0.5571895424836601
    • acc_stderr: 0.020095083154577347
  • harness|hendrycksTest-public_relations|5:

    • acc: 0.6727272727272727
    • acc_stderr: 0.0449429086625209
  • harness|hendrycksTest-security_studies|5:

    • acc: 0.7142857142857143
    • acc_stderr: 0.028920583220675592
  • harness|hendrycksTest-sociology|5:

    • acc: 0.7761194029850746
    • acc_
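
The acc and acc_stderr pairs listed above can be read together: the standard error gives a rough uncertainty band around each accuracy. The sketch below computes a normal-approximation 95% interval and, under the assumption that the standard error was computed as sqrt(acc * (1 - acc) / (n - 1)), recovers the number of questions n behind a score. That formula is an assumption, not something stated in the source, although it is consistent with the abstract_algebra values. Both function names are hypothetical.

```python
def confidence_interval(acc, acc_stderr, z=1.96):
    """Normal-approximation interval acc +/- z * acc_stderr, clipped to [0, 1]."""
    half_width = z * acc_stderr
    return max(0.0, acc - half_width), min(1.0, acc + half_width)


def implied_sample_size(acc, acc_stderr):
    """Solve stderr = sqrt(acc * (1 - acc) / (n - 1)) for n (ASSUMED formula)."""
    return round(acc * (1.0 - acc) / acc_stderr ** 2) + 1


# Values copied from the harness|hendrycksTest-abstract_algebra|5 entry.
acc, acc_stderr = 0.29, 0.045604802157206845
low, high = confidence_interval(acc, acc_stderr)
print(f"95% interval: [{low:.3f}, {high:.3f}]")  # [0.201, 0.379]
print("implied question count:", implied_sample_size(acc, acc_stderr))  # 100
```

For tasks with few questions the normal approximation is coarse, so the interval is best treated as an indication of scale rather than an exact bound.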