open-llm-leaderboard/details_chavinlo__alpaca-native
Dataset Overview
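This dataset was created automatically during the evaluation of the model chavinlo/alpaca-native for the Open LLM Leaderboard. It contains 64 configurations, each corresponding to one evaluation task. The dataset was built from 2 runs. Each run appears as a separate split within each configuration, and the split is named after the run's timestamp. The "train" split always points to the latest results.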
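The split names and parquet file names in this card both encode the run timestamp, with different separators. The sketch below shows the conversion as inferred from the entries in this card (for example, run 2023-09-21T20:23:20.255556). The helper names `split_name` and `file_stamp` are hypothetical, and the rule is an observation from the listed values, not a documented guarantee.

```python
def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the split name used in the configurations.

    Assumption: '-' and ':' both become '_', as seen in the listed splits.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def file_stamp(run_timestamp: str) -> str:
    """Turn a run timestamp into the suffix used in the parquet file names.

    Assumption: only ':' is replaced, by '-', as seen in the listed paths.
    """
    return run_timestamp.replace(":", "-")


run = "2023-09-21T20:23:20.255556"
print(split_name(run))
print(file_stamp(run))
```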
Dataset Structure
The dataset contains the following configurations:
- harness_arc_challenge_25
  - Split: 2023_09_21T20_23_20.255556
    - Path: **/details_harness|arc:challenge|25_2023-09-21T20-23-20.255556.parquet
  - Split: latest
    - Path: **/details_harness|arc:challenge|25_2023-09-21T20-23-20.255556.parquet
- harness_drop_3
  - Split: 2023_09_17T15_14_48.848140
    - Path: **/details_harness|drop|3_2023-09-17T15-14-48.848140.parquet
  - Split: latest
    - Path: **/details_harness|drop|3_2023-09-17T15-14-48.848140.parquet
- harness_gsm8k_5
  - Split: 2023_09_17T15_14_48.848140
    - Path: **/details_harness|gsm8k|5_2023-09-17T15-14-48.848140.parquet
  - Split: latest
    - Path: **/details_harness|gsm8k|5_2023-09-17T15-14-48.848140.parquet
- harness_hellaswag_10
  - Split: 2023_09_21T20_23_20.255556
    - Path: **/details_harness|hellaswag|10_2023-09-21T20-23-20.255556.parquet
  - Split: latest
    - Path: **/details_harness|hellaswag|10_2023-09-21T20-23-20.255556.parquet
- harness_hendrycksTest_5
  - Split: 2023_09_21T20_23_20.255556
    - Paths:
      - **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-anatomy|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-astronomy|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-business_ethics|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-college_biology|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-college_chemistry|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-college_computer_science|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-college_mathematics|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-college_medicine|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-college_physics|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-computer_security|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-conceptual_physics|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-econometrics|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-electrical_engineering|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-formal_logic|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-global_facts|5_2023-09-21T20-23-20.255556.parquet
      - **/details_harness|hendrycksTest-high_school_biology|5_2023-09-21T20-23-20.255556.parquet
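One configuration from the list above could be loaded with the Hugging Face `datasets` library, roughly as sketched here. The helper names `config_name` and `load_details` are hypothetical. The mapping from a task key such as `harness|arc:challenge|25` to a configuration name is inferred from the single-task entries in the list and does not cover the grouped `harness_hendrycksTest_5` configuration. Using `"latest"` as the default split is also an assumption based on the splits shown.

```python
REPO_ID = "open-llm-leaderboard/details_chavinlo__alpaca-native"


def config_name(task_key: str) -> str:
    """Map a task key like 'harness|gsm8k|5' to 'harness_gsm8k_5'.

    Assumption: '|' and ':' are both replaced by '_', matching the
    single-task configurations listed in this card.
    """
    return task_key.replace("|", "_").replace(":", "_")


def load_details(task_key: str, split: str = "latest"):
    """Load the per-sample details for one task (requires network access)."""
    # Imported lazily so config_name() works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task_key), split=split)


# Example call (downloads data from the Hugging Face Hub):
#   details = load_details("harness|gsm8k|5")
print(config_name("harness|arc:challenge|25"))
```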
Latest Results
These are the latest results from run 2023-09-21T20:23:20.255556:
{
    "all": { "acc": 0.41927597389078103, "acc_stderr": 0.035302205782678654, "acc_norm": 0.42235476219088836, "acc_norm_stderr": 0.035290265393035695, "mc1": 0.2484700122399021, "mc1_stderr": 0.015127427096520674, "mc2": 0.3759916250814691, "mc2_stderr": 0.015396201572279763 },
    "harness|arc:challenge|25": { "acc": 0.5127986348122867, "acc_stderr": 0.014606603181012538, "acc_norm": 0.5204778156996587, "acc_norm_stderr": 0.01459913135303501 },
    "harness|hellaswag|10": { "acc": 0.5959968133837881, "acc_stderr": 0.004896952378506926, "acc_norm": 0.7699661422027485, "acc_norm_stderr": 0.004199941217549452 },
    "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.28, "acc_stderr": 0.04512608598542129, "acc_norm": 0.28, "acc_norm_stderr": 0.04512608598542129 },
    "harness|hendrycksTest-anatomy|5": { "acc": 0.45925925925925926, "acc_stderr": 0.04304979692464242, "acc_norm": 0.45925925925925926, "acc_norm_stderr": 0.04304979692464242 },
    "harness|hendrycksTest-astronomy|5": { "acc": 0.3618421052631579, "acc_stderr": 0.03910525752849724, "acc_norm": 0.3618421052631579, "acc_norm_stderr": 0.03910525752849724 },
    "harness|hendrycksTest-business_ethics|5": { "acc": 0.46, "acc_stderr": 0.05009082659620333, "acc_norm": 0.46, "acc_norm_stderr": 0.05009082659620333 },
    "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.44150943396226416, "acc_stderr": 0.030561590426731837, "acc_norm": 0.44150943396226416, "acc_norm_stderr": 0.030561590426731837 },
    "harness|hendrycksTest-college_biology|5": { "acc": 0.3819444444444444, "acc_stderr": 0.040629907841466674, "acc_norm": 0.3819444444444444, "acc_norm_stderr": 0.040629907841466674 },
    "harness|hendrycksTest-college_chemistry|5": { "acc": 0.32, "acc_stderr": 0.046882617226215034, "acc_norm": 0.32, "acc_norm_stderr": 0.046882617226215034 },
    "harness|hendrycksTest-college_computer_science|5": { "acc": 0.41, "acc_stderr": 0.04943110704237102, "acc_norm": 0.41, "acc_norm_stderr": 0.04943110704237102 },
    "harness|hendrycksTest-college_mathematics|5": { "acc": 0.35, "acc_stderr": 0.047937248544110196, "acc_norm": 0.35, "acc_norm_stderr": 0.047937248544110196 },
    "harness|hendrycksTest-college_medicine|5": { "acc": 0.3815028901734104, "acc_stderr": 0.03703851193099521, "acc_norm": 0.3815028901734104, "acc_norm_stderr": 0.03703851193099521 },
    "harness|hendrycksTest-college_physics|5": { "acc": 0.21568627450980393, "acc_stderr": 0.04092563958237656, "acc_norm": 0.21568627450980393, "acc_norm_stderr": 0.04092563958237656 },
    "harness|hendrycksTest-computer_security|5": { "acc": 0.54, "acc_stderr": 0.05009082659620332, "acc_norm": 0.54, "acc_norm_stderr": 0.05009082659620332 },
    "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.37446808510638296, "acc_stderr": 0.03163910665367291, "acc_norm": 0.37446808510638296, "acc_norm_stderr": 0.03163910665367291 },
    "harness|hendrycksTest-econometrics|5": { "acc": 0.2543859649122807, "acc_stderr": 0.040969851398436716, "acc_norm": 0.2543859649122807, "acc_norm_stderr": 0.040969851398436716 },
    "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.3655

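Each task in the results reports an accuracy together with its standard error, so a rough uncertainty range can be derived from the two. The sketch below computes an approximate 95% interval as acc ± 1.96 × acc_stderr for three entries copied from the results. The normal approximation and the helper name `confidence_interval` are assumptions made for illustration and are not part of the leaderboard's own reporting.

```python
# Values copied from the results listed in this card.
results = {
    "harness|arc:challenge|25": {
        "acc": 0.5127986348122867,
        "acc_stderr": 0.014606603181012538,
    },
    "harness|hellaswag|10": {
        "acc": 0.5959968133837881,
        "acc_stderr": 0.004896952378506926,
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.28,
        "acc_stderr": 0.04512608598542129,
    },
}

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def confidence_interval(metrics: dict, z: float = Z_95) -> tuple:
    """Return (low, high) for accuracy under a normal approximation."""
    half_width = z * metrics["acc_stderr"]
    return metrics["acc"] - half_width, metrics["acc"] + half_width


intervals = {task: confidence_interval(m) for task, m in results.items()}
for task, (low, high) in intervals.items():
    print(f"{task}: acc in [{low:.3f}, {high:.3f}]")
```

Tasks with few questions, such as the individual hendrycksTest subjects, have much larger standard errors than hellaswag, so their intervals are correspondingly wider.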


