open-llm-leaderboard-old/details_AA051611__A0118
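Dataset Overview
Dataset Source
This dataset was created automatically during the evaluation run of the model AA051611/A0118 on the Open LLM Leaderboard.
Dataset Structure
The dataset contains 63 configurations, each corresponding to one evaluation task. It was created from 2 runs. Each run appears in every configuration as a separate split named after the run's timestamp, and the "train" split always points to the latest results.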
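The relationship between timestamp-named splits and the latest run can be sketched as follows. This is a minimal illustration, not the leaderboard's own code: the helper names `latest_run_split` and `list_splits` are hypothetical, the split-name pattern (`2024_01_18T23_48_21.810095`) is an assumption about how the run timestamp is encoded, and the earlier timestamp in the example is made up. `get_dataset_split_names` is a real function in the `datasets` library, imported lazily so the rest of the sketch runs offline.

```python
from datetime import datetime

# Assumption: split names encode the run timestamp with underscores,
# e.g. "2024_01_18T23_48_21.810095".
SPLIT_FORMAT = "%Y_%m_%dT%H_%M_%S.%f"


def latest_run_split(split_names):
    """Return the timestamp-named split with the most recent run time."""
    runs = {}
    for name in split_names:
        try:
            runs[name] = datetime.strptime(name, SPLIT_FORMAT)
        except ValueError:
            continue  # skip non-timestamp aliases such as "train"
    if not runs:
        raise ValueError("no timestamp-named splits found")
    return max(runs, key=runs.get)


def list_splits(repo_id, config):
    """List the splits of one configuration (needs network access)."""
    from datasets import get_dataset_split_names  # lazy import

    return get_dataset_split_names(repo_id, config)


# The first timestamp is invented for illustration only.
example_splits = ["2024_01_17T10_00_00.000000", "2024_01_18T23_48_21.810095", "train"]
newest = latest_run_split(example_splits)
```

If the assumption about split names holds, the split returned here should hold the same rows as the "train" split.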
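Additional Configuration
An additional configuration, "results", stores the aggregated results of all runs. These are used to compute and display the aggregated metrics on the Open LLM Leaderboard.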
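To make the idea of aggregating per-task results concrete, here is a rough sketch that averages one metric over all tasks sharing a name prefix. It assumes result keys of the form `harness|<task>|<n_shot>`, as in the latest results reported in this card, and it uses a plain unweighted mean. The leaderboard's actual aggregation may differ, and `mean_metric` is a hypothetical helper name. The sample values are a small subset copied from those latest results.

```python
from statistics import mean


def mean_metric(results, task_prefix, metric="acc"):
    """Unweighted mean of `metric` over tasks whose name starts with `task_prefix`."""
    values = []
    for key, metrics in results.items():
        parts = key.split("|")
        if len(parts) != 3:
            continue  # skips the "all" summary entry
        _, task, _ = parts
        if task.startswith(task_prefix) and metric in metrics:
            values.append(metrics[metric])
    if not values:
        raise ValueError(f"no tasks match prefix {task_prefix!r}")
    return mean(values)


# Subset of the per-task accuracies listed in this card.
sample = {
    "all": {"acc": 0.6750935567286499},
    "harness|arc:challenge|25": {"acc": 0.5691126279863481},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.37},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.6},
    "harness|hendrycksTest-business_ethics|5": {"acc": 0.72},
}

# Mean over the three hendrycksTest entries in the sample only.
mmlu_subset_mean = mean_metric(sample, "hendrycksTest-")
```

Because the sample covers only three of the hendrycksTest tasks, the value it produces is not a leaderboard score.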
Data Loading Example

    from datasets import load_dataset

    # Load the latest results ("train" split) for the 5-shot Winogrande task.
    data = load_dataset("open-llm-leaderboard/details_AA051611__A0118", "harness_winogrande_5", split="train")
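The repository name in the loading call appears to be derived from the model identifier by replacing `/` with `__` and prefixing `details_`. The sketch below wraps that pattern in two hypothetical helpers, `details_repo_id` and `load_details`. The naming rule is inferred from this card alone and may not hold for every model. `load_dataset` is imported lazily so the name helper can be used without network access.

```python
def details_repo_id(model_id, org="open-llm-leaderboard"):
    """Build the details repository name for a model such as "AA051611/A0118"."""
    # Assumption: "/" in the model id becomes "__" in the repository name.
    return f"{org}/details_{model_id.replace('/', '__')}"


def load_details(model_id, config="harness_winogrande_5", split="train"):
    """Load one configuration of a model's details dataset (needs network access)."""
    from datasets import load_dataset  # lazy import

    return load_dataset(details_repo_id(model_id), config, split=split)


repo = details_repo_id("AA051611/A0118")
```

Passing a run's timestamp-named split instead of "train" would select that specific run, provided the split exists in the chosen configuration.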
Latest Results
These are the latest results, from the run of 2024-01-18T23:48:21.810095:

    {
        "all": { "acc": 0.6750935567286499, "acc_stderr": 0.03150224444254494, "acc_norm": 0.6839013238259298, "acc_norm_stderr": 0.03214560635872275, "mc1": 0.390452876376989, "mc1_stderr": 0.01707823074343144, "mc2": 0.5579325936654852, "mc2_stderr": 0.015526306494139296 },
        "harness|arc:challenge|25": { "acc": 0.5691126279863481, "acc_stderr": 0.014471133392642476, "acc_norm": 0.5921501706484642, "acc_norm_stderr": 0.0143610972884497 },
        "harness|hellaswag|10": { "acc": 0.6517625970922127, "acc_stderr": 0.004754380554929216, "acc_norm": 0.8378809002190799, "acc_norm_stderr": 0.0036780679944244557 },
        "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.37, "acc_stderr": 0.048523658709391, "acc_norm": 0.37, "acc_norm_stderr": 0.048523658709391 },
        "harness|hendrycksTest-anatomy|5": { "acc": 0.6, "acc_stderr": 0.04232073695151589, "acc_norm": 0.6, "acc_norm_stderr": 0.04232073695151589 },
        "harness|hendrycksTest-astronomy|5": { "acc": 0.7960526315789473, "acc_stderr": 0.0327900040631005, "acc_norm": 0.7960526315789473, "acc_norm_stderr": 0.0327900040631005 },
        "harness|hendrycksTest-business_ethics|5": { "acc": 0.72, "acc_stderr": 0.04512608598542128, "acc_norm": 0.72, "acc_norm_stderr": 0.04512608598542128 },
        "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7245283018867924, "acc_stderr": 0.027495663683724053, "acc_norm": 0.7245283018867924, "acc_norm_stderr": 0.027495663683724053 },
        "harness|hendrycksTest-college_biology|5": { "acc": 0.7847222222222222, "acc_stderr": 0.03437079344106135, "acc_norm": 0.7847222222222222, "acc_norm_stderr": 0.03437079344106135 },
        "harness|hendrycksTest-college_chemistry|5": { "acc": 0.51, "acc_stderr": 0.05024183937956912, "acc_norm": 0.51, "acc_norm_stderr": 0.05024183937956912 },
        "harness|hendrycksTest-college_computer_science|5": { "acc": 0.55, "acc_stderr": 0.049999999999999996, "acc_norm": 0.55, "acc_norm_stderr": 0.049999999999999996 },
        "harness|hendrycksTest-college_mathematics|5": { "acc": 0.41, "acc_stderr": 0.049431107042371025, "acc_norm": 0.41, "acc_norm_stderr": 0.049431107042371025 },
        "harness|hendrycksTest-college_medicine|5": { "acc": 0.6647398843930635, "acc_stderr": 0.03599586301247078, "acc_norm": 0.6647398843930635, "acc_norm_stderr": 0.03599586301247078 },
        "harness|hendrycksTest-college_physics|5": { "acc": 0.45098039215686275, "acc_stderr": 0.04951218252396264, "acc_norm": 0.45098039215686275, "acc_norm_stderr": 0.04951218252396264 },
        "harness|hendrycksTest-computer_security|5": { "acc": 0.79, "acc_stderr": 0.04093601807403326, "acc_norm": 0.79, "acc_norm_stderr": 0.04093601807403326 },
        "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.6978723404255319, "acc_stderr": 0.030017554471880557, "acc_norm": 0.6978723404255319, "acc_norm_stderr": 0.030017554471880557 },
        "harness|hendrycksTest-econometrics|5": { "acc": 0.5175438596491229, "acc_stderr": 0.04700708033551038, "acc_norm": 0.5175438596491229, "acc_norm_stderr": 0.04700708033551038 },
        "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.6827586206896552, "acc_stderr": 0.03878352372138622, "acc_norm": 0.6827586206896552, "acc_norm_stderr": 0.03878352372138622 },
        "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.5952380952380952, "acc_stderr": 0.025279850397404904, "acc_norm": 0.5952380952380952, "acc_norm_stderr": 0.025279850397404904 },
        "harness|hendrycksTest-formal_logic|5": { "acc": 0.5396825396825397, "acc_stderr": 0.04458029125470973, "acc_norm": 0.5396825396825397, "acc_norm_stderr": 0.04458029125470973 },
        "harness|hendrycksTest-global_facts|5": { "acc": 0.47, "acc_stderr": 0.05016135580465919, "acc_norm": 0.47, "acc_norm_stderr": 0.05016135580465919 },
        "harness|hendrycksTest-high_school_biology|5": { "acc": 0.8161290322580645, "acc_stderr": 0.02203721734026782, "acc_norm": 0.8161290322580645, "acc_norm_stderr": 0.02203721734026782 },
        "harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5862068965517241, "acc_stderr": 0.03465304488406795, "acc_norm": 0.5862068965517241, "acc_norm_stderr": 0.03465304488406795 },
        "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.74, "acc_stderr": 0.0440844002276808, "acc_norm": 0.74, "acc_norm_stderr": 0.0440844002276808 },
        "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7212121212121212, "acc_stderr": 0.03501438706296781, "acc_norm": 0.7212121212121212, "acc_norm_stderr": 0.03501438706296781 },
        "harness|hendrycksTest-high_school_geography|5": { "acc": 0.8939393939393939, "acc_stderr": 0.021938047738853137, "acc_norm": 0.8939393939393939, "acc_norm_stderr": 0.021938047738853137 },
        "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.9067357512953368, "acc_stderr": 0.020986854593289733, "acc_norm": 0.9067357512953368, "acc_norm_stderr": 0.020986854593289733 },
        "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.7256410256410256, "acc_stderr": 0.022622765767493214, "acc_norm": 0.7256410256410256, "acc_norm_stderr": 0.022622765767493214 },
        "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3851851851851852, "acc_stderr": 0.029670906124630882, "acc_norm": 0.3851851851851852, "acc_norm_stderr": 0.029670906124630882 },
        "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.7521008403361344, "acc_stderr": 0.028047967224176896, "acc_norm": 0.



