
open-llm-leaderboard/details_Devio__test100

Source: Hugging Face | Updated: 2023-09-02 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_Devio__test100
Resource summary:
This dataset was created automatically during the evaluation run of the model Devio/test100 for the Open LLM Leaderboard. It contains 61 configurations, each corresponding to one evaluation task. The dataset was created from 1 run; each run appears as a split in every configuration, and the split is named after the run's timestamp. The "train" split always points to the latest results. In addition, a "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.
Provider:
open-llm-leaderboard
Original information summary

Dataset overview

This dataset was created automatically during the evaluation run of the model Devio/test100 for the Open LLM Leaderboard.

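Dataset composition

  • The dataset contains 61 configurations, each corresponding to one evaluation task.
  • The dataset was created from 1 run. Each run appears as a specific split in every configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.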
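The timestamp-based split naming described above can be sketched as follows. This is a minimal illustration, not the leaderboard's own code: the function name `timestamp_to_split_name` is hypothetical, and the replacement rule (`-` and `:` become `_`) is inferred from the single run shown on this page.

```python
def timestamp_to_split_name(run_timestamp: str) -> str:
    """Convert a run timestamp into the split name used on this page.

    Assumption: '-' and ':' are replaced by '_' and every other
    character (including 'T' and the fractional seconds) is unchanged.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


# The only run listed on this page.
print(timestamp_to_split_name("2023-09-02T17:29:14.649417"))
# prints 2023_09_02T17_29_14.649417
```

The printed value matches the split name listed for each configuration on this page.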

Data loading example

```python
from datasets import load_dataset

# Load the latest details for the TruthfulQA (0-shot) configuration.
data = load_dataset(
    "open-llm-leaderboard/details_Devio__test100",
    "harness_truthfulqa_mc_0",
    split="train",
)
```

Latest results

The latest results below are from run 2023-09-02T17:29:14.649417:

```python
{
    "all": { "acc": 0.2766497501153852, "acc_stderr": 0.031976576858827, "acc_norm": 0.2798843599297305, "acc_norm_stderr": 0.031981630759923114, "mc1": 0.19706242350061198, "mc1_stderr": 0.013925080734473736, "mc2": 0.3401260823172781, "mc2_stderr": 0.014194140794117406 },
    "harness|arc:challenge|25": { "acc": 0.3370307167235495, "acc_stderr": 0.013813476652902272, "acc_norm": 0.37372013651877134, "acc_norm_stderr": 0.014137708601759098 },
    "harness|hellaswag|10": { "acc": 0.4312885879306911, "acc_stderr": 0.004942440746328494, "acc_norm": 0.5854411471818363, "acc_norm_stderr": 0.0049163889621423205 },
    "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.25, "acc_stderr": 0.04351941398892446, "acc_norm": 0.25, "acc_norm_stderr": 0.04351941398892446 },
    "harness|hendrycksTest-anatomy|5": { "acc": 0.22962962962962963, "acc_stderr": 0.03633384414073461, "acc_norm": 0.22962962962962963, "acc_norm_stderr": 0.03633384414073461 },
    "harness|hendrycksTest-astronomy|5": { "acc": 0.3355263157894737, "acc_stderr": 0.03842498559395268, "acc_norm": 0.3355263157894737, "acc_norm_stderr": 0.03842498559395268 },
    "harness|hendrycksTest-business_ethics|5": { "acc": 0.21, "acc_stderr": 0.040936018074033256, "acc_norm": 0.21, "acc_norm_stderr": 0.040936018074033256 },
    "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.2943396226415094, "acc_stderr": 0.028049186315695248, "acc_norm": 0.2943396226415094, "acc_norm_stderr": 0.028049186315695248 },
    "harness|hendrycksTest-college_biology|5": { "acc": 0.24305555555555555, "acc_stderr": 0.03586879280080341, "acc_norm": 0.24305555555555555, "acc_norm_stderr": 0.03586879280080341 },
    "harness|hendrycksTest-college_chemistry|5": { "acc": 0.42, "acc_stderr": 0.049604496374885836, "acc_norm": 0.42, "acc_norm_stderr": 0.049604496374885836 },
    "harness|hendrycksTest-college_computer_science|5": { "acc": 0.34, "acc_stderr": 0.04760952285695235, "acc_norm": 0.34, "acc_norm_stderr": 0.04760952285695235 },
    "harness|hendrycksTest-college_mathematics|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 },
    "harness|hendrycksTest-college_medicine|5": { "acc": 0.32947976878612717, "acc_stderr": 0.03583901754736411, "acc_norm": 0.32947976878612717, "acc_norm_stderr": 0.03583901754736411 },
    "harness|hendrycksTest-college_physics|5": { "acc": 0.37254901960784315, "acc_stderr": 0.04810840148082633, "acc_norm": 0.37254901960784315, "acc_norm_stderr": 0.04810840148082633 },
    "harness|hendrycksTest-computer_security|5": { "acc": 0.18, "acc_stderr": 0.038612291966536955, "acc_norm": 0.18, "acc_norm_stderr": 0.038612291966536955 },
    "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.2170212765957447, "acc_stderr": 0.026947483121496217, "acc_norm": 0.2170212765957447, "acc_norm_stderr": 0.026947483121496217 },
    "harness|hendrycksTest-econometrics|5": { "acc": 0.23684210526315788, "acc_stderr": 0.039994238792813344, "acc_norm": 0.23684210526315788, "acc_norm_stderr": 0.039994238792813344 },
    "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.2413793103448276, "acc_stderr": 0.03565998174135302, "acc_norm": 0.2413793103448276, "acc_norm_stderr": 0.03565998174135302 },
    "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.24603174603174602, "acc_stderr": 0.022182037202948368, "acc_norm": 0.24603174603174602, "acc_norm_stderr": 0.022182037202948368 },
    "harness|hendrycksTest-formal_logic|5": { "acc": 0.36507936507936506, "acc_stderr": 0.04306241259127153, "acc_norm": 0.36507936507936506, "acc_norm_stderr": 0.04306241259127153 },
    "harness|hendrycksTest-global_facts|5": { "acc": 0.25, "acc_stderr": 0.04351941398892446, "acc_norm": 0.25, "acc_norm_stderr": 0.04351941398892446 },
    "harness|hendrycksTest-high_school_biology|5": { "acc": 0.3225806451612903, "acc_stderr": 0.02659308451657228, "acc_norm": 0.3225806451612903, "acc_norm_stderr": 0.02659308451657228 },
    "harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.28078817733990147, "acc_stderr": 0.03161856335358609, "acc_norm": 0.28078817733990147, "acc_norm_stderr": 0.03161856335358609 },
    "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.18, "acc_stderr": 0.03861229196653694, "acc_norm": 0.18, "acc_norm_stderr": 0.03861229196653694 },
    "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.2545454545454545, "acc_stderr": 0.03401506715249039, "acc_norm": 0.2545454545454545, "acc_norm_stderr": 0.03401506715249039 },
    "harness|hendrycksTest-high_school_geography|5": { "acc": 0.35353535353535354, "acc_stderr": 0.03406086723547153, "acc_norm": 0.35353535353535354, "acc_norm_stderr": 0.03406086723547153 },
    "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.36787564766839376, "acc_stderr": 0.03480175668466036, "acc_norm": 0.36787564766839376, "acc_norm_stderr": 0.03480175668466036 },
    "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.3641025641025641, "acc_stderr": 0.02439667298509477, "acc_norm": 0.3641025641025641, "acc_norm_stderr": 0.02439667298509477 },
    "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.25925925925925924, "acc_stderr": 0.02671924078371216, "acc_norm": 0.25925925925925924, "acc_norm_stderr": 0.02671924078371216 },
    "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.3487394957983193, "acc_stderr": 0.03095663632856655, "acc_norm": 0.3487394957983193, "acc_norm_stderr": 0.03095663632856655 },
    "harness|hendrycksTest-high_school_physics|5": { "acc": 0.33112582781456956, "acc_stderr": 0.038425817186598696, "acc_norm": 0.33112582781456956, "acc_norm_stderr": 0.038425817186598696 },
    "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.3522935779816514, "acc_stderr": 0.020480568843998997, "acc_norm": 0.3522935779816514, "acc_norm_stderr": 0.020480568843998997 },
    "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4722222222222222, "acc_stderr": 0.0340470532865388, "acc_norm": 0.4722222222222222, "acc_norm_stderr": 0.0340470532865388 },
    "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.2549019607843137, "acc_stderr": 0.030587591351604246, "acc_norm": 0.2549019607843137, "acc_norm_stderr": 0.030587591351604246 },
    "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.20253164556962025, "acc_stderr": 0.026160568246601457, "acc_norm": 0.20253164556962025, "acc_norm_stderr": 0.026160568246601457 },
    "harness|hendrycksTest-human_aging|5": { "acc": 0.10762331838565023, "acc_stderr": 0.020799400082879997, "acc_norm": 0.10762331838565023, "acc_norm_stderr": 0.020799400082879997 },
    "harness|hendrycksTest-human_sexuality|5": { "acc": 0.2824427480916031, "acc_stderr": 0.03948406125768361, "acc_norm": 0.2824427480916031, "acc_norm_stderr": 0.03948406125768361 },
    "harness|hendrycksTest-international_law|5": { "acc": 0.18181818181818182, "acc_stderr": 0.035208939510976554, "acc_norm": 0.18181818181818182, "acc_norm_stderr": 0.035208939510976554 },
    "harness|hendrycksTest-jurisprudence|5": { "acc": 0.21296296296296297, "acc_stderr": 0.0395783547198098, "acc_norm": 0.21296296296296297, "acc_norm_stderr": 0.0395783547198098 },
    "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.2331288343558282, "acc_stderr": 0.033220157957767414, "acc_norm": 0.2331288343558282, "acc_norm_stderr": 0.033220157957767414 },
    "harness|hendrycksTest-machine_learning|5": { "acc": 0.16071428571428573, "acc_stderr": 0.03485946096475741, "acc_norm": 0.16071428571428573, "acc_norm_stderr": 0.03485946096475741 },
    "harness|hendrycksTest-management|5": { "acc": 0.3786407766990291, "acc_stderr": 0.04802694698258972, "acc_norm": 0.3786407766990291, "acc_norm_stderr": 0.04802694698258972 },
    "harness|hendrycksTest-marketing|5": { "acc": 0.19658119658119658, "acc_stderr": 0.02603538609895129, "acc_norm": 0.19658119658119658, "acc_norm_stderr": 0.02603538609895129 },
    "harness|hendrycksTest-medical_genetics|5": { "acc": 0.24, "acc_stderr": 0.04292346959909281, "acc_norm": 0.24, "acc_norm_stderr": 0.04292346959909281 },
    "harness|hendrycksTest-miscellaneous|5": { "acc": 0.20434227330779056, "acc_stderr": 0.0144191239809319, "acc_norm": 0.20434227330779056, "acc_norm_stderr": 0.0144191239809319 },
    "harness|hendrycksTest-moral_disputes|5": { "acc": 0.2138728323699422, "acc_stderr": 0.022075709251757183, "acc_norm": 0.2138728323699422, "acc_norm_stderr": 0.022075709251757183 },
    "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.27039106145251396, "acc_stderr": 0.014854993938010102, "acc_norm": 0.27039106145251396, "acc_norm_stderr": 0.014854993938010102 },
    "harness|hendrycksTest-nutrition|5": { "acc": 0.2973856209150327, "acc_stderr": 0.02617390850671858, "acc_norm": 0.2973856209150327, "acc_norm_stderr": 0.02617390850671858 },
    "harness|hendrycksTest-philosophy|5": { "acc": 0.2540192926045016, "acc_stderr": 0.024723861504771696, "acc_norm": 0.2540192926045016, "acc_norm_stderr": 0.024723861504771696 },
    "harness|hendrycksTest-prehistory|5": { "acc": 0.22530864197530864, "acc_stderr": 0.023246202647819746, "acc_norm": 0.22530864197530864, "acc_norm_stderr": 0.023246202647819746 },
    "harness|hendrycksTest-professional_accounting|5": { "acc": 0.2375886524822695, "acc_stderr": 0.025389512552729906, "acc_norm": 0.2375886524822695, "acc_norm_stderr": 0.025389512552729906 },
    "harness|hendrycksTest-professional_law|5": { "acc": 0.24771838331160365, "acc_stderr": 0.011025499291443738, "acc_norm": 0.24771838331160365, "acc_norm_stderr": 0.011025499291443738 },
    "harness|hendrycksTest-professional_medicine|5": { "acc": 0.4485294117647059, "acc_stderr": 0.030211479609121593, "acc_norm": 0.4485294117647059, "acc_norm_stderr": 0.030211479609121593 },
    "harness|hendrycksTest-professional_psychology|5": { "acc": 0.2173202614379085, "acc_stderr": 0.01668482092914859, "acc_norm": 0.2173202614379085, "acc_norm_stderr": 0.01668482092914859 },
    "harness|hendrycksTest-public_relations|5": { "acc": 0.22727272727272727, "acc_stderr": 0.04013964554072774, "acc_norm": 0.22727272727272727, "acc_norm_stderr": 0.04013964554072774 },
    "harness|hendrycksTest-security_studies|5": { "acc": 0.4, "acc_stderr": 0.031362502409358936, "acc_norm": 0.4, "acc_norm_stderr": 0.031362502409358936 },
    "harness|hendrycksTest-sociology|5": { "acc": 0.2885572139303483, "acc_stderr": 0.03203841040213321, "acc_norm": 0.2885572139303483, "acc_norm_stderr": 0.03203841040213321 },
    "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.26, "acc_stderr": 0.04408440022768078, "acc_norm": 0.26, "acc_norm_stderr": 0.04408440022768078 },
    "harness|hendrycksTest-virology|5": { "acc": 0.1927710843373494, "acc_stderr": 0.030709824050565274, "acc_norm": 0.1927710843373494, "acc_norm_stderr": 0.030709824050565274 },
    "harness|hendrycksTest-world_religions|5": { "acc": 0.1695906432748538, "acc_stderr": 0.028782108105401712, "acc_norm": 0.1695906432748538, "acc_norm_stderr": 0.028782108105401712 },
    "harness|truthfulqa:mc|0": { "mc1": 0.19706242350061198, "mc1_stderr": 0.013925080734473736, "mc2": 0.3401260823172781, "mc2_stderr": 0.014194140794117406 }
}
```
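As a rough illustration of how per-task scores like these can be rolled up, the sketch below averages one metric over every task whose key starts with a given prefix. It is not the leaderboard's aggregation code: the helper `mean_metric` is hypothetical, and `results_excerpt` contains only four entries copied from the results above so that the example stays short.

```python
from statistics import mean

# A four-task excerpt of the results above; the full dictionary has the same shape.
results_excerpt = {
    "harness|arc:challenge|25": {"acc": 0.3370307167235495, "acc_norm": 0.37372013651877134},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.25, "acc_norm": 0.25},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.22962962962962963, "acc_norm": 0.22962962962962963},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.3355263157894737, "acc_norm": 0.3355263157894737},
}


def mean_metric(results: dict, task_prefix: str, metric: str = "acc") -> float:
    """Average `metric` over all tasks whose key starts with `task_prefix`."""
    values = [
        scores[metric]
        for task, scores in results.items()
        if task.startswith(task_prefix) and metric in scores
    ]
    if not values:
        raise ValueError(f"no task with prefix {task_prefix!r} reports {metric!r}")
    return mean(values)


# Mean accuracy over the three hendrycksTest tasks in the excerpt.
print(f"{mean_metric(results_excerpt, 'harness|hendrycksTest-'):.4f}")
# prints 0.2717
```

Running the same helper on the full dictionary would average all hendrycksTest entries; whether that equals the figure shown on the leaderboard depends on how the leaderboard weights tasks, which this page does not state.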

Configuration details

  • harness_arc_challenge_25

    • Splits:
      • 2023_09_02T17_29_14.649417
      • latest
    • Paths:
      • **/details_harness|arc:challenge|25_2023-09-02T17:29:14.649417.parquet
  • harness_hellaswag_10

    • Splits:
      • 2023_09_02T17_29_14.649417
      • latest
    • Paths:
      • **/details_harness|hellaswag|10_2023-09-02T17:29:14.649417.parquet
  • harness_hendrycksTest_5

    • Splits:
      • 2023_09_02T17_29_14.649417
    • Paths:
      • **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-anatomy|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-astronomy|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-business_ethics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-college_biology|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-college_chemistry|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-college_computer_science|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-college_mathematics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-college_medicine|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-college_physics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-computer_security|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-conceptual_physics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-econometrics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-electrical_engineering|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-formal_logic|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-global_facts|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_biology|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_european_history|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_geography|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_physics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_psychology|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_statistics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_us_history|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-high_school_world_history|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-human_aging|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-human_sexuality|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-international_law|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-jurisprudence|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-logical_fallacies|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-machine_learning|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-management|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-marketing|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-medical_genetics|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-miscellaneous|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-moral_disputes|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-moral_scenarios|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-nutrition|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-philosophy|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-prehistory|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-professional_accounting|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-professional_law|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-professional_medicine|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-professional_psychology|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-public_relations|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-security_studies|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-sociology|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-virology|5_2023-09-02T17:29:14.649417.parquet
      • **/details_harness|hendrycksTest-world_religions|5_2023-09-02T17:29:14.649417.parquet
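The configuration names on this page appear to be derived from the task keys in the results by replacing separator characters with underscores. The sketch below illustrates that mapping. The function name `task_key_to_config_name` is hypothetical; the handling of `|` and `:` is inferred from the names shown on this page, while the handling of `-` is an assumption, since no per-subject hendrycksTest configuration name is listed here.

```python
def task_key_to_config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    '|' and ':' -> '_' is inferred from names shown on this page.
    '-' -> '_' is an assumption; verify against the dataset's actual configs.
    """
    for separator in ("|", ":", "-"):
        task_key = task_key.replace(separator, "_")
    return task_key


for key in ("harness|arc:challenge|25", "harness|hellaswag|10", "harness|truthfulqa:mc|0"):
    print(key, "->", task_key_to_config_name(key))
```

For the three keys in the loop, the results are harness_arc_challenge_25, harness_hellaswag_10, and harness_truthfulqa_mc_0, which match the configuration names that appear on this page.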