
open-llm-leaderboard-old/details_heegyu__WizardVicuna-Uncensored-3B-0719

Hugging Face · Updated 2023-10-19 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_heegyu__WizardVicuna-Uncensored-3B-0719
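Resource summary:
This dataset was created automatically during the evaluation run of the model heegyu/WizardVicuna-Uncensored-3B-0719 on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to one evaluated task. The dataset was built from 2 runs; each run is stored as a separate split within each configuration, and the split is named with the run's timestamp. The "train" split always points to the latest results. An additional configuration named "results" stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.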
Provider:
open-llm-leaderboard-old
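Original information summary

Dataset overview

Dataset summary

This dataset was created automatically during the evaluation run of the model heegyu/WizardVicuna-Uncensored-3B-0719 on the Open LLM Leaderboard.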

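Dataset composition

The dataset consists of 64 configurations, each corresponding to one evaluated task. It was created from 2 runs; each run appears as a separate split within each configuration, and the split is named with the run's timestamp. The "train" split always points to the latest results.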
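The split names in the configuration listing appear to be the run timestamp with "-" and ":" replaced by "_". The sketch below encodes that rule as an assumption inferred from the listed names; the helper name `split_name_from_timestamp` is hypothetical and not part of any library.

```python
def split_name_from_timestamp(timestamp: str) -> str:
    """Turn a run timestamp into a split name.

    Assumed rule, inferred from the split names listed on this page:
    every '-' and ':' becomes '_', and the fractional seconds keep their '.'.
    """
    return timestamp.replace("-", "_").replace(":", "_")


# Timestamp of the latest run reported on this page.
latest_split = split_name_from_timestamp("2023-10-19T03:10:00.849734")
print(latest_split)
```

A name built this way could be passed as the `split` argument when one specific run is wanted rather than the latest one.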

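Results configuration

An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.

Data loading example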

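    from datasets import load_dataset

    # Repository ID as given in the original card; this page lists the
    # dataset under the open-llm-leaderboard-old organization.
    data = load_dataset(
        "open-llm-leaderboard/details_heegyu__WizardVicuna-Uncensored-3B-0719",
        "harness_winogrande_5",  # one configuration per evaluated task
        split="train",           # per the card, points to the latest results
    )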

Latest results

The latest results, from the run 2023-10-19T03:10:00.849734, are:

    {
        "all": {
            "em": 0.0032508389261744967,
            "em_stderr": 0.0005829486708558908,
            "f1": 0.05307046979865784,
            "f1_stderr": 0.0013744215109358906,
            "acc": 0.32454958283792285,
            "acc_stderr": 0.008214760837520624
        },
        "harness|drop|3": {
            "em": 0.0032508389261744967,
            "em_stderr": 0.0005829486708558908,
            "f1": 0.05307046979865784,
            "f1_stderr": 0.0013744215109358906
        },
        "harness|gsm8k|5": {
            "acc": 0.011372251705837756,
            "acc_stderr": 0.002920666198788741
        },
        "harness|winogrande|5": {
            "acc": 0.6377269139700079,
            "acc_stderr": 0.013508855476252508
        }
    }
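The "all" entry in these results is numerically consistent with an unweighted mean of each metric over the tasks that report it. The sketch below checks that reading against the reported numbers; the averaging rule is an assumption inferred from the values, not a documented formula, and `aggregate` is a hypothetical helper.

```python
from statistics import mean

# Per-task metrics copied from the latest results reported above.
task_results = {
    "harness|drop|3": {
        "em": 0.0032508389261744967,
        "f1": 0.05307046979865784,
    },
    "harness|gsm8k|5": {
        "acc": 0.011372251705837756,
        "acc_stderr": 0.002920666198788741,
    },
    "harness|winogrande|5": {
        "acc": 0.6377269139700079,
        "acc_stderr": 0.013508855476252508,
    },
}


def aggregate(results: dict) -> dict:
    """Average each metric over the tasks that report it (assumed rule)."""
    by_metric = {}
    for metrics in results.values():
        for name, value in metrics.items():
            by_metric.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in by_metric.items()}


overall = aggregate(task_results)
print(overall["acc"], overall["acc_stderr"])
```

Under this assumption the overall accuracy is just the midpoint of the GSM8K and WinoGrande accuracies, which is why it sits near 0.32 even though neither task scores close to that value.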

Configuration details

The configuration details of the dataset are listed below:

  • harness_arc_challenge_25

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|arc:challenge|25_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|arc:challenge|25_2023-07-24T10:29:51.933578.parquet
  • harness_drop_3

    • Split: 2023_10_19T03_10_00.849734
      • Path: **/details_harness|drop|3_2023-10-19T03-10-00.849734.parquet
    • Split: latest
      • Path: **/details_harness|drop|3_2023-10-19T03-10-00.849734.parquet
  • harness_gsm8k_5

    • Split: 2023_10_19T03_10_00.849734
      • Path: **/details_harness|gsm8k|5_2023-10-19T03-10-00.849734.parquet
    • Split: latest
      • Path: **/details_harness|gsm8k|5_2023-10-19T03-10-00.849734.parquet
  • harness_hellaswag_10

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hellaswag|10_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hellaswag|10_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_5

    • Split: 2023_07_24T10_29_51.933578
      • Paths:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-astronomy|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-business_ethics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_biology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_chemistry|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_computer_science|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_mathematics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_medicine|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_physics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-computer_security|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-conceptual_physics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-econometrics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-electrical_engineering|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-formal_logic|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-global_facts|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_biology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_european_history|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_geography|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_physics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_psychology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_statistics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_us_history|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_world_history|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-human_aging|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-human_sexuality|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-international_law|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-jurisprudence|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-logical_fallacies|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-machine_learning|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-management|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-marketing|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-medical_genetics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-miscellaneous|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-moral_disputes|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-moral_scenarios|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-nutrition|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-philosophy|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-prehistory|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-professional_accounting|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-professional_law|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-professional_medicine|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-professional_psychology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-public_relations|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-security_studies|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-sociology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-virology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-world_religions|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Paths:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-astronomy|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-business_ethics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_biology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_chemistry|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_computer_science|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_mathematics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_medicine|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-college_physics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-computer_security|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-conceptual_physics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-econometrics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-electrical_engineering|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-formal_logic|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-global_facts|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_biology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_european_history|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_geography|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_physics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_psychology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_statistics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_us_history|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-high_school_world_history|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-human_aging|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-human_sexuality|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-international_law|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-jurisprudence|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-logical_fallacies|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-machine_learning|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-management|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-marketing|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-medical_genetics|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-miscellaneous|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-moral_disputes|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-moral_scenarios|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-nutrition|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-philosophy|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-prehistory|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-professional_accounting|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-professional_law|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-professional_medicine|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-professional_psychology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-public_relations|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-security_studies|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-sociology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-virology|5_2023-07-24T10:29:51.933578.parquet
        • **/details_harness|hendrycksTest-world_religions|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_abstract_algebra_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_anatomy_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-anatomy|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-anatomy|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_astronomy_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-astronomy|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-astronomy|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_business_ethics_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_clinical_knowledge_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_college_biology_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-college_biology|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-college_biology|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_college_chemistry_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_college_computer_science_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_college_mathematics_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_college_medicine_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_college_physics_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-college_physics|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-college_physics|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_computer_security_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-computer_security|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-computer_security|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_conceptual_physics_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_econometrics_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-econometrics|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-econometrics|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_electrical_engineering_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-07-24T10:29:51.933578.parquet
  • harness_hendrycksTest_elementary_mathematics_5

    • Split: 2023_07_24T10_29_51.933578
      • Path: **/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-24T10:29:51.933578.parquet
    • Split: latest
      • Path: **/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-24T10:29:51.933578.parquet
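The paths in this listing follow a visible pattern: a task key between "|" separators, the few-shot count, then the run timestamp. The sketch below parses that pattern and rebuilds the configuration name; the regular expression and the helper `parse_details_path` are hypothetical, inferred only from the entries shown here. Note that the October paths write the time with "-" where the July paths use ":", so the timestamp is returned as found.

```python
import re

# Pattern inferred from the listed paths (an assumption, not a documented format):
#   details_harness|<task>|<shots>_<timestamp>.parquet
PATH_PATTERN = re.compile(
    r"details_harness\|(?P<task>[^|]+)\|(?P<shots>\d+)_(?P<stamp>.+)\.parquet$"
)


def parse_details_path(path: str) -> dict:
    """Split a details path into task, few-shot count, timestamp and config name."""
    match = PATH_PATTERN.search(path)
    if match is None:
        raise ValueError(f"unrecognized details path: {path!r}")
    task = match["task"]
    shots = int(match["shots"])
    # Config names in the listing replace ':' and '-' in the task key with '_'.
    config = "harness_" + re.sub(r"[:\-]", "_", task) + f"_{shots}"
    return {
        "task": task,
        "shots": shots,
        "timestamp": match["stamp"],
        "config": config,
    }


info = parse_details_path(
    "**/details_harness|arc:challenge|25_2023-07-24T10:29:51.933578.parquet"
)
print(info["config"], info["shots"], info["timestamp"])
```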
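Collected summary
Dataset introduction
Construction
This dataset was generated during the automated evaluation of the model heegyu/WizardVicuna-Uncensored-3B-0719 under the Open LLM Leaderboard framework. It consists of 64 configurations, each corresponding to one evaluated task, such as ARC Challenge, DROP, GSM8K, HellaSwag, and HendrycksTest, which covers 57 subjects. The data comes from two separate evaluation runs. The results of each run are stored as a split within the relevant configuration, named with the run's timestamp, while the 'train' split always points to the results of the latest run. An additional configuration named 'results' stores all aggregated evaluation metrics, which are used to compute and display overall performance on the Leaderboard.
Features
The dataset's core feature is that it systematically records the model's fine-grained performance across benchmarks that test different abilities. Each split in a configuration contains the model's raw scores on a given task (such as accuracy, F1, and exact match) together with their standard errors, which quantify how reliable the results are. Because the results of runs from different times are stored separately, the dataset supports tracking how the model's measured performance changes across runs. This structure lets researchers examine the model's strengths and weaknesses in reasoning, commonsense understanding, mathematical computation, and specialized knowledge, and gives targeted feedback for model improvement.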
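One way to use the reported standard errors is to turn a score into an approximate interval. The sketch below does this for the WinoGrande accuracy from the latest results, assuming a normal approximation with z = 1.96 for a roughly 95% interval; that choice and the helper `confidence_interval` are assumptions made here, not something the dataset card prescribes.

```python
def confidence_interval(score: float, stderr: float, z: float = 1.96):
    """Return (low, high) for score +/- z * stderr (normal approximation, assumed)."""
    margin = z * stderr
    return score - margin, score + margin


# WinoGrande accuracy and standard error from the latest reported run.
low, high = confidence_interval(0.6377269139700079, 0.013508855476252508)
print(f"WinoGrande accuracy: {low:.3f} to {high:.3f}")
```

An interval like this helps when comparing two models or two runs: a difference smaller than the interval width is hard to distinguish from noise.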
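Usage
The dataset can be loaded with the Hugging Face datasets library. For example, call the load_dataset function with the dataset name and the target configuration (such as 'harness_winogrande_5'), then use the split parameter to select either a timestamped split or 'train' for the latest results. The data is stored in Parquet format, which makes it efficient to read. Users can draw on this fine-grained evaluation data to reproduce Leaderboard rankings or to run their own model comparisons. The aggregated metrics in the 'results' configuration also give a quick overview of the model's performance across all tasks.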
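Background and challenges
Background overview
As large language models (LLMs) have proliferated, evaluating them systematically and fairly has become a shared concern of academia and industry. The Open LLM Leaderboard was created in 2023 under the lead of the HuggingFace team to give the open-source community a standardized, transparent evaluation platform. This dataset is part of its evaluation runs and focuses on heegyu/WizardVicuna-Uncensored-3B-0719, a 3B-parameter conversational model. Through established benchmarks such as ARC, HellaSwag, MMLU, and GSM8K, it probes the limits of the model's abilities in commonsense reasoning, mathematical problem solving, and knowledge understanding. The evaluation gives model developers a direct performance reference and has encouraged healthy competition and iteration in the open-source LLM community.
Current challenges
The main challenges this dataset reflects fall into two areas. On the domain side, the model reaches about 63.77% accuracy on Winogrande but only 1.14% on the GSM8K mathematical reasoning task, which exposes a clear weakness of small models in complex reasoning and symbolic computation; making their reasoning more robust remains an open problem. On the construction side, the dataset combines results from runs with different timestamps, and each run may not cover exactly the same set of tasks, which makes the results harder to compare consistently. The diversity of the evaluation tasks (MMLU alone covers 57 subjects) also demands efficient data storage and loading. The Parquet format and per-task configurations ease part of this, but standardized aggregate analysis across tasks and runs remains a technical bottleneck.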
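Common scenarios
Classic use cases
In large language model evaluation, the evaluation datasets from the Open LLM Leaderboard have become a widely used benchmark for a model's overall capability. This dataset records the performance of the heegyu/WizardVicuna-Uncensored-3B-0719 model on several standard test tasks: ARC Challenge, DROP reading comprehension, GSM8K mathematical reasoning, HellaSwag commonsense reasoning, MMLU multi-subject knowledge, and WinoGrande pronoun disambiguation. By loading the Parquet files of the individual task configurations, researchers can obtain the model's detailed scores and error ranges for each task, and use them to compare models against each other or to track a model across iterations.
Practical applications
In practice, the dataset serves as a reference for model selection and deployment. By looking at the model's F1 on the DROP reading comprehension task (about 0.053) and its accuracy on GSM8K mathematical reasoning (about 0.011), developers can quickly judge how far this 3B-parameter model can be trusted in complex reasoning scenarios. For example, when building a knowledge question-answering system, the 63.8% accuracy on WinoGrande indicates how well the model handles ambiguous pronoun references, and so helps decide whether an additional disambiguation module is needed. Decisions grounded in this kind of empirical data reduce the trial-and-error cost of putting a model into production.
Derived related work
The dataset has given rise to a series of studies on the capability limits of small-parameter models. Building on its public evaluation results, researchers have analyzed knowledge retention and safety alignment in WizardVicuna-Uncensored-3B under uncensored settings, which led to fine-grained safety evaluation methods for 3B-scale models. The dataset's multi-task score matrix also provides a baseline for research on model compression and knowledge distillation, and has supported improvements to techniques such as task-specific fine-tuning strategies and mixed-precision training. In addition, its public evaluation code and data pipeline have become a standard reference implementation for later LLM evaluation frameworks.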
The above content was collected and summarized by 遇见数据集.