open-llm-leaderboard/details_dhmeltzer__Llama-2-13b-hf-eli5-wiki-1024_r_64_alpha_16_merged

Name: open-llm-leaderboard/details_dhmeltzer__Llama-2-13b-hf-eli5-wiki-1024_r_64_alpha_16_merged
Creator: open-llm-leaderboard
Published: 2023-10-28 16:42:47
License: No description available

Hugging Face | Updated 2023-10-28 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/open-llm-leaderboard/details_dhmeltzer__Llama-2-13b-hf-eli5-wiki-1024_r_64_alpha_16_merged


Resource overview:

This dataset was created automatically during the evaluation of the model dhmeltzer/Llama-2-13b-hf-eli5-wiki-1024_r_64_alpha_16_merged on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to one evaluation task. The dataset was built from 2 runs; each run appears as a separate split in each configuration, and the split is named after the run's timestamp. The "train" split always points to the latest results. An additional configuration named "results" stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.

Provided by:

open-llm-leaderboard

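Summary of Original Information

Dataset Overview

Dataset Source

This dataset was created automatically while the model dhmeltzer/Llama-2-13b-hf-eli5-wiki-1024_r_64_alpha_16_merged was being evaluated on the Open LLM Leaderboard.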

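Dataset Structure

Number of configurations: 64, each corresponding to one evaluation task.
Data source: the dataset was created from 2 runs. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
Latest results: the "train" split always points to the latest results.
Aggregated results: an additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.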
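The timestamp-based split naming can be illustrated with a small sketch. It assumes, based only on the split names and file paths listed on this page, that a split name is the file-name timestamp with "-" replaced by "_". Both helper names are hypothetical and are not part of any library.

```python
from datetime import datetime
from typing import Sequence

# Timestamp layout as it appears in the parquet file names on this page,
# e.g. "2023-10-03T17-16-44.707859".
FILE_TS_FORMAT = "%Y-%m-%dT%H-%M-%S.%f"


def split_name_from_file_timestamp(file_ts: str) -> str:
    """Convert a file-name timestamp into a split name.

    Assumption: the split name is the file timestamp with "-" replaced by "_".
    """
    return file_ts.replace("-", "_")


def latest_run(file_timestamps: Sequence[str]) -> str:
    """Return the most recent run timestamp (hypothetical helper)."""
    return max(
        file_timestamps,
        key=lambda ts: datetime.strptime(ts, FILE_TS_FORMAT),
    )


# The two run timestamps that appear in the configuration listing.
runs = ["2023-10-03T17-16-44.707859", "2023-10-28T16-42-34.900797"]
print(split_name_from_file_timestamp(runs[0]))
print(latest_run(runs))
```

Parsing the timestamps before comparing them keeps the ordering correct even if the format ever stops sorting lexicographically.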

Data Loading Example

```python
from datasets import load_dataset

# Load the latest results of the 5-shot Winogrande evaluation.
data = load_dataset(
    "open-llm-leaderboard/details_dhmeltzer__Llama-2-13b-hf-eli5-wiki-1024_r_64_alpha_16_merged",
    "harness_winogrande_5",
    split="train",
)
```
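With 64 configurations, it can help to enumerate what is available before loading anything. The sketch below uses `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library; the wrapper names `list_configs_and_splits` and `load_details` are hypothetical. The functions need network access and make one request per configuration. Because the text on this page mentions a "train" split while the configuration listing shows a split named "latest", the split is left as a parameter.

```python
REPO_ID = (
    "open-llm-leaderboard/"
    "details_dhmeltzer__Llama-2-13b-hf-eli5-wiki-1024_r_64_alpha_16_merged"
)


def list_configs_and_splits(repo_id: str = REPO_ID) -> dict:
    """Map each configuration name to its split names (requires network)."""
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import get_dataset_config_names, get_dataset_split_names

    return {
        config: get_dataset_split_names(repo_id, config)
        for config in get_dataset_config_names(repo_id)
    }


def load_details(config: str, split: str = "train", repo_id: str = REPO_ID):
    """Load one configuration; `split` may also be a timestamped split name."""
    from datasets import load_dataset

    return load_dataset(repo_id, config, split=split)
```

A call such as `load_details("harness_winogrande_5")` mirrors the loading example, and passing a timestamped split name selects one specific run instead.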

Configuration Details

The following lists some of the dataset's configurations:

Configuration name: harness_arc_challenge_25
- Split: 2023_10_03T17_16_44.707859
  - Path: **/details_harness|arc:challenge|25_2023-10-03T17-16-44.707859.parquet
- Split: latest
  - Path: **/details_harness|arc:challenge|25_2023-10-03T17-16-44.707859.parquet
Configuration name: harness_drop_3
- Split: 2023_10_28T16_42_34.900797
  - Path: **/details_harness|drop|3_2023-10-28T16-42-34.900797.parquet
- Split: latest
  - Path: **/details_harness|drop|3_2023-10-28T16-42-34.900797.parquet
Configuration name: harness_gsm8k_5
- Split: 2023_10_28T16_42_34.900797
  - Path: **/details_harness|gsm8k|5_2023-10-28T16-42-34.900797.parquet
- Split: latest
  - Path: **/details_harness|gsm8k|5_2023-10-28T16-42-34.900797.parquet
Configuration name: harness_hellaswag_10
- Split: 2023_10_03T17_16_44.707859
  - Path: **/details_harness|hellaswag|10_2023-10-03T17-16-44.707859.parquet
- Split: latest
  - Path: **/details_harness|hellaswag|10_2023-10-03T17-16-44.707859.parquet
Configuration name: harness_hendrycksTest_5
- Split: 2023_10_03T17_16_44.707859
  - Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-anatomy|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-astronomy|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-business_ethics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-college_biology|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-college_medicine|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-college_physics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-computer_security|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-econometrics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-formal_logic|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-global_facts|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_biology|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_european_history|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_geography|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_physics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_psychology|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_statistics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_us_history|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-high_school_world_history|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-human_aging|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-human_sexuality|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-international_law|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-jurisprudence|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-logical_fallacies|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-machine_learning|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-management|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-marketing|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-medical_genetics|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-miscellaneous|5_2023-10-03T17-16-44.707859.parquet
  - Path: **/details_harness|hendrycksTest-moral_disputes|5_2023-10-03T17-16-44.707859.parquet
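The file paths in the listing follow a visible pattern: a suite name, a task name, a number, and a run timestamp, separated by "|" and "_". The sketch below splits a path into those parts. The regular expression and the field names (`suite`, `task`, `shots`, `timestamp`) are assumptions inferred from the paths shown here, not a documented format; in particular, reading the number as a few-shot count is an interpretation.

```python
import re
from typing import Dict, Optional

# Pattern inferred from paths such as
# "**/details_harness|arc:challenge|25_2023-10-03T17-16-44.707859.parquet".
PATH_PATTERN = re.compile(
    r"details_(?P<suite>[^|]+)\|(?P<task>[^|]+)\|(?P<shots>\d+)"
    r"_(?P<timestamp>.+)\.parquet$"
)


def parse_details_path(path: str) -> Optional[Dict[str, str]]:
    """Split a details file path into its parts, or return None on no match."""
    match = PATH_PATTERN.search(path)
    return match.groupdict() if match else None


example = (
    "**/details_harness|hendrycksTest-anatomy|5_2023-10-03T17-16-44.707859.parquet"
)
print(parse_details_path(example))
```

Returning `None` for unmatched paths lets a caller skip files that do not follow the pattern, such as any aggregated results file, rather than fail on them.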

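Collected Summary

Dataset Introduction

Construction

In large language model evaluation, the rigor with which a dataset is built determines how far its results can be trusted. This dataset is a by-product of the Open LLM Leaderboard and is built in an automated, standardized way. It was generated automatically while a specific model was being evaluated and covers 64 evaluation configurations, each corresponding to an independent evaluation task. The data comes from two separate evaluation runs; the results of each run are stored as splits named after the run's timestamp, so data versions stay traceable. The dataset also includes a dedicated "results" configuration that collects the aggregated metrics of all runs and feeds the leaderboard's overall score.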

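Features

The dataset's defining features are its broad evaluation coverage and its fine-grained structure. It spans tasks from commonsense reasoning and math problem solving to specialized subject knowledge, for example the ARC Challenge, DROP, GSM8K, and the Hendrycks Test, which covers dozens of subjects. Structurally, every task configuration carries detailed split information, including a "latest" split that points to the most recent evaluation results, so users can quickly retrieve the newest data. This design supports in-depth analysis of a model's performance and provides a basis for comparing models against one another.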

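Usage

To use the dataset for model evaluation or research analysis, load it with the Hugging Face datasets library: call load_dataset with the dataset name, the target task configuration (for example "harness_winogrande_5"), and the desired split (for example "train" for the latest results). The loaded data is typically structured and contains the model's detailed performance metrics on each task, such as accuracy, F1 score, and their standard errors. Researchers can use it for performance analysis, error diagnosis, or cross-model comparison, and so inform further work on large language models.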
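To make the relationship between an accuracy and its standard error concrete, here is a minimal sketch that computes both from per-example scores. It assumes binary correctness scores (1 for correct, 0 for incorrect) and uses one common definition of the standard error, the sample standard deviation divided by the square root of the sample size. Whether the leaderboard uses exactly this formula is not stated on this page, and the function name is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def accuracy_with_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean score, standard error of the mean).

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(n)
    return mean, stderr


# Four hypothetical examples, three answered correctly.
acc, se = accuracy_with_stderr([1, 0, 1, 1])
print(f"accuracy={acc:.3f} stderr={se:.3f}")
```

The standard error shrinks with the square root of the number of examples, which is why small differences between models on small tasks should be read with caution.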

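Background and Challenges

Background Overview

In large language model (LLM) evaluation, as model architectures and training strategies keep evolving, measuring performance across diverse tasks systematically and objectively has become a central research question. This dataset was created in 2023 by the Hugging Face community as part of the Open LLM Leaderboard, to evaluate the model dhmeltzer/Llama-2-13b-hf-eli5-wiki-1024_r_64_alpha_16_merged along multiple dimensions. It quantifies the model's performance on 64 tasks, including commonsense reasoning, math problem solving, and knowledge question answering, and its standardized evaluation process gives the community a reproducible, comparable performance baseline that supports transparent LLM development.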

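Current Challenges

The dataset addresses the challenge of evaluating the overall capabilities of large language models, which comes down to designing an evaluation suite that covers a model's cognitive and reasoning abilities broadly. Specific challenges include: how to select representative, discriminating tasks that reflect a model's real capabilities; how to keep evaluation standards consistent and fair across tasks; and, during construction, how to integrate the results of multiple evaluation runs and handle large volumes of evaluation data efficiently while keeping the data structure clear and accessible for in-depth analysis and model comparison.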

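Common Use Cases

Typical Use Case

As an output of Open LLM Leaderboard evaluation runs, this dataset is typically used for fine-grained analysis of model performance. Because it covers benchmark tasks such as the ARC Challenge, HellaSwag, and Winogrande, users can examine how the model performs on commonsense reasoning, language understanding, and math problem solving, which gives model optimization and comparison an empirical basis.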

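Derived and Related Work

Work derived from this dataset centers on evaluation methodology and on mapping model capabilities. For example, model ranking schemes based on aggregated multi-task results have led to dynamic leaderboard designs; its fine-grained error analysis has inspired targeted studies of model bias; and the dataset is often cited when validating techniques such as model merging and domain adaptation, which makes it a reference point for the development of large models.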

Recent Research on the Dataset