open-llm-leaderboard/details_KoboldAI__GPT-J-6B-Adventure

Name: open-llm-leaderboard/details_KoboldAI__GPT-J-6B-Adventure
Creator: open-llm-leaderboard
Published: 2023-10-23 18:47:31
License: not specified

Hugging Face · Updated 2023-10-23 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/open-llm-leaderboard/details_KoboldAI__GPT-J-6B-Adventure

Overview:

This dataset was automatically created during the evaluation of the KoboldAI/GPT-J-6B-Adventure model on the Open LLM Leaderboard. The dataset consists of 64 configurations, each corresponding to one evaluation task. The dataset is generated from two evaluation runs, with the results of each run stored as a split under its specific configuration, where the split name uses the timestamp of the run. The "train" split always points to the most recent results. In addition, there is a configuration named "results" that stores the aggregated results of all runs and is used to calculate and display the aggregate metrics on the Open LLM Leaderboard.

Provider:

open-llm-leaderboard

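Summary of source information

Dataset overview

Dataset source

This dataset was created automatically during evaluation runs of the model KoboldAI/GPT-J-6B-Adventure; the evaluation results are shown on the Open LLM Leaderboard.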

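Dataset structure

The dataset contains 64 configurations, each corresponding to one evaluation task.
The dataset was created from 2 runs. Within each configuration, every run appears as its own split, named after the run's timestamp.
The "train" split always points to the latest results.
An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregate metrics on the Open LLM Leaderboard.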
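Because split names encode the run timestamp, the most recent run can be found by parsing them. The following is a minimal sketch, assuming split names follow the `YYYY_MM_DDTHH_MM_SS.ffffff` shape seen in this listing; the helper names are hypothetical and are not part of any library.

```python
from datetime import datetime


def parse_split_timestamp(split_name: str) -> datetime:
    """Parse a split name such as '2023_07_19T15_46_02.673119'."""
    return datetime.strptime(split_name, "%Y_%m_%dT%H_%M_%S.%f")


def latest_run_split(split_names):
    """Return the most recent timestamped split, skipping aliases such as 'latest'."""
    timestamped = []
    for name in split_names:
        try:
            timestamped.append((parse_split_timestamp(name), name))
        except ValueError:
            continue  # not a timestamped split (for example 'latest' or 'train')
    if not timestamped:
        raise ValueError("no timestamped splits found")
    return max(timestamped)[1]


# Illustrative input: the two run timestamps that appear in this listing.
splits = ["2023_07_19T15_46_02.673119", "latest", "2023_10_23T18_47_17.951129"]
print(latest_run_split(splits))
```

In this listing the two timestamps belong to different configurations, so the combined list is only an illustration of the comparison.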

Loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_KoboldAI__GPT-J-6B-Adventure",
    "harness_winogrande_5",
    split="train",
)
```
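The dataset identifier appears to be built from the model name. The sketch below wraps that idea in two hypothetical helpers. The rule that `/` in the model name becomes `__` is an assumption inferred from this single dataset name, and `load_details` needs network access plus the `datasets` package, so it is defined here but not called.

```python
def details_repo_id(model_id: str) -> str:
    """Build the details dataset id (assumed rule: '/' in the model id becomes '__')."""
    return "open-llm-leaderboard/details_" + model_id.replace("/", "__")


def load_details(model_id: str, config: str, split: str = "train"):
    """Load one task configuration; requires network access and the `datasets` package."""
    # Imported lazily so details_repo_id() stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(details_repo_id(model_id), config, split=split)


print(details_repo_id("KoboldAI/GPT-J-6B-Adventure"))
```

With network access, `load_details("KoboldAI/GPT-J-6B-Adventure", "harness_winogrande_5")` would make the same call as the loading example.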

Configuration details

The dataset's configurations are listed below:

harness_arc_challenge_25
- Split: 2023_07_19T15_46_02.673119
  - Path: **/details_harness|arc:challenge|25_2023-07-19T15:46:02.673119.parquet
- Split: latest
  - Path: **/details_harness|arc:challenge|25_2023-07-19T15:46:02.673119.parquet
harness_drop_3
- Split: 2023_10_23T18_47_17.951129
  - Path: **/details_harness|drop|3_2023-10-23T18-47-17.951129.parquet
- Split: latest
  - Path: **/details_harness|drop|3_2023-10-23T18-47-17.951129.parquet
harness_gsm8k_5
- Split: 2023_10_23T18_47_17.951129
  - Path: **/details_harness|gsm8k|5_2023-10-23T18-47-17.951129.parquet
- Split: latest
  - Path: **/details_harness|gsm8k|5_2023-10-23T18-47-17.951129.parquet
harness_hellaswag_10
- Split: 2023_07_19T15_46_02.673119
  - Path: **/details_harness|hellaswag|10_2023-07-19T15:46:02.673119.parquet
- Split: latest
  - Path: **/details_harness|hellaswag|10_2023-07-19T15:46:02.673119.parquet
harness_hendrycksTest_5
- Split: 2023_07_19T15_46_02.673119
  - Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-anatomy|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-astronomy|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-business_ethics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-college_biology|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-college_medicine|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-college_physics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-computer_security|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-econometrics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-formal_logic|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-global_facts|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-human_aging|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-international_law|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-machine_learning|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-management|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-marketing|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-nutrition|5_2023-07-19T15:46:02.673119.parquet
  - Path: **/details_harness|hendrycksTest-philosophy|5_2023-07-19T15:46:02.673119.parquet
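The file paths above share a visible pattern: `details_<suite>|<task>|<number>_<timestamp>.parquet`. As a minimal sketch, the hypothetical parser below splits a path into those parts. Treating the number as the few-shot count is an assumption, and `DetailsFile` and `parse_details_path` are illustrative names rather than part of any library.

```python
import re
from typing import NamedTuple


class DetailsFile(NamedTuple):
    suite: str        # e.g. 'harness'
    task: str         # e.g. 'arc:challenge'
    num_fewshot: int  # assumed meaning of the number after the task name
    timestamp: str    # kept verbatim; the listing uses both ':' and '-' as time separators


_PATTERN = re.compile(
    r"details_(?P<suite>[^|]+)\|(?P<task>[^|]+)\|(?P<shots>\d+)_(?P<timestamp>.+)\.parquet$"
)


def parse_details_path(path: str) -> DetailsFile:
    """Split a details file path into suite, task, shot count, and run timestamp."""
    match = _PATTERN.search(path)
    if match is None:
        raise ValueError(f"unrecognised details path: {path!r}")
    return DetailsFile(
        match["suite"], match["task"], int(match["shots"]), match["timestamp"]
    )


print(parse_details_path(
    "**/details_harness|arc:challenge|25_2023-07-19T15:46:02.673119.parquet"
))
```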

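Collected summary

Dataset introduction

Construction

The Open LLM Leaderboard is a widely used benchmark platform for systematically evaluating large language models. This dataset is the evaluation record generated automatically under that framework for the KoboldAI/GPT-J-6B-Adventure model. It was built from two separate evaluation runs: each run's results are stored as a split named after the run's timestamp, while the 'train' split always points to the output of the most recent run. The dataset consists of 64 configurations, each corresponding to one evaluated task, such as ARC Challenge, DROP, or GSM8K. An additional configuration named 'results' collects the aggregate metrics across all tasks and supplies the data behind the leaderboard's overall score.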

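Features

The dataset's most distinctive feature is that its results are organized by run time. Every evaluation run is kept in full, and the timestamped splits make results traceable and versioned, so researchers can compare the model's results across runs. Each task configuration is stored separately, and together they cover a broad range of evaluation dimensions: commonsense reasoning (e.g., Winogrande), mathematical problem solving (e.g., GSM8K), and multi-domain knowledge question answering (e.g., the 57 MMLU subtasks). The data is stored in Parquet format, which is efficient and broadly compatible. It reports evaluation metrics including accuracy (acc), exact match (em), and F1 score, each with a standard error, which supports rigorous and reproducible evaluation.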
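To make "accuracy with a standard error" concrete, here is a minimal sketch of one common way to compute both from per-example correctness: the mean of 0/1 scores, and the sample standard deviation divided by the square root of the sample size. This is an assumption about the calculation, not a statement of how the leaderboard computes its reported values, and the input list is made up.

```python
import math
import statistics


def accuracy_with_stderr(correct):
    """Return (accuracy, standard error of the mean) for a list of booleans."""
    scores = [1.0 if c else 0.0 for c in correct]
    if len(scores) < 2:
        raise ValueError("need at least two examples to estimate a standard error")
    acc = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return acc, stderr


# Made-up outcomes for four examples: three correct, one wrong.
acc, stderr = accuracy_with_stderr([True, True, False, True])
print(f"acc={acc:.4f} +/- {stderr:.4f}")
```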

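Usage

Researchers can load the dataset with the Hugging Face datasets library. When calling load_dataset, specify the full dataset identifier 'open-llm-leaderboard/details_KoboldAI__GPT-J-6B-Adventure', the configuration name of the target task (e.g., 'harness_winogrande_5'), and the desired split (e.g., 'latest' or a specific timestamp). For example, 'split="train"' loads the results of the most recent evaluation. The loaded data can be used directly to analyze the model's fine-grained performance on a given task, while the 'results' configuration provides the aggregated metrics, supporting comparison across models and deeper diagnosis of model performance.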
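Selecting a specific run means passing its timestamped split name. In the configuration listing, split names differ from the timestamps in the file names only in that `-` and `:` become `_`. The sketch below applies that observed rule; the function name is hypothetical and the rule is inferred from the entries shown on this page.

```python
import re


def split_name_from_timestamp(timestamp: str) -> str:
    """Turn a run timestamp from a file name into the matching split name."""
    # Observed rule: every '-' and ':' is replaced by '_'; the '.' before microseconds stays.
    return re.sub(r"[-:]", "_", timestamp)


print(split_name_from_timestamp("2023-07-19T15:46:02.673119"))
print(split_name_from_timestamp("2023-10-23T18-47-17.951129"))
```

The resulting string could then be passed as the `split` argument of `load_dataset` in place of "train" or "latest".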

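Background and challenges

Background overview

As large language models (LLMs) develop rapidly, evaluating model performance in a systematic, standardized way has become a central concern for both academia and industry. The Open LLM Leaderboard was created by the Hugging Face team in 2023 to give the open-source community a transparent, reproducible model evaluation platform built on a unified multi-task benchmark framework. This dataset is the evaluation record generated automatically for the KoboldAI/GPT-J-6B-Adventure model. It covers tasks including ARC Challenge, DROP, GSM8K, HellaSwag, WinoGrande, and MMLU with its 57 subjects. As part of the evaluation infrastructure, the dataset records the model's metrics on each task and, through timestamped splits from multiple runs, keeps the evaluation process traceable. This gives later research a reference point and contributes to the standardization of LLM evaluation.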

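Current challenges

The core challenge this dataset addresses is building a comprehensive and fair way to measure LLM performance. At the domain level, existing evaluations often fail to reflect a model's real ability to generalize, either because they cover a single task or because of data leakage. The dataset therefore combines diverse tasks, from commonsense reasoning (e.g., HellaSwag) and mathematical problem solving (e.g., GSM8K) to reading comprehension (e.g., DROP) and grade-school science reasoning (e.g., ARC), to overcome the limits of any single benchmark. The technical challenges in construction are considerable. Data must stay consistent across the 64 configurations and multiple runs. Metrics that differ by task (exact match, F1 score, accuracy) must be stored and aggregated despite their heterogeneity. The Parquet format, with its efficient compression and the splitting strategy, has to preserve data integrity while keeping large-scale evaluation results reproducible and extensible. These design choices underpin the reliability of the automated evaluation pipeline.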

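Common scenarios

Typical use cases

In large language model evaluation, the Open LLM Leaderboard's evaluation data is a foundation for comparing model performance. This dataset is specific to the KoboldAI/GPT-J-6B-Adventure model. Its 64 task configurations cover benchmarks including ARC Challenge, DROP, GSM8K, HellaSwag, and HendrycksTest, and systematically record the model's performance on commonsense reasoning, mathematical problem solving, text comprehension, and multi-subject knowledge. Researchers can use the dataset to reproduce the model's detailed scores on a given task, or treat it as a baseline and compare other models against it under the same evaluation framework, supporting quantitative study of language model capabilities.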

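Practical applications

In practice, the dataset can inform model selection and deployment decisions. Developers can use the detailed results it records, such as 55.96% accuracy on Winogrande, to judge how suitable GPT-J-6B-Adventure is for pronoun disambiguation, and configure the model accordingly in a chatbot or text generation system. Organizations can also use it as a quality assurance tool: during model iteration, they can consult the historical evaluation data to check that a new version does not regress on key metrics, easing the move from lab to product.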
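A regression check of the kind described above can be sketched as a comparison of two metric dictionaries. The function and the candidate values below are hypothetical; only the 0.5596 Winogrande accuracy comes from the text, and the other numbers are made up to exercise the check.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return metrics where the candidate scores lower than baseline by more than tolerance."""
    regressions = {}
    for metric, old in baseline.items():
        new = candidate.get(metric)
        if new is not None and new < old - tolerance:
            regressions[metric] = (old, new)
    return regressions


# Baseline: Winogrande accuracy quoted in the text; the gsm8k value is made up.
baseline = {"winogrande_acc": 0.5596, "gsm8k_acc": 0.02}
# Candidate: entirely hypothetical values for a newer model version.
candidate = {"winogrande_acc": 0.5400, "gsm8k_acc": 0.03}

print(find_regressions(baseline, candidate, tolerance=0.005))
```

A tolerance close to the reported standard error may be a reasonable starting point, so that differences within noise are not flagged.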

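Derived and related work

A number of follow-on efforts build on this dataset. The Open LLM Leaderboard is itself a continuously updated evaluation platform, and its data format has been reused for evaluations of later models such as LLaMA and Falcon. Researchers have also used the detailed per-task breakdown to develop fine-grained analysis tools for specific capabilities, such as mathematical reasoning. The dataset has further prompted optimized versions of the 'Harness' evaluation framework, which lower the barrier to onboarding new models through a standardized interface and have helped the language model evaluation ecosystem grow.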

The content above was collected and summarized by 遇见数据集.