open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k

Name: open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k
Creator: open-llm-leaderboard
Published: 2023-10-16 16:20:45
License: No description available

Source: Hugging Face. Updated 2023-10-16. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k

Resource description:

---
pretty_name: Evaluation run of lmsys/longchat-7b-v1.5-32k
repo_url: https://huggingface.co/lmsys/longchat-7b-v1.5-32k
leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
point_of_contact: clementine@hf.co
configs:
- config_name: harness_drop_3
  data_files:
  - split: 2023_10_16T16_20_33.188247
    path:
    - '**/details_harness|drop|3_2023-10-16T16-20-33.188247.parquet'
  - split: latest
    path:
    - '**/details_harness|drop|3_2023-10-16T16-20-33.188247.parquet'
- config_name: harness_gsm8k_5
  data_files:
  - split: 2023_10_16T16_20_33.188247
    path:
    - '**/details_harness|gsm8k|5_2023-10-16T16-20-33.188247.parquet'
  - split: latest
    path:
    - '**/details_harness|gsm8k|5_2023-10-16T16-20-33.188247.parquet'
- config_name: harness_winogrande_5
  data_files:
  - split: 2023_10_16T16_20_33.188247
    path:
    - '**/details_harness|winogrande|5_2023-10-16T16-20-33.188247.parquet'
  - split: latest
    path:
    - '**/details_harness|winogrande|5_2023-10-16T16-20-33.188247.parquet'
- config_name: results
  data_files:
  - split: 2023_10_16T16_20_33.188247
    path:
    - results_2023-10-16T16-20-33.188247.parquet
  - split: latest
    path:
    - results_2023-10-16T16-20-33.188247.parquet
---

# Dataset Card for Evaluation run of lmsys/longchat-7b-v1.5-32k

## Dataset Description

- **Homepage:**
- **Repository:** https://huggingface.co/lmsys/longchat-7b-v1.5-32k
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [lmsys/longchat-7b-v1.5-32k](https://huggingface.co/lmsys/longchat-7b-v1.5-32k) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 3 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the most recent results.

An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# The configs above define only the timestamped split and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-10-16T16:20:33.188247](https://huggingface.co/datasets/open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k/blob/main/results_2023-10-16T16-20-33.188247.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks; you can find each in the "results" configuration and in the "latest" split of each eval):

```python
{
    "all": {
        "em": 0.08252936241610738,
        "em_stderr": 0.0028179934761829416,
        "f1": 0.1372829278523486,
        "f1_stderr": 0.0030245592633561815,
        "acc": 0.3672124310289838,
        "acc_stderr": 0.009455449816488642
    },
    "harness|drop|3": {
        "em": 0.08252936241610738,
        "em_stderr": 0.0028179934761829416,
        "f1": 0.1372829278523486,
        "f1_stderr": 0.0030245592633561815
    },
    "harness|gsm8k|5": {
        "acc": 0.047763457164518575,
        "acc_stderr": 0.005874387536229305
    },
    "harness|winogrande|5": {
        "acc": 0.6866614048934491,
        "acc_stderr": 0.01303651209674798
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]

Provided by:

open-llm-leaderboard

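Original information summary

Dataset card for Evaluation run of lmsys/longchat-7b-v1.5-32k

Dataset description

Dataset overview

This dataset was created automatically during the evaluation run of the model lmsys/longchat-7b-v1.5-32k on the Open LLM Leaderboard.

The dataset consists of 3 configurations, each corresponding to one evaluated task.

The dataset was created from 1 run. Each run appears as a separate split in each configuration, named after the run's timestamp. The "latest" split always points to the most recent results.

An additional configuration, "results", stores the aggregated results of the run (and is used to compute and display the aggregated metrics on the Open LLM Leaderboard).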

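To load the details of a run, you can do the following:

```python
from datasets import load_dataset

# "latest" is the split name listed in the card's configs.
data = load_dataset(
    "open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k",
    "harness_winogrande_5",
    split="latest",
)
```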

Latest results

These are the latest results, from run 2023-10-16T16:20:33.188247:

```python
{
    "all": {
        "em": 0.08252936241610738,
        "em_stderr": 0.0028179934761829416,
        "f1": 0.1372829278523486,
        "f1_stderr": 0.0030245592633561815,
        "acc": 0.3672124310289838,
        "acc_stderr": 0.009455449816488642
    },
    "harness|drop|3": {
        "em": 0.08252936241610738,
        "em_stderr": 0.0028179934761829416,
        "f1": 0.1372829278523486,
        "f1_stderr": 0.0030245592633561815
    },
    "harness|gsm8k|5": {
        "acc": 0.047763457164518575,
        "acc_stderr": 0.005874387536229305
    },
    "harness|winogrande|5": {
        "acc": 0.6866614048934491,
        "acc_stderr": 0.01303651209674798
    }
}
```
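The "all" entry in these results looks like an unweighted mean of each metric over the tasks that report it. That is an inference from the numbers, not documented behaviour, and the helper name `aggregate` is hypothetical. A minimal sketch that recomputes it:

```python
from collections import defaultdict
from statistics import fmean

# Per-task results copied from the latest-results block.
per_task = {
    "harness|drop|3": {
        "em": 0.08252936241610738,
        "em_stderr": 0.0028179934761829416,
        "f1": 0.1372829278523486,
        "f1_stderr": 0.0030245592633561815,
    },
    "harness|gsm8k|5": {
        "acc": 0.047763457164518575,
        "acc_stderr": 0.005874387536229305,
    },
    "harness|winogrande|5": {
        "acc": 0.6866614048934491,
        "acc_stderr": 0.01303651209674798,
    },
}

# The "all" entry as reported in the same block.
reported_all = {
    "em": 0.08252936241610738,
    "em_stderr": 0.0028179934761829416,
    "f1": 0.1372829278523486,
    "f1_stderr": 0.0030245592633561815,
    "acc": 0.3672124310289838,
    "acc_stderr": 0.009455449816488642,
}


def aggregate(results):
    """Average each metric over the tasks that report it (assumed rule)."""
    buckets = defaultdict(list)
    for metrics in results.values():
        for name, value in metrics.items():
            buckets[name].append(value)
    return {name: fmean(values) for name, values in buckets.items()}


recomputed = aggregate(per_task)
for name, value in reported_all.items():
    print(f"{name}: reported={value:.6f} recomputed={recomputed[name]:.6f}")
```

Because only GSM8K and WinoGrande report `acc`, the aggregated accuracy averages a very low score with a much higher one and says little about either task on its own.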

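Collected summary

Dataset introduction

Construction method

The Open LLM Leaderboard is a widely used benchmark for comparing large language models, and it produces one evaluation dataset per evaluated model. This dataset was generated automatically to record the evaluation of lmsys/longchat-7b-v1.5-32k on the Leaderboard. It is built from a single evaluation run covering three tasks, which correspond to the dataset's three configurations (harness_drop_3, harness_gsm8k_5, harness_winogrande_5). Each run's results are stored as a separate split in each configuration, named after the run's timestamp, while the "latest" split always points to the most recent run. A further "results" configuration stores the aggregated metrics used to compute and display the overall scores on the Leaderboard.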
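The naming scheme can be read off the configs in the card: the task key `harness|winogrande|5` becomes the configuration `harness_winogrande_5`, and the parquet path embeds the task key plus the run timestamp with colons replaced by hyphens. The sketch below reproduces that pattern; the function names `config_name` and `details_glob` are hypothetical, and the rule is inferred from this one card rather than from any specification.

```python
def config_name(task_key: str) -> str:
    """Map a task key such as 'harness|drop|3' to its configuration name."""
    return task_key.replace("|", "_")


def details_glob(task_key: str, run_id: str) -> str:
    """Build the parquet glob for one task and one run (inferred pattern)."""
    file_stamp = run_id.replace(":", "-")  # colons become hyphens in file names
    return f"**/details_{task_key}_{file_stamp}.parquet"


run_id = "2023-10-16T16:20:33.188247"
for key in ("harness|drop|3", "harness|gsm8k|5", "harness|winogrande|5"):
    print(config_name(key), details_glob(key, run_id))
```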

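Usage

Users can load the data with the Hugging Face Datasets library. For example, `load_dataset("open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k", "harness_winogrande_5", split="latest")` returns the latest evaluation details for the WinoGrande task. To access a specific run, replace the split argument with the corresponding timestamp (for example, "2023_10_16T16_20_33.188247"). For aggregated metrics, load the "latest" split of the "results" configuration to get the combined results for all tasks. The data is stored in Parquet format, which supports efficient reads and suits model performance tracking, benchmark reproduction, and downstream analysis workflows.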
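As a rough sketch of this usage, the helper below turns a run timestamp into the split name used in the card's configs and loads one configuration. `split_name` and `load_run` are hypothetical names, the split-name rule is inferred from the single run listed in the card, and `load_run` needs network access and the `datasets` package, so it is defined here but not called.

```python
from typing import Optional

REPO = "open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k"


def split_name(run_id: str) -> str:
    """Convert '2023-10-16T16:20:33.188247' to '2023_10_16T16_20_33.188247'."""
    return run_id.replace("-", "_").replace(":", "_")


def load_run(config: str, run_id: Optional[str] = None):
    """Load one configuration; with no run_id, use the 'latest' split."""
    from datasets import load_dataset  # imported lazily; requires network access

    split = "latest" if run_id is None else split_name(run_id)
    return load_dataset(REPO, config, split=split)


print(split_name("2023-10-16T16:20:33.188247"))
```

For example, `load_run("results")` would request the aggregated metrics, and `load_run("harness_gsm8k_5", "2023-10-16T16:20:33.188247")` the per-example details of that specific run.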

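Background and challenges

Background overview

As large language models have advanced rapidly in natural language processing, evaluating them systematically across diverse tasks has become a key concern. The Open LLM Leaderboard was created by the Hugging Face team in 2023 to provide a standardized, reproducible evaluation benchmark for open-source language models; Clémentine Fourrier and colleagues are its core maintainers. This dataset records the evaluation of lmsys/longchat-7b-v1.5-32k carried out on 16 October 2023, covering three tasks: DROP (reading comprehension), GSM8K (mathematical reasoning), and WinoGrande (coreference resolution). As part of the Open LLM Leaderboard, it gives a transparent basis for comparing model performance and offers the community quantitative evidence about the capabilities of a long-context model.

Current challenges

The challenges around this dataset fall into two groups. On the task side, the model reaches only 8.25% exact match and 13.73% F1 on DROP, which points to clear limits in complex reasoning and multi-hop question answering, and its 4.78% accuracy on GSM8K shows weak mathematical reasoning, a bottleneck for many current language models. On the construction side, the dataset has to store multi-task evaluation results in a structured way, keep runs with different timestamps traceable, and make sure the latest results resolve to a dedicated split, all of which places demands on data versioning and consistency. In addition, metrics must be aggregated across tasks to show overall performance, but the tasks use different measures (EM versus accuracy, for instance), which makes the aggregated results harder to interpret.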

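Common scenarios

Classic use cases

The open-llm-leaderboard/details_lmsys__longchat-7b-v1.5-32k dataset supports standardized performance evaluation of the long-context chat model lmsys/longchat-7b-v1.5-32k. It combines three established tasks from the Hugging Face Open LLM Leaderboard: DROP reading comprehension, GSM8K mathematical reasoning, and WinoGrande pronoun disambiguation. Using the predefined configurations and timestamped splits, researchers can retrieve the model's fine-grained results on each task and examine how a long-context model performs on language understanding, mathematical reasoning, and commonsense reasoning. This makes the dataset a useful reference for checking how well a long-sequence Transformer generalizes on standard benchmarks.

Academic problems addressed

The dataset helps address the lack of shared, comparable measurements for long-context language models. It records longchat-7b-v1.5-32k's results on DROP (exact match and F1), GSM8K (accuracy), and WinoGrande (accuracy), and so quantifies what a long-sequence model actually achieves on reasoning-heavy tasks. The results show that the model reaches nearly 70% accuracy on WinoGrande but less than 5% on GSM8K, which may indicate that long-context adaptation interferes with mathematical reasoning. Findings of this kind feed into the study of how context length interacts with task difficulty, and they provide empirical reference points for improving the training strategies and attention mechanisms of long-sequence Transformers.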
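The reported standard errors give a rough sense of how far apart these scores are. The sketch below computes normal-approximation 95% intervals (value ± 1.96 × stderr). The normal approximation is an assumption made here for illustration; it is not part of the dataset card.

```python
Z95 = 1.96  # two-sided 95% quantile of the standard normal distribution

# (value, stderr) pairs copied from the latest results.
scores = {
    "harness|drop|3 (em)": (0.08252936241610738, 0.0028179934761829416),
    "harness|drop|3 (f1)": (0.1372829278523486, 0.0030245592633561815),
    "harness|gsm8k|5 (acc)": (0.047763457164518575, 0.005874387536229305),
    "harness|winogrande|5 (acc)": (0.6866614048934491, 0.01303651209674798),
}


def ci95(value: float, stderr: float) -> tuple:
    """Return the (low, high) bounds of a normal-approximation 95% interval."""
    half_width = Z95 * stderr
    return value - half_width, value + half_width


intervals = {name: ci95(value, stderr) for name, (value, stderr) in scores.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.4f}, {high:.4f}]")
```

Under this approximation the WinoGrande interval lies above 0.5, the chance level for a two-option task, while the GSM8K interval stays below 0.07.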

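Practical applications

In practice, the dataset offers evidence for judging whether a long-context chat model is reliable enough to deploy. In scenarios such as customer support, document question answering, and tutoring, a model has to handle long user inputs and still reason precisely. The F1 score of 0.137 on DROP and the low accuracy on GSM8K are a warning against applying longchat-7b-v1.5-32k directly to real-time scenarios that need accurate mathematical computation. Its comparatively solid WinoGrande result, on the other hand, suggests potential for pronoun-disambiguation-style tasks. Engineers can therefore use the dataset to choose a model, or to decide on domain fine-tuning, according to whether the application leans on commonsense understanding or on logical reasoning, which lowers the risk of deploying a long-context model.

Recent research on the dataset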