tinyBenchmarks/tinyHellaswag

Name: tinyBenchmarks/tinyHellaswag
Creator: tinyBenchmarks
Published: 2024-05-25 10:44:12
License: not specified

Hugging Face · Updated 2024-05-25 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/tinyBenchmarks/tinyHellaswag

Resource overview:

---
dataset_info:
  features:
  - name: ind
    dtype: int32
  - name: activity_label
    dtype: string
  - name: ctx_a
    dtype: string
  - name: ctx_b
    dtype: string
  - name: ctx
    dtype: string
  - name: endings
    sequence: string
  - name: source_id
    dtype: string
  - name: split
    dtype: string
  - name: split_type
    dtype: string
  - name: label
    dtype: string
  - name: input_formatted
    dtype: string
  splits:
  - name: train
    num_bytes: 160899446
    num_examples: 39905
  - name: test
    num_bytes: 40288101
    num_examples: 10003
  - name: validation
    num_bytes: 473652
    num_examples: 100
  download_size: 50109798
  dataset_size: 201661199
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
language:
- en
pretty_name: tinyHellaswag
size_categories:
- n<1K
multilinguality:
- monolingual
source_datasets:
- Rowan/hellaswag
language_bcp47:
- en-US
---

# tinyHellaswag

Welcome to tinyHellaswag! This dataset serves as a concise version of the [hellaswag](https://huggingface.co/datasets/hellaswag) dataset, offering a subset of 100 data points selected from the original compilation. tinyHellaswag is designed to enable users to efficiently estimate the performance of a large language model (LLM) with reduced dataset size, saving computational resources while maintaining the essence of the hellaswag evaluation.

## Features

- **Compact Dataset:** With only 100 data points, tinyHellaswag provides a swift and efficient way to evaluate your LLM's performance against a benchmark set, maintaining the essence of the original hellaswag dataset.
- **Compatibility:** tinyHellaswag is compatible with evaluation using the [lm evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness/), but can also be integrated into your custom pipeline. See below for more details.

## Model Evaluation

_With lm-eval harness_

Users looking to evaluate a new model with tinyHellaswag can use the [lm evaluation harness (v0.4.1 or later)](https://github.com/EleutherAI/lm-evaluation-harness/). To do so, run the harness with `--tasks=tinyHellaswag`:

```shell
lm_eval --model hf --model_args pretrained="<your-model>" --tasks=tinyHellaswag --batch_size=1
```

The harness directly outputs the best accuracy estimator (IRT++), without any additional steps required.

_Without lm-eval harness_

Alternatively, tinyHellaswag can be integrated into any other pipeline by downloading the data via

```python
from datasets import load_dataset

tiny_data = load_dataset('tinyBenchmarks/tinyHellaswag')['validation']
```

Now, `tiny_data` contains the 100 subsampled data points with the same features as the original dataset, as well as an additional field containing the preformatted data points. The preformatted data points follow the formatting used in the [open llm leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), including the respective in-context examples.

You can then estimate your LLM's performance using the following code. First, ensure you have the tinyBenchmarks package installed:

```shell
pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
```

Then, use the code snippet below for the evaluation:

```python
import numpy as np
import tinyBenchmarks as tb

### Score vector
y = ...  # replace with your original score vector

### Parameters
benchmark = 'hellaswag'

### Evaluation
tb.evaluate(y, benchmark)
```

This process will help you estimate the performance of your LLM against the tinyHellaswag dataset, providing a streamlined approach to benchmarking.

Please be aware that evaluating on multiple GPUs can change the order of outputs in the lm evaluation harness. To use the tinyBenchmarks library, your score vector must follow the original order of tinyHellaswag.
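
The ordering caveat above can be sketched as follows. This is a minimal illustration, not part of the tinyBenchmarks or lm evaluation harness API: it assumes each evaluated example comes back as a hypothetical `(doc_id, score)` pair, where `doc_id` is the example's 0-based position in the `validation` split, and rebuilds the scores in dataset order.

```python
def order_scores(results, n_examples):
    """Return a list of scores ordered by dataset position.

    `results` is an iterable of (doc_id, score) pairs in arbitrary order.
    `doc_id` is ASSUMED to be the 0-based index of the example in the split;
    this pairing is hypothetical and not a documented harness output format.
    """
    ordered = [None] * n_examples
    for doc_id, score in results:
        if not 0 <= doc_id < n_examples:
            raise ValueError(f"doc_id {doc_id} is out of range")
        if ordered[doc_id] is not None:
            raise ValueError(f"duplicate doc_id {doc_id}")
        ordered[doc_id] = float(score)
    missing = [i for i, s in enumerate(ordered) if s is None]
    if missing:
        raise ValueError(f"missing scores for doc_ids {missing}")
    return ordered


# Hypothetical shuffled results for a toy split of 4 examples.
shuffled = [(2, 1.0), (0, 0.0), (3, 1.0), (1, 1.0)]
y_ordered = order_scores(shuffled, n_examples=4)
print(y_ordered)
```

The explicit checks for duplicates and gaps are a design choice: a silently misaligned score vector would still produce an estimate, just a wrong one. The resulting list can be converted with `np.asarray(y_ordered)` before it is passed on.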

For more detailed instructions on evaluating new models and computing scores, please refer to the comprehensive guides available at [lm evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness/) and [tinyBenchmarks GitHub](https://github.com/felipemaiapolo/tinyBenchmarks).

Happy benchmarking!

## More tinyBenchmarks

**Open LLM leaderboard**:
[tiny MMLU](https://huggingface.co/datasets/tinyBenchmarks/tinyMMLU),
[tiny Arc-Challenge](https://huggingface.co/datasets/tinyBenchmarks/tinyAI2_arc),
[tiny Winogrande](https://huggingface.co/datasets/tinyBenchmarks/tinyWinogrande),
[tiny TruthfulQA](https://huggingface.co/datasets/tinyBenchmarks/tinyTruthfulQA),
[tiny GSM8k](https://huggingface.co/datasets/tinyBenchmarks/tinyGSM8k)

**AlpacaEval**:
[tiny AlpacaEval](https://huggingface.co/datasets/tinyBenchmarks/tinyAlpacaEval)

**HELM-lite**: _work-in-progress_

## Citation

    @article{polo2024tinybenchmarks,
      title={tinyBenchmarks: evaluating LLMs with fewer examples},
      author={Felipe Maia Polo and Lucas Weber and Leshem Choshen and Yuekai Sun and Gongjun Xu and Mikhail Yurochkin},
      year={2024},
      eprint={2402.14992},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

    @inproceedings{zellers2019hellaswag,
      title={HellaSwag: Can a Machine Really Finish Your Sentence?},
      author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
      booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
      year={2019}
    }

Provided by:

tinyBenchmarks

Summary of original information

tinyHellaswag dataset overview

Dataset information

Features

ind: int32
activity_label: string
ctx_a: string
ctx_b: string
ctx: string
endings: sequence of string
source_id: string
split: string
split_type: string
label: string
input_formatted: string
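
For orientation, the fields listed above can be mirrored in a small typed record. This is a sketch only: the field names and types come from the feature list, but every example value below is invented for illustration, and reading `label` as the string index of the correct entry in `endings` is an assumption carried over from the upstream HellaSwag convention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TinyHellaswagExample:
    ind: int
    activity_label: str
    ctx_a: str
    ctx_b: str
    ctx: str
    endings: List[str]
    source_id: str
    split: str
    split_type: str
    label: str
    input_formatted: str

    def gold_ending(self) -> str:
        # ASSUMPTION: `label` is a string index ("0", "1", ...) into `endings`.
        return self.endings[int(self.label)]


# All values here are hypothetical placeholders, not real dataset rows.
example = TinyHellaswagExample(
    ind=0,
    activity_label="Making tea",
    ctx_a="A person fills a kettle with water.",
    ctx_b="They",
    ctx="A person fills a kettle with water. They",
    endings=[
        "throw the kettle out of the window.",
        "place the kettle on the stove and turn on the heat.",
        "start painting the kitchen wall.",
        "begin to mow the lawn.",
    ],
    source_id="hypothetical~0",
    split="val",
    split_type="indomain",
    label="1",
    input_formatted="(preformatted prompt text would go here)",
)
print(example.gold_ending())
```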

Data splits

train: 160899446 bytes, 39905 examples
test: 40288101 bytes, 10003 examples
validation: 473652 bytes, 100 examples

Dataset size

Download size: 50109798 bytes
Dataset size: 201661199 bytes
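
As a quick consistency check on the figures above, the three split sizes sum to the reported dataset size. The sketch below copies the numbers from the listing; reading the gap between download size and dataset size as compression of the stored files is an assumption, since the listing does not say so.

```python
# Byte counts copied from the split listing above.
split_bytes = {
    "train": 160_899_446,
    "test": 40_288_101,
    "validation": 473_652,
}
dataset_size = 201_661_199
download_size = 50_109_798

total = sum(split_bytes.values())
print(total == dataset_size)  # True

# Share of the full dataset taken up by the 100-example validation split.
validation_share = split_bytes["validation"] / dataset_size
print(f"validation share of bytes: {validation_share:.2%}")

# ASSUMPTION: the smaller download size reflects compressed storage.
ratio = download_size / dataset_size
print(f"download / dataset size: {ratio:.3f}")
```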

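Configuration

default:
- train: path data/train-*
- test: path data/test-*
- validation: path data/validation-*

Language

en

Name

pretty_name: tinyHellaswag

Size category

n<1K

Multilinguality

monolingual

Source dataset

Rowan/hellaswag

Language (BCP 47)

en-US

Dataset description

tinyHellaswag is a condensed version of the hellaswag dataset containing 100 data points. It is designed to evaluate the performance of large language models (LLMs) efficiently on a smaller dataset while preserving the core of the hellaswag evaluation.

Key characteristics

Compact dataset: With only 100 data points, it offers a fast and efficient way to evaluate LLM performance.
Compatibility: It works with the lm evaluation harness and can also be integrated into custom pipelines.

Model evaluation

With lm-eval harness: Users can evaluate a new model by running the harness with --tasks=tinyHellaswag.
Without lm-eval harness: Users can download the data and evaluate with the tinyBenchmarks library.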

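Collected summary

Dataset introduction

Construction

tinyHellaswag was created to evaluate the commonsense reasoning ability of large language models efficiently. The dataset derives from the HellaSwag benchmark and consists of 100 selected representative samples, retaining the context-completion task format of the original. During construction, the researchers extracted structured data comprising activity labels, context fragments, and multiple candidate endings from the original data and added a preformatted field, so that the dataset reflects the nature of the original task at a much smaller scale.

Characteristics

The dataset's defining characteristics are its compactness and its representativeness. Although it contains only 100 samples, tinyHellaswag inherits the evaluation paradigm of HellaSwag, covering diverse everyday activity scenarios and complex contextual reasoning. Its structure is straightforward: in addition to the original features, it provides preformatted inputs that follow the Open LLM Leaderboard format, so it can be integrated directly into an evaluation pipeline. This design greatly reduces computational cost while still giving a reliable trend-level estimate of model performance.

Usage

Researchers have two options when evaluating a model with this dataset. They can run the tinyHellaswag task from the command line with lm-evaluation-harness, in which case the harness outputs the best accuracy estimate automatically. Alternatively, they can load the data with the datasets library and analyze the results with the dedicated tinyBenchmarks package. The second route requires the user to supply the model's score vector in the original example order; calling the packaged evaluation function then returns a performance estimate based on the subsample, giving a fast, resource-friendly benchmark.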
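
One way to produce the score vector mentioned above is to score every candidate ending and mark an example correct when the top-scoring ending matches its label. The sketch below is illustrative only: the log-likelihood values are made up, `score_example` is a hypothetical helper rather than part of tinyBenchmarks or lm-evaluation-harness, and dividing by the ending's character length is one common normalization, not necessarily the one the harness applies.

```python
def score_example(loglikelihoods, endings, label):
    """Return 1.0 if the best length-normalized ending matches `label`, else 0.0.

    ASSUMPTIONS: `loglikelihoods[i]` is the model's log-likelihood of
    `endings[i]` given the context, and `label` is a string index into
    `endings`.
    """
    normalized = [
        ll / max(len(ending), 1)
        for ll, ending in zip(loglikelihoods, endings)
    ]
    predicted = max(range(len(endings)), key=normalized.__getitem__)
    return 1.0 if predicted == int(label) else 0.0


# Hypothetical examples, listed in dataset order, with invented log-likelihoods.
examples = [
    {
        "endings": ["pour the water.", "fly away quickly.", "sing.",
                    "close the door behind them."],
        "label": "0",
        "loglikelihoods": [-6.0, -17.0, -9.0, -28.0],
    },
    {
        # Raw log-likelihood favors the short ending; normalization does not.
        "endings": ["stop.", "keep stirring the pot."],
        "label": "1",
        "loglikelihoods": [-4.0, -11.0],
    },
    {
        "endings": ["sit down.", "run."],
        "label": "0",
        "loglikelihoods": [-12.0, -3.0],
    },
]

scores = [
    score_example(ex["loglikelihoods"], ex["endings"], ex["label"])
    for ex in examples
]
print(scores)
```

Because the list comprehension walks `examples` in order, `scores` keeps the dataset order that the evaluation function expects.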

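Background and challenges

Background

In natural language processing, commonsense reasoning is one of the core dimensions for assessing large language models. The HellaSwag dataset was created in 2019 by Rowan Zellers and colleagues to challenge models' contextual reasoning in sentence-completion tasks: by constructing adversarial distractors, it requires a model to choose the most plausible ending from several options. It quickly became an important benchmark for commonsense understanding and reasoning and advanced evaluation research on language models in complex semantic settings. tinyHellaswag, its condensed version, was released in 2024 by Felipe Maia Polo and colleagues. By selecting 100 samples, it gives researchers a lightweight tool for evaluating model performance efficiently, substantially lowering computational cost while retaining the core evaluation properties of the original dataset.

Current challenges

The challenge HellaSwag targets is that traditional language models often excel at surface-level pattern matching yet struggle to understand the physical commonsense and social-situational logic of everyday activities. By designing adversarial samples based on video descriptions, the dataset forces models to go beyond shallow statistical associations and perform deeper causal and spatiotemporal reasoning. In construction, the main difficulty was automatically generating distractors from large-scale video caption data that are both grammatically diverse and hard to reason about, while ensuring that the subtle differences between options separate genuine reasoning ability from memorization bias. Building tinyHellaswag, in turn, requires preserving the statistical representativeness and evaluation validity of the original dataset within a very small subset, which places higher demands on the robustness of the sampling strategy.

Common scenarios

Typical use

tinyHellaswag, as the condensed version of HellaSwag, is typically used to evaluate efficiently how well large language models complete sentences in context. With 100 selected data points, it lets researchers test a model's reasoning accuracy on descriptions of everyday activities quickly and with limited compute, providing a reliable first estimate of model performance.

Derived work

Work derived from this dataset mainly concerns the optimization and extension of evaluation frameworks. For example, the lm-evaluation-harness toolchain supports the dataset directly, enabling automated evaluation. The other tiny datasets in the tinyBenchmarks series follow the same methodology, forming a unified lightweight evaluation ecosystem and advancing standards for efficient evaluation.

Recent research on the dataset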