---
dataset_info:
  features:
  - name: ind
    dtype: int32
  - name: activity_label
    dtype: string
  - name: ctx_a
    dtype: string
  - name: ctx_b
    dtype: string
  - name: ctx
    dtype: string
  - name: endings
    sequence: string
  - name: source_id
    dtype: string
  - name: split
    dtype: string
  - name: split_type
    dtype: string
  - name: label
    dtype: string
  - name: input_formatted
    dtype: string
  splits:
  - name: train
    num_bytes: 160899446
    num_examples: 39905
  - name: test
    num_bytes: 40288101
    num_examples: 10003
  - name: validation
    num_bytes: 473652
    num_examples: 100
  download_size: 50109798
  dataset_size: 201661199
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
language:
- en
pretty_name: tinyHellaswag
size_categories:
- n<1K
multilinguality:
- monolingual
source_datasets:
- Rowan/hellaswag
language_bcp47:
- en-US
---
# tinyHellaswag
Welcome to tinyHellaswag! This dataset is a condensed version of the [hellaswag](https://huggingface.co/datasets/hellaswag) dataset: a subset of 100 data points selected from the original.
tinyHellaswag lets users efficiently estimate the performance of a large language model (LLM) on a much smaller dataset, saving computational resources while maintaining the essence of the hellaswag evaluation.
## Features
- **Compact Dataset:** With only 100 data points, tinyHellaswag provides a swift and efficient way to evaluate your LLM's performance against a benchmark set, maintaining the essence of the original hellaswag dataset.
- **Compatibility:** tinyHellaswag is compatible with evaluation using the [lm evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness/), but can also be integrated into your custom pipeline. See below for more details.
## Model Evaluation
_With lm-eval harness_
Users looking to evaluate a new model with tinyHellaswag can use the [lm evaluation harness (v0.4.1 or later)](https://github.com/EleutherAI/lm-evaluation-harness/).
To do so, you can directly run your evaluation harness with `--tasks=tinyHellaswag`:
```shell
lm_eval --model hf --model_args pretrained="<your-model>" --tasks=tinyHellaswag --batch_size=1
```
The lm evaluation harness directly reports the best accuracy estimate (IRT++), with no additional steps required.
_Without lm-eval harness_
Alternatively, tinyHellaswag can be integrated into any other pipeline by downloading the data via
```python
from datasets import load_dataset
tiny_data = load_dataset('tinyBenchmarks/tinyHellaswag')['validation']
```
Now, `tiny_data` contains the 100 subsampled data points with the same features as the original dataset, as well as an additional field, `input_formatted`, containing the preformatted data points.
The preformatted data points follow the formatting used in the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), including the respective in-context examples.
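As a rough illustration of how a custom pipeline might turn model outputs into a score vector, the sketch below scores every candidate ending of each example and records whether the top-ranked one matches the label. It rests on several assumptions that are not stated in this card: that `input_formatted` is the prompt to which each entry of `endings` is appended, that `label` stores the index of the correct ending as a string, and that the score vector holds one correctness value (1.0 or 0.0) per example. The names `build_score_vector`, `score_ending`, and `toy_scorer` are hypothetical, and the toy records are made up for illustration.

```python
import numpy as np


def build_score_vector(examples, score_ending):
    """Return one 0/1 correctness score per example, in the order given.

    `score_ending(context, ending)` is a hypothetical callable that returns
    a higher number for endings the model finds more plausible.
    """
    scores = []
    for ex in examples:
        ending_scores = [score_ending(ex["input_formatted"], e) for e in ex["endings"]]
        predicted = int(np.argmax(ending_scores))
        # Assumption: `label` holds the index of the correct ending as a string.
        scores.append(float(predicted == int(ex["label"])))
    return np.array(scores)


# Made-up records standing in for rows of `tiny_data`.
toy_examples = [
    {"input_formatted": "A man sits at a piano. He",
     "endings": ["plays a song.", "eats the piano."], "label": "0"},
    {"input_formatted": "She picks up a brush. She",
     "endings": ["flies away.", "paints the wall."], "label": "1"},
]


def toy_scorer(context, ending):
    # Stand-in scorer that simply prefers shorter endings.
    # Replace it with a score from your own model, such as a log-likelihood.
    return -len(ending)


y_toy = build_score_vector(toy_examples, toy_scorer)
```

With real data, `toy_examples` would be replaced by `tiny_data` and `toy_scorer` by a function that queries your model; iterating over the dataset in its stored order keeps the resulting vector aligned with tinyHellaswag.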
You can then estimate your LLM's performance using the following code. First, ensure you have the tinyBenchmarks package installed:
```shell
pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
```
Then, use the code snippet below for the evaluation:
```python
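import numpy as np
import tinyBenchmarks as tb

### Score vector
# Dummy data (random 0/1 values) so the snippet runs as written.
# Replace it with your model's scores on the 100 tinyHellaswag data points,
# kept in the original dataset order.
y = np.random.binomial(1, 0.5, 100)

### Parameters
benchmark = 'hellaswag'

### Evaluation
tb.evaluate(y, benchmark)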
```
This process will help you estimate the performance of your LLM against the tinyHellaswag dataset, providing a streamlined approach to benchmarking.
Please be aware that evaluating on multiple GPUs can change the order of outputs in the lm evaluation harness.
To use the tinyBenchmarks library, your score vector must follow the original order of the data points in tinyHellaswag.
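One way to restore the original order is sketched below, assuming each result carries an identifier that can be matched against the dataset (for example the `ind` feature). The record layout with `id` and `score` keys and the helper `order_scores` are hypothetical and do not reflect the output format of the lm evaluation harness.

```python
import numpy as np


def order_scores(results, reference_ids):
    """Arrange per-example scores to match the order of `reference_ids`."""
    score_by_id = {r["id"]: r["score"] for r in results}
    missing = [i for i in reference_ids if i not in score_by_id]
    if missing:
        raise ValueError(f"missing scores for examples: {missing}")
    return np.array([score_by_id[i] for i in reference_ids], dtype=float)


# Results as they might come back from a multi-GPU run, out of order.
shuffled_results = [
    {"id": 2, "score": 1.0},
    {"id": 0, "score": 0.0},
    {"id": 1, "score": 1.0},
]
# In practice this could be the identifier column of the dataset, e.g. tiny_data["ind"].
reference_ids = [0, 1, 2]

y_ordered = order_scores(shuffled_results, reference_ids)
```

Raising an error on missing identifiers, rather than silently skipping them, keeps the vector at the expected length before it is handed to `tb.evaluate`.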
For more detailed instructions on evaluating new models and computing scores, please refer to the comprehensive guides available at [lm evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness/) and [tinyBenchmarks GitHub](https://github.com/felipemaiapolo/tinyBenchmarks).
Happy benchmarking!
## More tinyBenchmarks
**Open LLM leaderboard**:
[tiny MMLU](https://huggingface.co/datasets/tinyBenchmarks/tinyMMLU),
[tiny Arc-Challenge](https://huggingface.co/datasets/tinyBenchmarks/tinyAI2_arc),
[tiny Winogrande](https://huggingface.co/datasets/tinyBenchmarks/tinyWinogrande),
[tiny TruthfulQA](https://huggingface.co/datasets/tinyBenchmarks/tinyTruthfulQA),
[tiny GSM8k](https://huggingface.co/datasets/tinyBenchmarks/tinyGSM8k)
**AlpacaEval**:
[tiny AlpacaEval](https://huggingface.co/datasets/tinyBenchmarks/tinyAlpacaEval)
**HELM-lite**:
_work-in-progress_
## Citation
@article{polo2024tinybenchmarks,
title={tinyBenchmarks: evaluating LLMs with fewer examples},
author={Felipe Maia Polo and Lucas Weber and Leshem Choshen and Yuekai Sun and Gongjun Xu and Mikhail Yurochkin},
year={2024},
eprint={2402.14992},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{zellers2019hellaswag,
title={HellaSwag: Can a Machine Really Finish Your Sentence?},
author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
booktitle ={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
year={2019}
}