goodbadgreedy/GoodBadGreedy
Source: Hugging Face. Updated 2024-07-17. Indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/goodbadgreedy/GoodBadGreedy
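Description:
This dataset is used to evaluate large language models (LLMs) under non-deterministic generation. The motivation is that current LLM evaluations often overlook non-determinism and typically consider only a single output per example. The research questions include the performance difference between greedy decoding and sampling, how consistent benchmarks are with respect to non-determinism, and whether individual models show distinctive behavior. The main findings include a notable performance gap between greedy decoding and sampling-based generation, greedy decoding outperforming sampling on most benchmarks, and mathematical reasoning and code generation being the tasks most affected by sampling variance. Seven benchmarks are used for evaluation: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.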
This dataset contains the results of a study on the non-determinism of large language models (LLMs), in particular the performance difference between greedy decoding and sampling-based generation. The data covers seven benchmarks (AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval), each listed with its number of instances, number of samples, and evaluation metric. The study shows that greedy decoding outperforms sampling on most benchmarks, with AlpacaEval as the exception. The research also explores how temperature and repetition penalty affect LLM performance, and whether 7B-level LMs can outperform GPT-4-Turbo in the best-of-N sampling setting.
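As a rough illustration of the quantities the description refers to (greedy score, the mean and spread across sampling runs, and best-of-N), the sketch below computes them from per-instance scores. The function name `summarize`, the input layout, and the toy numbers are assumptions made for this example; they are not the dataset's actual schema or results.

```python
from statistics import mean, pstdev


def summarize(greedy_scores, sampled_scores):
    """Compare greedy decoding with N sampled generations per instance.

    greedy_scores:  one score per instance (hypothetical layout).
    sampled_scores: one list of N scores per instance (hypothetical layout).
    """
    n = len(sampled_scores[0])
    # Treat the k-th sample of every instance as one full sampling run.
    run_scores = [mean(inst[k] for inst in sampled_scores) for k in range(n)]
    return {
        "greedy": mean(greedy_scores),
        "sample_mean": mean(run_scores),
        "sample_std": pstdev(run_scores),  # spread across sampling runs
        # Oracle selection: keep the best (or worst) sample per instance.
        "best_of_n": mean(max(inst) for inst in sampled_scores),
        "worst_of_n": mean(min(inst) for inst in sampled_scores),
    }


# Toy pass/fail scores, for illustration only (not taken from the dataset).
greedy = [1.0, 0.0, 1.0, 1.0]
sampled = [
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
]
result = summarize(greedy, sampled)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Note that best-of-N as computed here assumes an oracle that always picks the highest-scoring sample, so it is an upper bound on what a practical reranker could achieve.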
Provider:
goodbadgreedy



