UBENCH

Name: UBENCH
Creator: School of Software, Nankai University
Published: 2024-06-19 00:50:38
License: Not specified

arXiv | Updated: 2024-06-19 | Indexed: 2024-06-20

Download link:

https://github.com/Cyno2232/UBENCH

Resource overview:

UBENCH is a comprehensive benchmark created by the School of Software, Nankai University, to evaluate the reliability of large language models (LLMs). It contains 3,978 multiple-choice questions spanning four main domains (knowledge, language, understanding, and reasoning) and uses them to assess LLM performance across different tasks. The data is drawn from multiple public datasets and has undergone dedicated processing and strict quality control to keep the evaluation accurate. The benchmark applies to a wide range of open-source and closed-source models, with particular emphasis on efficient inference and scalability. Its applications include, but are not limited to, model evaluation and improvement, and it targets the uncertainty and reliability problems that LLMs exhibit in real-world use.

Provider:

School of Software, Nankai University

Created:

2024-06-19

Summary

Dataset introduction

Construction

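The UBENCH dataset consists of 3,978 carefully designed multiple-choice questions in four main categories: knowledge, language, understanding, and reasoning. To build it, samples were drawn at random from multiple public datasets, converted to a common format, and quality-checked. Every sample was reviewed by two authors, with a third author brought in when needed to reach consensus. For source datasets that do not provide incorrect answers, GPT-4 was used to generate wrong answers similar to the correct one, to keep the dataset comprehensive and accurate.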
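The sampling and format-conversion steps can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Item` class, the function names, and the fixed seed are all hypothetical, and the distractors are passed in as plain strings rather than generated by GPT-4.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not UBENCH's own)."""
    question: str
    options: list[str]
    answer_index: int


def sample_records(records, k, seed=0):
    """Randomly draw up to k records from one source dataset."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def to_multiple_choice(question, correct, distractors, rng):
    """Combine the correct answer with distractors and shuffle the options."""
    # Drop distractors that duplicate the correct answer or each other.
    unique = []
    for d in distractors:
        if d != correct and d not in unique:
            unique.append(d)
    options = [correct, *unique]
    rng.shuffle(options)
    return Item(question, options, options.index(correct))


rng = random.Random(1)
item = to_multiple_choice("What is 2 + 2?", "4", ["3", "5", "4"], rng)
print(item.options, item.answer_index)
```

In a real pipeline the distractor list would come from the source dataset or from a model call, and each resulting item would still go through the manual two-reviewer check described above.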

Usage

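UBENCH has been used to evaluate the reliability of 15 mainstream LLMs, covering both open-source and closed-source models. The evaluation uses four metrics: Expected Calibration Error (ECE), Average Calibration Error (ACE), Maximum Calibration Error (MCE), and Thresholded Average Calibration Error (TACE). According to the reported experiments, UBENCH performs well in most settings; in the reliability evaluation, GLM4 stood out most, followed closely by GPT-4. UBENCH also explores how chain-of-thought (CoT) prompting, role-playing prompts, option order, and the temperature parameter affect LLM reliability.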
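To make the calibration metrics concrete, here is a minimal sketch of ECE, ACE, and MCE computed over equal-width confidence bins. It assumes the common textbook definitions (ECE as the sample-weighted mean gap between accuracy and confidence per bin, ACE as the unweighted mean over non-empty bins, MCE as the largest gap); the exact binning and the TACE threshold used by UBENCH are not given here, so TACE is omitted. The function name and the 10-bin default are illustrative choices.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, ACE, MCE) using equal-width bins over [0, 1].

    confidences: model confidence per answer, each in [0, 1]
    correct:     booleans, True where the answer was right
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    gaps, weights = [], []
    for b in bins:
        if not b:
            continue  # empty bins contribute to none of the metrics
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        gaps.append(abs(accuracy - avg_conf))
        weights.append(len(b) / n)

    ece = sum(w * g for w, g in zip(weights, gaps))  # weighted by bin size
    ace = sum(gaps) / len(gaps)                      # unweighted mean
    mce = max(gaps)                                  # worst bin
    return ece, ace, mce


ece, ace, mce = calibration_errors(
    [0.95, 0.95, 0.95, 0.65], [True, True, True, False]
)
print(round(ece, 4), round(ace, 4), round(mce, 4))
```

Lower values indicate better calibration on all three metrics. ECE can hide a badly calibrated but sparsely populated bin, which is why the unweighted ACE and the worst-case MCE are reported alongside it.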

Background and challenges

Background overview

UBENCH, introduced by researchers from Nankai University and Tianjin University of Science and Technology, is a comprehensive benchmark designed to evaluate the reliability of large language models (LLMs) through multiple-choice questions. It was developed in response to the limitations of existing benchmarks, which primarily assess problem-solving ability and neglect the uncertainty of responses. UBENCH includes 3,978 questions covering knowledge, language, understanding, and reasoning. The dataset aims to provide a systematic, automated evaluation framework that needs far less computation than traditional methods requiring multiple samplings. UBENCH has been used to assess the reliability of 15 popular LLMs, and the results highlight the need to incorporate uncertainty estimation into LLM evaluations.

Current challenges

The primary challenge addressed by UBENCH is the lack of comprehensive evaluation systems that consider the uncertainty of LLM responses, which can lead to unreliability and potential harm. Traditional uncertainty estimation methods are resource-intensive and often incompatible with black-box models. UBENCH addresses these challenges by requiring only a single sampling instance, thereby reducing computational costs while maintaining evaluation fidelity. Additionally, the benchmark faces challenges in ensuring the quality and diversity of its dataset, as well as in adapting to both open-source and closed-source models. The ongoing challenge is to continuously refine the benchmark to keep pace with the rapid advancements in LLM technology and to expand its scope to include multimodal scenarios and other potential factors affecting LLM reliability.

Common scenarios

Typical use cases

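The typical use of the UBENCH dataset is to evaluate the uncertainty of large language models (LLMs) on multiple-choice tasks. With 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning, UBENCH provides a comprehensive benchmark for testing LLM reliability in different contexts.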

Academic problems addressed

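UBENCH addresses a common problem the research community faces when evaluating LLM reliability: how to quantify and assess the uncertainty of model outputs. Traditional benchmarks focus on a model's problem-solving ability and ignore the uncertainty of its answers, which can leave unreliable behavior undetected. By introducing uncertainty evaluation, UBENCH gives academic research a new perspective and tool, supporting a more complete understanding and improvement of LLM performance.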

Practical applications

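In practice, UBENCH can help developers and researchers identify and improve how LLMs handle uncertainty on specific tasks. Model reliability is critical in high-risk domains such as medical diagnosis, legal consultation, and financial forecasting. Using UBENCH, models can be better tuned and optimized to improve their accuracy and trustworthiness in real-world applications.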

Recent research on the dataset