Reflection-Bench

arXiv, indexed 2025-09-30

Download link:

https://github.com/YabYum/ReflectionBench

Resource overview:

This dataset is a comprehensive benchmark covering seven core cognitive functions critical to reflection: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. It also includes a set of mental tasks specifically designed to probe the reflective capacities of AI models. The benchmark has been evaluated on 13 well-known large language models, with the aim of assessing how LLMs perform on reflection-related cognitive tasks.

Collected summary

Dataset introduction

Background and challenges

Background overview

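Reflection-Bench is an open-source benchmark dataset inspired by cognitive psychology. It is designed to systematically evaluate the cognitive agency of large language models acting as autonomous agents. It covers seven core cognitive dimensions, including prediction, decision-making, and perception, and contains 7 parameterized test tasks. The dataset provides a complete evaluation pipeline and configuration and has been used to test a range of mainstream LLMs. The results show that models generally perform poorly on the meta-reflection task.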

The content above was collected and summarized by 遇见数据集.
