Reflection-Bench

Indexed from arXiv on 2025-09-30

Download link:

https://github.com/YabYum/ReflectionBench

Resource description:

Reflection-Bench is a comprehensive benchmark covering seven tasks that target core cognitive functions critical to reflection: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. It also includes a set of tasks specifically designed to probe the reflective capacities of AI models. The benchmark has been evaluated on 13 well-known large language models, with the aim of assessing how large language models perform on reflection-related cognitive tasks.
