AR-Bench
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/tmlr-group/AR-Bench
Description:
AR-Bench is a benchmark dataset for evaluating the active reasoning capabilities of large language models (LLMs). It comprises three task types: detective cases, situation puzzles, and number guessing games. The benchmark measures performance on commonsense, logical, and symbolic reasoning challenges, and highlights the difficulties LLMs face in active reasoning scenarios. The evaluation task is named "Active Reasoning Evaluation".
Provided by:
TMLR Group



