AR-Bench

Name: AR-Bench
Creator: TMLR Group
License: Not specified

Indexed from arXiv on 2025-09-30

Download link:

https://github.com/tmlr-group/AR-Bench

Overview:

AR-Bench evaluates the active reasoning capabilities of large language models (LLMs) through three task types: detective cases, situation puzzles, and number guessing. The benchmark measures performance on commonsense, logical, and symbolic reasoning challenges, and it highlights the difficulties LLMs face in active reasoning scenarios. The evaluation task is named "Active Reasoning Evaluation".
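To make the number guessing task type more concrete, the following is a minimal sketch of a guess-and-feedback loop. It is not taken from the AR-Bench repository: the feedback rule (counts of correctly placed and misplaced digits), the function names `feedback` and `play`, and the `max_turns` limit are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def feedback(secret: str, guess: str) -> Tuple[int, int]:
    """Hypothetical feedback rule: (digits in the right position,
    digits present in the secret but in a different position)."""
    exact = sum(s == g for s, g in zip(secret, guess))
    # Digits shared regardless of position, counted with multiplicity.
    shared = sum((Counter(secret) & Counter(guess)).values())
    return exact, shared - exact


def play(secret: str, guesses: List[str], max_turns: int = 25) -> Optional[int]:
    """Return the 1-based turn on which the secret is guessed, or None.

    `guesses` stands in for the sequence of guesses a model would
    produce; a real evaluation would query the model after each turn
    and pass the feedback back to it.
    """
    for turn, guess in enumerate(guesses[:max_turns], start=1):
        exact, _misplaced = feedback(secret, guess)
        if exact == len(secret):
            return turn
    return None
```

In this sketch a model that reasons actively would use each `(exact, misplaced)` pair to narrow the set of candidate numbers before choosing its next guess, which is the behavior a benchmark of this kind is meant to probe.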

Provided by:

TMLR Group
