salem-mbzuai/LLM-BabyBench
Source: Hugging Face | Updated: 2025-05-17 | Indexed: 2025-11-01
Download link:
https://hf-mirror.com/datasets/salem-mbzuai/LLM-BabyBench
Description:
LLM-BabyBench is a benchmark suite designed to evaluate Large Language Models (LLMs) on grounded planning and reasoning tasks. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this benchmark assesses the capabilities of LLMs to plan and reason within the constraints of interactive environments. LLM-BabyBench specifically evaluates three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state, (2) generating sequences of low-level actions to achieve specified subgoals, and (3) decomposing high-level missions into coherent subgoal sequences.
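To make the first aspect (predicting the consequences of actions on the environment state) more concrete, here is a minimal toy sketch in Python. It is not the benchmark's actual data format or evaluation code: the state tuple, the direction convention, the `walls` set, and the function names `step` and `predict` are all assumptions made for illustration. Only the action names `left`, `right`, and `forward` are borrowed from the BabyAI action set.

```python
from typing import Iterable, Set, Tuple

# Direction vectors indexed 0..3: east, south, west, north (y grows downward).
# This indexing convention is an assumption of this sketch.
DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]

State = Tuple[int, int, int]  # (x, y, direction index)
Cell = Tuple[int, int]


def step(state: State, action: str, walls: Set[Cell]) -> State:
    """Return the state after one action; moving into a wall is a no-op."""
    x, y, d = state
    if action == "left":
        return (x, y, (d - 1) % 4)  # turn counter-clockwise
    if action == "right":
        return (x, y, (d + 1) % 4)  # turn clockwise
    if action == "forward":
        dx, dy = DIRS[d]
        nxt = (x + dx, y + dy)
        return state if nxt in walls else (nxt[0], nxt[1], d)
    raise ValueError(f"unknown action: {action}")


def predict(state: State, actions: Iterable[str], walls: Set[Cell]) -> State:
    """Apply a sequence of low-level actions and return the final state."""
    for action in actions:
        state = step(state, action, walls)
    return state
```

For example, `predict((1, 1, 0), ["forward", "right", "forward"], set())` moves the agent one cell east, turns it to face south, and moves it one cell south. A prediction task in this spirit would ask a model for that final state given only a textual description of the start state and the actions.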
Provided by:
salem-mbzuai



