Cap-alfaMike/eval-agent-lab-benchmark

Name: Cap-alfaMike/eval-agent-lab-benchmark
Creator: Cap-alfaMike
Published: 2026-04-23 01:41:22
License: No description available

Hugging Face | Updated 2026-04-23 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/Cap-alfaMike/eval-agent-lab-benchmark

Overview:

EvalAgentLab Benchmark v2.0 is a curated benchmark dataset for evaluating LLM outputs and agentic workflows along three axes: correctness, skill adherence, and execution efficiency. It evaluates not only what a model answers but also how it arrives at the answer. Each record includes fields such as the input query, expected output, acceptable outputs, expected tools, tool strategy, and maximum number of steps. The dataset contains two subsets: core_evaluation_suite (15 items covering categories such as knowledge, reasoning, computation, and tool_use) and tool_selection_benchmark (5 items covering categories such as computation, search, and retrieval).
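
To make the three axes concrete, here is a minimal Python sketch of how one record might be scored. It is an illustration under stated assumptions, not the benchmark's actual schema or scoring code: the field names (input_query, expected_output, acceptable_outputs, expected_tools, tool_strategy, max_steps) are guessed from the field list in the overview, and the BenchmarkItem class, the score_run function, and the pass/fail rules are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BenchmarkItem:
    # Hypothetical record layout: names are inferred from the overview,
    # not taken from the dataset's real column names.
    input_query: str
    expected_output: str
    acceptable_outputs: list[str] = field(default_factory=list)
    expected_tools: list[str] = field(default_factory=list)
    tool_strategy: str = ""  # stored only; its semantics are not described in the overview
    max_steps: int = 1


def _normalize(text: str) -> str:
    """Lowercase and trim so trivial formatting differences do not count as errors."""
    return text.strip().lower()


def score_run(item: BenchmarkItem, answer: str, tools_used: list[str], steps: int) -> dict[str, bool]:
    """Score one run on the three axes, using assumed pass/fail rules."""
    accepted = {_normalize(item.expected_output)}
    accepted.update(_normalize(alt) for alt in item.acceptable_outputs)
    return {
        # Correctness: the answer matches the expected or an acceptable output.
        "correctness": _normalize(answer) in accepted,
        # Skill adherence (assumed rule): exactly the expected tools were used.
        "skill_adherence": set(tools_used) == set(item.expected_tools),
        # Execution efficiency (assumed rule): the run stayed within the step budget.
        "execution_efficiency": steps <= item.max_steps,
    }


# Made-up example item; it is not drawn from the dataset.
item = BenchmarkItem(
    input_query="What is 17 * 23?",
    expected_output="391",
    acceptable_outputs=["391.0"],
    expected_tools=["calculator"],
    max_steps=2,
)
result = score_run(item, " 391.0 ", ["calculator"], steps=2)
print(result)
```

The sample item is invented for illustration. A real harness would likely read tool_strategy to decide whether tool order or optional tools matter, which the overview does not specify.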

Provider:

Cap-alfaMike
