APPBENCH

Name: APPBENCH
Creator: RuleGreen
License: Not specified

Indexed from arXiv: 2025-09-30

Download link:

https://github.com/ruleGreen/AppBench

Description:

AppBench is a benchmark for evaluating how well large language models (LLMs) plan and execute multiple APIs from different sources to complete user tasks. It targets challenges such as graph structures and permission restrictions. The dataset includes experimental results on multiple LLMs, reporting precision, recall, and F1 scores for App and API prediction. The task evaluates an LLM's ability to select the appropriate applications (Apps), select APIs, and fill in arguments when executing those APIs according to user instructions.
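As a rough illustration of the metrics named in the description, the sketch below computes precision, recall, and F1 for one example by comparing a predicted set of Apps (or APIs) against a gold set. It assumes exact-match, set-based scoring; the function name `prf1` and the App names are hypothetical, and the benchmark's own scoring script may differ.

```python
from typing import Iterable, Tuple


def prf1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 (assumed scoring, not the official one)."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)  # items both predicted and in the gold set
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical App names, for illustration only.
p, r, f = prf1(["Calendar", "Maps", "Mail"], ["Calendar", "Maps"])
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Here one of the three predicted Apps is not in the gold set, so precision drops while recall stays at 1. The same function can be applied to API names to score API prediction under the same assumption.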

Provider:

RuleGreen
