ICIP/LiveMCPBench
Source: Hugging Face | Updated: 2025-08-07 | Indexed: 2025-08-09
Download link:
https://hf-mirror.com/datasets/ICIP/LiveMCPBench
Description:
LiveMCPBench is the first comprehensive benchmark designed to evaluate LLM agents at scale across diverse Model Context Protocol (MCP) servers. It comprises 95 real-world tasks grounded in the MCP ecosystem, each requiring an agent to use tools effectively in everyday scenarios that are complex, tool-rich, and dynamic. To support scalable and reproducible evaluation, it is accompanied by LiveMCPTool, a collection of 70 MCP servers and 527 tools, and LiveMCPEval, an LLM-as-a-Judge framework for automated and adaptive evaluation. Together these provide a unified framework for benchmarking LLM agents in realistic MCP environments and a foundation for scalable, reproducible research on agent capabilities.
Provider:
ICIP



