limbic-eval-tool-use-mcp

Name: limbic-eval-tool-use-mcp
Creator: maas
Published: 2025-12-05 11:54:58
License: No description available

ModelScope (魔搭社区) · Updated 2025-12-05 · Indexed 2025-08-30

Download link:

https://modelscope.cn/datasets/quotientai/limbic-eval-tool-use-mcp

Resource description:

# Dataset Summary

The MCP Tool Call Evaluation Test Dataset is a synthetic dataset designed for evaluating and benchmarking language models' ability to correctly execute function calls in the context of Model Context Protocol (MCP) tools. This dataset contains 9,813 test examples that assess a model's proficiency in:

1. **Tool Selection**: Choosing the correct function from available tools
2. **Parameter Structure**: Providing all required parameters with correct names
3. **Parameter Values**: Supplying appropriate values that match expected data types and user intent

## Data Fields

- **available_tools**: List of available MCP tools with their schemas
- **message_history**: Conversation context leading up to the tool call, containing:
  - **user_request**: The original user query that triggered the tool call
  - **tool_call**: The actual tool call made by the model (may be correct or incorrect)
- **score**: Ground truth classification of the tool call quality
- **failure_reason**: Detailed explanation of what went wrong (if applicable)

## Dataset Structure

Each instance contains:

```json
{
  "available_tools": [
    {
      "name": "function_name",
      "description": "Function description",
      "input_schema": {
        "type": "object",
        "properties": {...},
        "required": [...]
      }
    }
  ],
  "message_history": [
    {
      "role": "user|assistant",
      "content": "Message content"
    }
  ],
  "score": "correct|incorrect_tool|incorrect_parameter_names|incorrect_parameter_values",
  "failure_reason": "Description of failure (if any)"
}
```

## Dataset Creation

### Curation Rationale

This dataset was created to address the need for standardized evaluation of language models' tool-calling capabilities in the context of MCP (Model Context Protocol). The synthetic nature allows for controlled testing scenarios and comprehensive coverage of various failure modes.

### Source Data

#### Initial Data Collection and Normalization

The dataset was synthetically generated using a combination of:

- Real MCP server definitions from the Smithery registry
- Automated tool call generation with intentional errors
- Manual validation and quality control

### Scores

Each example was automatically labeled based on predefined criteria:

- **correct**: Tool call matches available tools and parameters exactly and achieves user request
- **incorrect_tool**: Function name doesn't exist in available tools or incorrect function was chosen
- **incorrect_parameter_names**: Correct function was chosen but parameter names are wrong
- **incorrect_parameter_values**: Function and parameters are correct but values are inappropriate

```bibtex
@dataset{mcp_tool_call_eval_test,
  title={MCP Tool Call Evaluation Test Dataset},
  author={QuotientAI},
  year={2025},
  url={https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp}
}
```
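
The score definitions above can be partly approximated with a purely structural check. The following is a minimal sketch, not the dataset authors' labeling code. It assumes a tool call is a dict shaped like `{"name": ..., "arguments": {...}}`, which the card does not specify, and the `get_weather` tool and the helper name `structural_label` are hypothetical. It detects only tool names missing from `available_tools` and mismatched parameter names. Deciding whether a different existing tool should have been chosen, or whether values are inappropriate (`incorrect_parameter_values`), needs semantic judgment that this sketch does not attempt.

```python
from collections import Counter
from typing import Any

# Label strings taken from the dataset card's "Scores" section.
CORRECT = "correct"
INCORRECT_TOOL = "incorrect_tool"
INCORRECT_PARAMETER_NAMES = "incorrect_parameter_names"


def structural_label(available_tools: list[dict[str, Any]],
                     tool_call: dict[str, Any]) -> str:
    """Return a structural label for one tool call.

    Assumption: tool_call looks like {"name": str, "arguments": dict}.
    This shape is hypothetical; the dataset card does not define it.
    """
    tools_by_name = {tool["name"]: tool for tool in available_tools}
    tool = tools_by_name.get(tool_call.get("name"))
    if tool is None:
        # The called function is not among the available tools.
        return INCORRECT_TOOL

    schema = tool.get("input_schema", {})
    allowed = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    supplied = set(tool_call.get("arguments", {}))

    # Every required name must be supplied, and no unknown name may appear.
    if not required <= supplied or not supplied <= allowed:
        return INCORRECT_PARAMETER_NAMES

    # Structure is fine; value and intent errors are out of scope here.
    return CORRECT


# Hypothetical tool definition following the card's instance layout.
tools = [{
    "name": "get_weather",
    "description": "Look up the weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "units": {"type": "string"}},
        "required": ["city"],
    },
}]

calls = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_forecast", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"location": "Paris"}},
]

labels = [structural_label(tools, call) for call in calls]
label_counts = Counter(labels)
print(labels)
```

A check like this could serve as a cheap baseline to compare against the dataset's `score` field before using a model-based judge for the value-level category.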

Provider:

maas

Created:

2025-07-28

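Collected summary

Dataset introduction

Background and challenges

Background overview

This is a synthetic dataset named MCP Tool Call Evaluation Test Dataset, built specifically to evaluate how well language models execute function calls with Model Context Protocol (MCP) tools. It contains 9,813 test examples scored along three dimensions: tool selection, parameter structure, and parameter values. The data is organized as JSON and includes fields for the tool list, conversation history, score, and failure reason.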

The content above was collected and summarized by 遇见数据集.