Download link:

https://modelscope.cn/datasets/ibm-research/nestful

Resource overview:

# NESTFUL: Nested Function-Calling Dataset

<div>
<a width="150" style="display: inline-block" href="https://arxiv.org/abs/2409.03797v3"><img alt="Static Badge" src="https://img.shields.io/badge/arxiv-2409.03797v3-red?logo=arxiv"></a>
<a width="150" style="display: inline-block" href="https://github.com/IBM/NESTFUL"><img alt="Static Badge" src="https://img.shields.io/badge/GitHub-IBM/NESTFUL-blue?logo=github"></a>
</div>

NESTFUL is a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call.

The NESTFUL dataset includes over 1800 nested sequences from two main areas: mathematical reasoning and coding tools. The mathematical reasoning portion is generated from the [MathQA](https://huggingface.co/datasets/allenai/math_qa) dataset, while the coding portion is generated from the [StarCoder2-Instruct](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) dataset. All function calls in the dataset are executable. Please refer to the [paper](https://arxiv.org/abs/2409.03797v2) for more details.

<div style="text-align: center;">
<img src="./figures/nestful_end2end.png" alt="overview" width="720" style="margin: auto;">
</div>

## Data Structure

The dataset contains the following fields:

1. `sample_id (str)`: A unique ID for each sample in the dataset
2. `input (str)`: The user query that needs to be answered by the model using function calls
3. `tools (list[dict])`: A catalog of tools available to the model for the corresponding query
4. `output (list[dict])`: The ground truth sequence of functions to answer the user query
5. `gold_answer`: The final answer upon executing the ground truth function calls.
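
In the distributed files the `tools`, `output`, and `gold_answer` columns are stored as strings, so a record has to be decoded before use. The following is a minimal sketch of that step, not the authors' loader. The `raw_record` dict and the `decode_record` helper are hypothetical names introduced for illustration. The sketch swaps `ast.literal_eval` in for `eval` on `gold_answer`, assuming that field always holds a plain Python literal.

```python
import ast
import json

# Hypothetical raw row, shaped like the dataset fields listed above.
# `tools`, `output`, and `gold_answer` arrive as strings.
raw_record = {
    "sample_id": "4af7a62d-58fd-431f-a11f-eff486e10987",
    "input": "find the average of all the number between 6 and 34 which are divisible by 5.",
    "tools": json.dumps([
        {"name": "inverse", "description": "Return the inverse (reciprocal) of a number"}
    ]),
    "output": json.dumps([
        {"name": "add", "label": "$var_1", "arguments": {"arg_0": 6, "arg_1": 4}}
    ]),
    "gold_answer": "20.0",
}


def decode_record(record: dict) -> dict:
    """Return a copy of `record` with the string-encoded columns parsed."""
    decoded = dict(record)
    decoded["tools"] = json.loads(record["tools"])
    decoded["output"] = json.loads(record["output"])
    # Assumption: gold_answer is a Python literal (number, string, list, ...).
    # literal_eval rejects anything else instead of executing it.
    decoded["gold_answer"] = ast.literal_eval(record["gold_answer"])
    return decoded


record = decode_record(raw_record)
print(type(record["tools"]).__name__, record["output"][0]["name"], record["gold_answer"])
```

If some `gold_answer` values turn out not to be plain literals, `ast.literal_eval` raises an error instead of evaluating arbitrary code, which makes such rows easy to spot.
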
*Note: Columns `tools`, `output`, and `gold_answer` are formatted as strings, but they can be converted back to their original types using `json.loads` for `tools` and `output` and `eval` for the `gold_answer` field.*

**Executable Functions:** To get the executable functions, please go to the GitHub repo at: https://github.com/IBM/NESTFUL/tree/main/data_v2/executable_functions

## Data sample

In the example shown below (the tools list is truncated for brevity), each element of the `output` list is a function call. Each function call assigns a `label` to the output of that function, for example `"label": "$var_1"`. To refer to the output of a previous function in the current function call, the argument value is specified as `${label_name}.{variable_name}$`, for example: `"arg_1": "$var_2.result$"`.

<details>
<summary>Expand to see the data sample</summary>

```json
{
  "sample_id": "4af7a62d-58fd-431f-a11f-eff486e10987",
  "input": "find the average of all the number between 6 and 34 which are divisible by 5.",
  "tools": [
    {
      "name": "inverse",
      "description": "Return the inverse (reciprocal) of a number",
      "parameters": {
        "arg_0": {
          "description": "The number to inverse",
          "type": "int or float"
        }
      },
      "output_parameter": {
        "result": {
          "description": "The inverse result",
          "type": "int or float"
        }
      }
    },
    ...
  ],
  "output": [
    {
      "name": "add",
      "label": "$var_1",
      "arguments": { "arg_0": 6, "arg_1": 4 }
    },
    {
      "name": "subtract",
      "label": "$var_2",
      "arguments": { "arg_0": 34, "arg_1": 4 }
    },
    {
      "name": "add",
      "label": "$var_3",
      "arguments": { "arg_0": "$var_1.result$", "arg_1": "$var_2.result$" }
    },
    {
      "name": "divide",
      "label": "$var_4",
      "arguments": { "arg_0": "$var_3.result$", "arg_1": 2 }
    }
  ],
  "gold_answer": 20.0
}
```

</details>

## Benchmark results

We evaluated NESTFUL using 15 open-source models with sizes varying from 1B up to 405B parameters. We observe that the best function calling models have low performance numbers, indicating the complexity of the nested sequencing problem.
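
To make the nested sequencing concrete, here is a minimal sketch of how the gold sequence from the data sample could be executed by threading labelled outputs into later calls. It is an illustration under stated assumptions, not the benchmark's evaluation code. `FUNCTIONS`, `resolve`, and `execute` are hypothetical names, and the three lambdas are stand-ins for the executable functions in the GitHub repo. The sketch assumes each function returns a dict keyed by output parameter name (here `result`), mirroring the `output_parameter` entry of the tool specification.

```python
# Hypothetical stand-ins for the repo's executable functions.
# Assumption: each returns a dict keyed by output parameter name.
FUNCTIONS = {
    "add": lambda arg_0, arg_1: {"result": arg_0 + arg_1},
    "subtract": lambda arg_0, arg_1: {"result": arg_0 - arg_1},
    "divide": lambda arg_0, arg_1: {"result": arg_0 / arg_1},
}


def resolve(value, memory):
    """Turn a "$label.param$" reference into the stored output value."""
    is_reference = (
        isinstance(value, str)
        and value.startswith("$")
        and value.endswith("$")
        and "." in value
    )
    if not is_reference:
        return value  # literal argument, e.g. 6 or 2
    # "$var_1.result$" -> label "$var_1", output parameter "result"
    label, _, param = value[:-1].partition(".")
    return memory[label][param]


def execute(calls):
    """Run the calls in order and return the outputs of the last one."""
    memory = {}  # label -> dict of output parameters
    outputs = None
    for call in calls:
        args = {name: resolve(v, memory) for name, v in call["arguments"].items()}
        outputs = FUNCTIONS[call["name"]](**args)
        memory[call["label"]] = outputs
    return outputs


# The `output` list from the data sample.
gold_calls = [
    {"name": "add", "label": "$var_1", "arguments": {"arg_0": 6, "arg_1": 4}},
    {"name": "subtract", "label": "$var_2", "arguments": {"arg_0": 34, "arg_1": 4}},
    {"name": "add", "label": "$var_3",
     "arguments": {"arg_0": "$var_1.result$", "arg_1": "$var_2.result$"}},
    {"name": "divide", "label": "$var_4",
     "arguments": {"arg_0": "$var_3.result$", "arg_1": 2}},
]

final = execute(gold_calls)
print(final["result"])  # 20.0, matching gold_answer in the sample
```

A reference to a label that has not been assigned yet, or to an output parameter the function does not produce, raises a `KeyError` in `resolve`.
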
Common issues with the models include difficulty assigning variables, failing to utilize output parameter details from API specifications, and incorrectly passing variable names and output parameters to subsequent APIs.

<div style="text-align: center;">
<img src="./figures/nestful_results.png" alt="results" width="720" style="margin: auto;">
</div>

## Citation

```bibtex
@article{basu2024nestful,
  title={NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls},
  author={Basu, Kinjal and Abdelaziz, Ibrahim and Kate, Kiran and Agarwal, Mayank and Crouse, Maxwell and Rizk, Yara and Bradford, Kelsey and Munawar, Asim and Kumaravel, Sadhana and Goyal, Saurabh and others},
  journal={arXiv preprint arXiv:2409.03797},
  year={2024}
}
```


Application scenarios: