ToolHop
Source: ModelScope community (updated 2026-01-02, indexed 2025-09-13)
Download link:
https://modelscope.cn/datasets/bytedance-research/ToolHop
Description:
# [ACL 2025] ToolHop
## [ACL 2025] ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
> Data for the paper [ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use](https://arxiv.org/abs/2501.02506)
Junjie Ye
jjye23@m.fudan.edu.cn
Jan. 07, 2025
## Introduction
Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present *ToolHop*, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals differences in tool-use strategies across model families, offering actionable insights to guide the development of more effective approaches.
<p align="center">
<img src="figures/scheme.jpg" width="600"/>
</p>
## What's New
- **[2025/05/15]** The paper has been accepted to the ACL 2025 **Main** Conference.
- **[2025/01/07]** Released the data and code for ToolHop.
- **[2025/01/07]** Paper available on [arXiv](https://arxiv.org/abs/2501.02506).
## Main Results
We conduct a detailed analysis of 14 LLMs, covering five distinct families.
<p align="center">
<img src="figures/result.jpg" width="600"/>
</p>
## Usage
### Requirements
- Run the following command to install the required packages.
```bash
pip install -r requirements.txt
```
### Evaluation for Open-Source LLMs
- Run the following command to evaluate open-source LLMs. The LLaMA3.1 and Qwen2.5 families are currently supported.
```bash
cd code
python3 evaluation_open.py --scenario [Direct/Mandatory/Free] --series [llama31/qwen25] --model_path ${model_path} --output_file ${output_file}
```
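To cover all three scenarios for one model, the command above can be wrapped in a small driver. The following is a minimal sketch and not part of the ToolHop repository: the helper names `build_open_command` and `run_all_scenarios`, the output-file naming scheme (including the `.json` suffix), and the example model path are assumptions for illustration. Only the script name, the flags, and the scenario and series values come from the command above.

```python
import subprocess
from pathlib import Path

# Scenario and series values as listed in the command above.
SCENARIOS = ("Direct", "Mandatory", "Free")
OPEN_SERIES = ("llama31", "qwen25")


def build_open_command(scenario, series, model_path, output_file):
    """Build the argument list for evaluation_open.py (hypothetical helper)."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    if series not in OPEN_SERIES:
        raise ValueError(f"unknown series: {series!r}")
    return [
        "python3", "evaluation_open.py",
        "--scenario", scenario,
        "--series", series,
        "--model_path", str(model_path),
        "--output_file", str(output_file),
    ]


def run_all_scenarios(series, model_path, output_dir, dry_run=True):
    """Return one command per scenario; execute them when dry_run is False."""
    commands = []
    for scenario in SCENARIOS:
        # Output naming is an assumption; the README does not prescribe one.
        output_file = Path(output_dir) / f"{series}_{scenario.lower()}.json"
        cmd = build_open_command(scenario, series, model_path, output_file)
        commands.append(cmd)
        if not dry_run:
            # The README runs the script from inside the `code` directory,
            # so relative paths are resolved against that directory.
            subprocess.run(cmd, cwd="code", check=True)
    return commands


if __name__ == "__main__":
    # Placeholder model path; prints the commands without running them.
    for cmd in run_all_scenarios("qwen25", "/path/to/model", "outputs"):
        print(" ".join(cmd))
```

With `dry_run=True` (the default) the sketch only assembles the commands, which makes it easy to inspect them before starting a long evaluation run.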
### Evaluation for Closed-Source LLMs
- Run the following command to evaluate closed-source LLMs. The Gemini1.5, Claude3.5, and GPT families are currently supported.
```bash
cd code
python3 evaluation_closed.py --scenario [Direct/Mandatory/Free] --series [gemini15/claude35/gpt] --model ${model} --base_url ${base_url} --api_key ${api_key} --output_file ${output_file}
```
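Because the closed-source command takes an API key as an argument, it can be convenient to read the key from the environment and to mask it in anything that gets logged. The sketch below illustrates one way to do this under stated assumptions: the environment variable name `TOOLHOP_API_KEY` and the helpers `build_closed_command` and `redact` are hypothetical and do not come from the ToolHop repository. Only the script name, flags, and scenario and series values are taken from the command above.

```python
import os

# Scenario and series values as listed in the command above.
SCENARIOS = ("Direct", "Mandatory", "Free")
CLOSED_SERIES = ("gemini15", "claude35", "gpt")


def build_closed_command(scenario, series, model, base_url, output_file,
                         api_key_env="TOOLHOP_API_KEY"):
    """Build the argument list for evaluation_closed.py (hypothetical helper).

    The API key is read from an environment variable whose name is an
    assumption made for this sketch.
    """
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    if series not in CLOSED_SERIES:
        raise ValueError(f"unknown series: {series!r}")
    api_key = os.environ.get(api_key_env)
    if not api_key:
        raise RuntimeError(f"environment variable {api_key_env} is not set")
    return [
        "python3", "evaluation_closed.py",
        "--scenario", scenario,
        "--series", series,
        "--model", model,
        "--base_url", base_url,
        "--api_key", api_key,
        "--output_file", str(output_file),
    ]


def redact(cmd):
    """Return a copy of cmd with the value after --api_key masked for logging."""
    masked = list(cmd)
    if "--api_key" in masked:
        index = masked.index("--api_key")
        if index + 1 < len(masked):
            masked[index + 1] = "***"
    return masked
```

Note that the key is still passed on the command line, so it remains visible to other processes on the same machine while the evaluation runs; the sketch only keeps it out of scripts and logs.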
## License
The [code](code) is licensed under the [Apache License 2.0](LICENSE).
The [ToolHop](data) dataset is licensed under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) License.
## Acknowledgement
The dataset is built upon [MoreHopQA](https://huggingface.co/datasets/alabnii/morehopqa).
## Citation
If you find this project useful in your research, please cite:
```bibtex
@inproceedings{ToolHop,
author = {Junjie Ye and
Zhengyin Du and
Xuesong Yao and
Weijian Lin and
Yufei Xu and
Zehui Chen and
Zaiyuan Wang and
Sining Zhu and
Zhiheng Xi and
Siyu Yuan and
Tao Gui and
Qi Zhang and
Xuanjing Huang and
Jiecao Chen},
editor = {Wanxiang Che and
Joyce Nabende and
Ekaterina Shutova and
Mohammad Taher Pilehvar},
title = {ToolHop: {A} Query-Driven Benchmark for Evaluating Large Language
Models in Multi-Hop Tool Use},
booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), {ACL} 2025, Vienna, Austria,
July 27 - August 1, 2025},
pages = {2995--3021},
publisher = {Association for Computational Linguistics},
year = {2025},
url = {https://aclanthology.org/2025.acl-long.150/},
timestamp = {Thu, 24 Jul 2025 21:25:39 +0200},
biburl = {https://dblp.org/rec/conf/acl/YeDYLXCWZXYGZ0C25.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
Provider: maas
Created: 2025-08-25



