T-Eval

Name: T-Eval
Creator: maas
Published: 2026-01-06 16:15:07
License: No description available

ModelScope community; updated 2026-01-06; indexed 2024-06-08

Download link:

https://modelscope.cn/datasets/AI-ModelScope/T-Eval


Resource description:

# T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

[![arXiv](https://img.shields.io/badge/arXiv-2312.14033-b31b1b.svg)](https://arxiv.org/abs/2312.14033) [![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)

## ✨ Introduction

This is an evaluation harness for the benchmark described in [T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step](https://arxiv.org/abs/2312.14033).

[[Paper](https://arxiv.org/abs/2312.14033)] [[Project Page](https://open-compass.github.io/T-Eval/)] [[LeaderBoard](https://open-compass.github.io/T-Eval/leaderboard.html)] [[HuggingFace](https://huggingface.co/datasets/lovesnowbest/T-Eval)]

> Large language models (LLM) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and analyze the tool utilization capability of LLMs is still under-explored. In contrast to previous works that evaluate models holistically, we comprehensively decompose the tool utilization into multiple sub-processes, including instruction following, planning, reasoning, retrieval, understanding, and review. Based on that, we further introduce T-Eval to evaluate the tool-utilization capability step by step. T-Eval disentangles the tool utilization evaluation into several sub-domains along model capabilities, facilitating the inner understanding of both holistic and isolated competency of LLMs. We conduct extensive experiments on T-Eval and in-depth analysis of various LLMs. T-Eval not only exhibits consistency with the outcome-oriented evaluation but also provides a more fine-grained analysis of the capabilities of LLMs, providing a new perspective in LLM evaluation on tool-utilization ability.

<div>
<center>
<img src="figs/teaser.png">
</center>
</div>

## 🚀 What's New

- **[2024.02.18]** Release new [data](https://drive.google.com/file/d/1nQ0pn26qd0FGU8UkfSTxNdu6uWI0QXTY/view?usp=sharing) (both Chinese and English) and code for faster inference!🚀🚀🚀 The leaderboard will be updated soon! We also provide template examples for reference.
- **[2024.01.08]** Release [ZH Leaderboard](https://open-compass.github.io/T-Eval/leaderboard_zh.html) and ~~[ZH data](https://drive.google.com/file/d/1z25duwZAnBrPN5jYu9-8RMvfqnwPByKV/view?usp=sharing)~~, where the questions and answer formats are in Chinese. ✨✨✨
- **[2023.12.22]** Paper available on [ArXiv](https://arxiv.org/abs/2312.14033). 🔥🔥🔥
- **[2023.12.21]** Release the test scripts and data for T-Eval. 🎉🎉🎉

## 🧾 TODO

- [x] Change the role of function response from `system` to `function`.
- [x] Merge consecutive same role conversations.
- [x] Provide template configs for open-sourced models.
- [x] Provide dev set for T-Eval, reducing the evaluation time.
- [x] Optimize the inference pipeline of huggingface model provided by Lagent, which will be 3x faster. **(Please upgrade Lagent to v0.2)**
- [ ] Support inference on Opencompass.

~~NOTE: These TODOs will be started after 2024.2.1~~ Thanks for your patience!

## 🛠️ Preparations

```bash
$ git clone https://github.com/open-compass/T-Eval.git
$ cd T-Eval
$ pip install -r requirements.txt
```

## 🛫️ Get Started

We support both API-based models and HuggingFace models via [Lagent](https://github.com/InternLM/lagent).

### 💾 Test Data

The test data can be downloaded from either Google Drive or HuggingFace Datasets:

1. Google Drive

   ~~[[EN data](https://drive.google.com/file/d/1ebR6WCCbS9-u2x7mWpWy8wV_Gb6ltgpi/view?usp=sharing)] (English format) [[ZH data](https://drive.google.com/file/d/1z25duwZAnBrPN5jYu9-8RMvfqnwPByKV/view?usp=sharing)] (Chinese format)~~

   [T-Eval Data](https://drive.google.com/file/d/1nQ0pn26qd0FGU8UkfSTxNdu6uWI0QXTY/view?usp=sharing)

2. HuggingFace Datasets

   You can also access the dataset through HuggingFace via this [link](https://huggingface.co/datasets/lovesnowbest/T-Eval).

   ```python
   from datasets import load_dataset
   dataset = load_dataset("lovesnowbest/T-Eval")
   ```

After downloading, put the data files directly in the `data` folder:

```
- data/
  - instruct_v2.json
  - plan_json_v2.json
  ...
```

### 🤖 API Models

1. Set your OpenAI key in your environment.

   ```bash
   export OPENAI_API_KEY=xxxxxxxxx
   ```

2. Run the model with the following scripts.

   ```bash
   # test all data at once
   sh test_all_en.sh api gpt-4-1106-preview gpt4
   # test ZH dataset
   sh test_all_zh.sh api gpt-4-1106-preview gpt4
   # test for Instruct only
   python test.py --model_type api --model_path gpt-4-1106-preview --resume --out_name instruct_gpt4.json --out_dir work_dirs/gpt4/ --dataset_path data/instruct_v2.json --eval instruct --prompt_type json
   ```

### 🤗 HuggingFace Models

1. Download the HuggingFace model to a local path.
2. Modify the `meta_template` json according to the model you are testing.
3. Run the model with the following scripts.

   ```bash
   # test all data at once
   sh test_all_en.sh hf $HF_PATH $HF_MODEL_NAME $META_TEMPLATE
   # test ZH dataset
   sh test_all_zh.sh hf $HF_PATH $HF_MODEL_NAME $META_TEMPLATE
   # test for Instruct only
   python test.py --model_type hf --model_path $HF_PATH --resume --out_name instruct_$HF_MODEL_NAME.json --out_dir data/work_dirs/ --dataset_path data/instruct_v1.json --eval instruct --prompt_type json --model_display_name $HF_MODEL_NAME --meta_template $META_TEMPLATE
   ```

### 💫 Final Results

Once all test samples have been processed, detailed evaluation results are logged at `$out_dir/$model_display_name/$model_display_name_-1.json` (for the ZH dataset, the file name carries a `_zh` suffix).
To obtain your final score, run the following command:

```bash
python teval/utils/convert_results.py --result_path $out_dir/$model_display_name/$model_display_name_-1.json
```

## 🔌 Protocols

T-Eval evaluates models in a multi-turn conversation style. The saved prompts have the following format:

```python
[
    {
        "role": "system",
        "content": "You have access to the following API:\n{'name': 'AirbnbSearch.search_property_by_place', 'description': 'This function takes various parameters to search properties on Airbnb.', 'required_parameters': [{'name': 'place', 'type': 'STRING', 'description': 'The name of the destination.'}], 'optional_parameters': [], 'return_data': [{'name': 'property', 'description': 'a list of at most 3 properties, containing id, name, and address.'}]}\nPlease generate the response in the following format:\ngoal: goal to call this action\n\nname: api name to call\n\nargs: JSON format api args in ONLY one line\n"
    },
    {
        "role": "user",
        "content": "Call the function AirbnbSearch.search_property_by_place with the parameter as follows: 'place' is 'Berlin'."
    }
]
```

where `role` can be one of `['system', 'user', 'assistant']`, and `content` must be a string. Before an LLM can run inference on the prompt, it has to be converted into a raw string via a `meta_template`. Example `meta_template` definitions are provided in [meta_template.py](teval/utils/meta_template.py):

```python
[
    dict(role='system', begin='<|System|>:', end='\n'),
    dict(role='user', begin='<|User|>:', end='\n'),
    dict(
        role='assistant',
        begin='<|Bot|>:',
        end='<eoa>\n',
        generate=True)
]
```

Specify the `begin` and `end` tokens for the HuggingFace model you are testing in [meta_template.py](teval/utils/meta_template.py), then pass the `meta_template` argument to `test.py` using the same name you set in `meta_template.py`. For OpenAI models, this conversion is handled for you.
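To make the role of `meta_template` concrete, here is a minimal sketch of how a message list could be flattened into a raw prompt string. The helper name `render_prompt` is hypothetical, the message contents are shortened placeholders, and the logic (wrap each message in its role's `begin`/`end` tokens, then append the `begin` token of the role marked `generate=True`) is an assumption about the general approach; the actual conversion performed by T-Eval and Lagent may differ in detail.

```python
# Illustrative only: `render_prompt` is a hypothetical helper, not part of T-Eval or Lagent.
META_TEMPLATE = [
    dict(role='system', begin='<|System|>:', end='\n'),
    dict(role='user', begin='<|User|>:', end='\n'),
    dict(role='assistant', begin='<|Bot|>:', end='<eoa>\n', generate=True),
]


def render_prompt(messages, meta_template):
    """Flatten chat messages into one string using per-role begin/end tokens."""
    by_role = {item['role']: item for item in meta_template}
    parts = []
    for msg in messages:
        tpl = by_role[msg['role']]  # raises KeyError for a role the template lacks
        parts.append(f"{tpl['begin']}{msg['content']}{tpl['end']}")
    # Assumption: the prompt ends with the begin token of the generating role,
    # so the model continues writing as the assistant.
    for tpl in meta_template:
        if tpl.get('generate'):
            parts.append(tpl['begin'])
            break
    return ''.join(parts)


# Shortened placeholder messages, not real benchmark samples.
messages = [
    {"role": "system", "content": "You have access to the following API: (omitted)"},
    {"role": "user", "content": "Call the function with 'place' set to 'Berlin'."},
]
prompt = render_prompt(messages, META_TEMPLATE)
print(prompt)
```

With a template like this, switching to another model would only require changing the `begin` and `end` strings; the message list itself stays the same.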
## 📊 Benchmark Results

More detailed and comprehensive benchmark results are available on the 🏆 [T-Eval official leaderboard](https://open-compass.github.io/T-Eval/leaderboard.html)!

<div>
<center>
<img src="figs/teval_results.png">
</center>
</div>

### ✉️ Submit Your Results

You can submit your inference results (produced by running `test.py`) to this [email](mailto:lovesnow@mail.ustc.edu.cn). We will run your predictions and update the results on our leaderboard. Please also state the parameter scale of the model you tested. A submission should be structured like this:

```
$model_display_name/
    instruct_$model_display_name/
        query_0_1_0.json
        query_0_1_1.json
        ...
    plan_json_$model_display_name/
    plan_str_$model_display_name/
    ...
```

## ❤️ Acknowledgements

T-Eval is built with [Lagent](https://github.com/InternLM/lagent) and [OpenCompass](https://github.com/open-compass/opencompass). Thanks for their awesome work!

## 🖊️ Citation

If you find this project useful in your research, please consider citing:

```
@article{chen2023t,
  title={T-Eval: Evaluating the Tool Utilization Capability Step by Step},
  author={Chen, Zehui and Du, Weihua and Zhang, Wenwei and Liu, Kuikun and Liu, Jiangning and Zheng, Miao and Zhuo, Jingming and Zhang, Songyang and Lin, Dahua and Chen, Kai and others},
  journal={arXiv preprint arXiv:2312.14033},
  year={2023}
}
```

## 💳 License

This project is released under the Apache 2.0 [license](./LICENSE).


Provider:

maas

Created:

2024-06-03
