AgentIF

Name: AgentIF
Creator: maas
Published: 2025-10-31 16:39:38
License: No description available

ModelScope community. Updated 2025-10-31. Indexed 2025-07-19.

Download link:

https://modelscope.cn/datasets/THU-KEG/AgentIF

Resource description:

# AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

We introduce AgentIF, the first benchmark for systematically evaluating the instruction-following ability of LLMs in agentic scenarios. AgentIF has three key characteristics:

1. Realistic: constructed from 50 real-world agentic applications.
2. Long: instructions average 1,723 words, with a maximum of 15,630 words.
3. Complex: instructions average 11.9 constraints each and cover diverse constraint types, such as tool specifications and condition constraints.

To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and the corresponding evaluation metrics, which include code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation.

## How to evaluate

1. Specify the target model and the evaluator in the `run.sh` file. We recommend using `gpt-4o-2024-11-20` as the evaluator to reproduce our results.

```
Model_Name=""
Model_Name_URL=""
Model_Name_API_Key="EMPTY"
Evaluator_Model_Backbone=""
Evaluator_URL=""
Evaluator_API_Key=""
```

2. Then run the script to start the evaluation.

```
sh run.sh
```
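
Because every setting in `run.sh` ships empty except `Model_Name_API_Key`, it may help to check the file before starting a long evaluation. The following is a minimal sketch and not part of the AgentIF repository: the function name `check_config` is hypothetical, and it assumes the six settings appear in the file as plain `NAME="value"` assignments, as in the snippet above.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with AgentIF).
# Prints every setting that is missing or still empty in a run.sh-style file
# and returns a non-zero status if at least one was found.
check_config() {
    config_file="$1"
    missing=0
    for name in Model_Name Model_Name_URL Model_Name_API_Key \
                Evaluator_Model_Backbone Evaluator_URL Evaluator_API_Key; do
        # A setting counts as filled when its line reads NAME="<one or more characters>".
        if ! grep -Eq "^[[:space:]]*${name}=\"[^\"]+\"" "$config_file"; then
            echo "unset: ${name}"
            missing=1
        fi
    done
    return "$missing"
}
```

With the function defined in the current shell, `check_config run.sh && sh run.sh` starts the evaluation only when all six settings are filled in. The check only looks for non-empty values; it does not verify that a URL is reachable or that a key is valid.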

Provider:

maas

Created:

2025-07-15
