
agentboard

ModelScope community, updated 2025-11-17, indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/hkust-nlp/agentboard
Resource description:
<div align="center">
<img src="./assets/agentboard.png" style="width: 20%;height: 10%">
<h1> AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents </h1>
</div>

This is the official dataset repository of [AgentBoard](https://github.com/hkust-nlp/agentboard).

## 1. Data Overview

AgentBoard is composed of 9 diverse tasks, which can be divided into 4 types: **Embodied AI**, **Game**, **Web**, and **Tool**:

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td> <b>Embodied AI</b> </td>
<td> <b>Game</b> </td>
<td> <b>Web</b> </td>
<td> <b>Tool</b> </td>
</tr>
<tr valign="top">
<td>

- AlfWorld
- ScienceWorld
- BabyAI

</td>
<td>

- Jericho
- PDDL

</td>
<td>

- WebShop
- WebArena

</td>
<td>

- Tool-Query
- Tool-Operation

</td>
</tr>
</tbody>
</table>

Statistics of the evaluation data for the 9 environments are as follows:

| | AlfWorld | ScienceWorld | BabyAI | Jericho | PDDL | WebShop | WebArena | Tool-Query | Tool-Operation |
|-------|----------|--------------|--------|---------|------|---------|----------|------------|----------------|
| **\#Environment** | 134 | 90 | 112 | 20 | 60 | 251 | 245 | 60 | 40 |
| **\#Turn** | 6 | 15 | 10 | 20 | 20 | 3 | 25 | 5 | 6 |
| **\#Action Space** | 13 | 21 | 8 | 150 | 8 | 2 | 12 | 15 | 16 |
| **\#Context Length** | 900 | 2800 | 1800 | 1500 | 2700 | 1200 | 15000 | 2100 | 4300 |
| **Progress Rate** | subgoal | subgoal | subgoal | subgoal | match | match | match | subgoal | subgoal/match |
| **\#Avg. Subgoals** | 3 | 5 | 4 | 6 | 6 | 4 | 6 | 5 | 5 |
| **Hard/Easy Cutoff** | 3 | 3 | 3 | 4 | 6 | 1 | 4 | 4 | 4 |

To help researchers quickly understand the evaluation data of each task, we provide a **Dataset Viewer** at Huggingface Dataset: [🤗 AgentBoard](https://huggingface.co/datasets/hkust-nlp/agentboard/).

> Note: The data shown in the Dataset Viewer is incomplete, so please download the dataset from the link provided below.

## 2. Download Link

You can download the whole evaluation data by running the following command:

```shell
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
```

Please uncompress the file and move the data to `AgentBoard/data`.

```shell
cd AgentBoard
mkdir data
tar -zxvf data.tar.gz
```

The file structure of the evaluation data is as follows:

<details>
<summary>
Click to expand the file structure
</summary>

```
data
├── alfworld
│   ├── alfred.pddl # additional data for alfworld
│   ├── alfred.twl2 # additional data for alfworld
│   ├── json_2.1.1 # additional data for alfworld
│   └── test.jsonl
├── babyai
│   └── test.jsonl
├── jericho
│   ├── test.jsonl
│   └── z-machine-games-master # additional data for jericho
├── pddl
│   └── test.jsonl
├── scienceworld
│   └── test.jsonl
├── tool-operation
│   └── test.jsonl
├── tool-query
│   ├── academia # additional data for academia tool
│   └── test.jsonl
├── webarena
│   └── test.jsonl
└── webshop
    └── test.jsonl
```

</details>

## 3. Data Fields

We take an instance from the `ScienceWorld` task as an example to illustrate the data fields of the evaluation data.

```json
{
  "task": "scienceworld",
  "id": 0,
  "goal": "Your task is to find the animal with the longest life span. The animals are in the 'outside' location. Focus on the animal with the longest life span.",
  "subgoals": ["You move to the outside.", "You focus on the crocodile egg."],
  "difficulty": "easy",
  "additional_info": {"var": 5, "env_name": "lifespan-longest-lived"}
}
```

Details of the data fields are as follows:

| Field Name | Description |
|------------|-------------|
| `task` | The task name of the example, e.g. `alfworld`, `babyai`, `jericho`, `pddl`, `scienceworld`, `tool-operation`, `tool-query`, `webarena`, `webshop`. |
| `id` | The id of the example. |
| `goal` | The goal of the example. |
| `subgoals` | The subgoals of the example. |
| `difficulty` | The difficulty of the example, e.g. `easy`, `hard`. |
| `additional_info` | The additional information of the example; each example has its own additional information. |

## 4. Citation

```bibtex
@misc{ma2024agentboard,
  title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents},
  author={Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He},
  year={2024},
  eprint={2401.13178},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
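
As a companion to the data fields described in Section 3, the following is a minimal Python sketch of how the extracted files might be read. It is not part of the official AgentBoard tooling. It assumes that each `test.jsonl` holds one JSON object per line with the fields listed above, and that the archive was extracted to `AgentBoard/data`; the helper names `load_examples` and `difficulty_counts` are hypothetical and introduced only for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def load_examples(path):
    """Read one test.jsonl file, assuming one JSON object per non-empty line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                examples.append(json.loads(line))
    return examples


def difficulty_counts(examples):
    """Count examples per value of the `difficulty` field (e.g. easy, hard)."""
    return Counter(ex["difficulty"] for ex in examples)


if __name__ == "__main__":
    # Assumed location of the extracted data; adjust to your checkout.
    data_dir = Path("AgentBoard/data")
    for jsonl in sorted(data_dir.glob("*/test.jsonl")):
        examples = load_examples(jsonl)
        # Print the task directory name, example count, and difficulty split.
        print(jsonl.parent.name, len(examples), dict(difficulty_counts(examples)))
```

Run from the directory that contains `AgentBoard`, this prints one line per task directory. Comparing those counts against the **\#Environment** row of the statistics table is a quick way to check that the download is complete.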

Provider:
maas
Created:
2025-02-17