agentboard
Source: ModelScope community. Updated 2025-11-17; indexed 2025-02-22.
Download link: https://modelscope.cn/datasets/hkust-nlp/agentboard
<div align="center">
<img src="./assets/agentboard.png" style="width: 20%;height: 10%">
<h1> AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents </h1>
</div>
This is the official dataset repository of [AgentBoard](https://github.com/hkust-nlp/agentboard).
## 1. Data Overview
AgentBoard comprises 9 diverse tasks grouped into 4 categories (**Embodied AI**, **Game**, **Web**, and **Tool**):
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Embodied AI</b>
</td>
<td>
<b>Game</b>
</td>
<td>
<b>Web</b>
</td>
<td>
<b>Tool</b>
</td>
</tr>
<tr valign="top">
<td>
- AlfWorld
- ScienceWorld
- BabyAI
</td>
<td>
- Jericho
- PDDL
</td>
<td>
- WebShop
- WebArena
</td>
<td>
- Tool-Query
- Tool-Operation
</td>
</tr>
</tbody>
</table>
Statistics of the evaluation data for the 9 environments are as follows:
| | AlfWorld | ScienceWorld | BabyAI | Jericho | PDDL | WebShop | WebArena | Tool-Query | Tool-Operation |
|-------|----------|--------------|--------|---------|------|---------|----------|------------|----------------|
| **\#Environment** | 134 | 90 | 112 | 20 | 60 | 251 | 245 | 60 | 40 |
| **\#Turn** | 6 | 15 | 10 | 20 | 20 | 3 | 25 | 5 | 6 |
| **\#Action Space** | 13 | 21 | 8 | 150 | 8 | 2 | 12 | 15 | 16 |
| **\#Context Length** | 900 | 2800 | 1800 | 1500 | 2700 | 1200 | 15000 | 2100 | 4300 |
| **Progress Rate** | subgoal | subgoal | subgoal | subgoal | match | match | match | subgoal | subgoal/match |
| **\#Avg. Subgoals** | 3 | 5 | 4 | 6 | 6 | 4 | 6 | 5 | 5 |
| **Hard/Easy Cutoff** | 3 | 3 | 3 | 4 | 6 | 1 | 4 | 4 | 4 |
To help researchers quickly understand the evaluation data of each task, we provide a **Dataset Viewer** on Hugging Face Datasets: [🤗 AgentBoard](https://huggingface.co/datasets/hkust-nlp/agentboard/).
> Note: The data shown in the Dataset Viewer is incomplete. Please download the full dataset from the link below.
## 2. Download Link
You can download the full evaluation data by running the following command:
```shell
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
```
Please uncompress the file and move the data to `AgentBoard/data`.
```shell
cd AgentBoard
mkdir data
# data.tar.gz must be in the current directory (AgentBoard/) for this to work
tar -zxvf data.tar.gz
```
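For environments without `tar` on the command line, the extraction step can be sketched in Python with the standard-library `tarfile` module. The function name `extract_archive` is hypothetical, and the sketch assumes (without confirmation from this card) that the archive unpacks into a top-level `data/` directory; it returns the top-level entry names so that assumption can be checked.

```python
import tarfile
from pathlib import Path


def extract_archive(archive: Path, dest: Path) -> list[str]:
    """Extract a .tar.gz archive into `dest` and return its top-level entry names.

    Hypothetical helper, not part of AgentBoard.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    # Collapse member paths such as "data/webshop/test.jsonl" to "data".
    return sorted({name.split("/")[0] for name in names})


# Example call (assumes data.tar.gz was downloaded next to the AgentBoard checkout):
# extract_archive(Path("data.tar.gz"), Path("AgentBoard"))
```

If the returned list is not `["data"]`, move the extracted contents so that they end up under `AgentBoard/data`.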
The file structure of evaluation data is as follows:
<details>
<summary>
Click to expand the file structure
</summary>
```
data
├── alfworld
│   ├── alfred.pddl # additional data for alfworld
│   ├── alfred.twl2 # additional data for alfworld
│   ├── json_2.1.1 # additional data for alfworld
│   └── test.jsonl
├── babyai
│   └── test.jsonl
├── jericho
│   ├── test.jsonl
│   └── z-machine-games-master # additional data for jericho
├── pddl
│   └── test.jsonl
├── scienceworld
│   └── test.jsonl
├── tool-operation
│   └── test.jsonl
├── tool-query
│   ├── academia # additional data for academia tool
│   └── test.jsonl
├── webarena
│   └── test.jsonl
└── webshop
    └── test.jsonl
```
</details>
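After extraction, it can be useful to confirm that every task directory contains its `test.jsonl`. The following is a minimal sketch of such a check; the task names come from the file structure above, while the helper name `missing_test_files` is hypothetical and not part of AgentBoard.

```python
from pathlib import Path

# Task directory names taken from the file structure listed above.
TASKS = [
    "alfworld", "babyai", "jericho", "pddl", "scienceworld",
    "tool-operation", "tool-query", "webarena", "webshop",
]


def missing_test_files(data_root: Path) -> list[str]:
    """Return the tasks whose `test.jsonl` is absent under `data_root`."""
    return [
        task for task in TASKS
        if not (data_root / task / "test.jsonl").is_file()
    ]


# Example call (assumes the data was extracted to AgentBoard/data):
# print(missing_test_files(Path("AgentBoard/data")))
```

An empty list means all nine `test.jsonl` files were found. The check does not cover the additional per-task data such as `json_2.1.1` or `z-machine-games-master`.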
## 3. Data Fields
We take an instance from the `ScienceWorld` task as an example to illustrate the data fields of evaluation data.
```json
{
  "task": "scienceworld",
  "id": 0,
  "goal": "Your task is to find the animal with the longest life span. The animals are in the 'outside' location. Focus on the animal with the longest life span.",
  "subgoals": ["You move to the outside.", "You focus on the crocodile egg."],
  "difficulty": "easy",
  "additional_info": {"var": 5, "env_name": "lifespan-longest-lived"}
}
```
Details of the data fields are as follows:
| Field Name | Description |
|------------|-------------|
| `task` | The task name of the example, e.g. `alfworld`, `babyai`, `jericho`, `pddl`, `scienceworld`, `tool-operation`, `tool-query`, `webarena`, `webshop`. |
| `id` | The id of the example. |
| `goal` | The goal of the example. |
| `subgoals` | The subgoals of the example. |
| `difficulty` | The difficulty of the example, e.g. `easy`, `hard`. |
| `additional_info` | Additional information specific to the example; its contents differ from example to example. |
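The fields above can be read with the standard `json` module, one record per line. The sketch below parses JSONL lines and checks that each record carries the six documented fields. It assumes every record in every task includes all six, which this card does not guarantee; the names `REQUIRED_FIELDS` and `parse_examples` are hypothetical.

```python
import json

# Field names from the table above. Assumption: every record has all of them.
REQUIRED_FIELDS = {"task", "id", "goal", "subgoals", "difficulty", "additional_info"}


def parse_examples(lines):
    """Parse JSONL lines into dicts; raise ValueError if a documented field is missing."""
    examples = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        examples.append(record)
    return examples


# The ScienceWorld instance shown above, written as a single JSONL line.
sample = json.dumps({
    "task": "scienceworld",
    "id": 0,
    "goal": "Your task is to find the animal with the longest life span. "
            "The animals are in the 'outside' location. "
            "Focus on the animal with the longest life span.",
    "subgoals": ["You move to the outside.", "You focus on the crocodile egg."],
    "difficulty": "easy",
    "additional_info": {"var": 5, "env_name": "lifespan-longest-lived"},
})

examples = parse_examples([sample])
print(examples[0]["difficulty"], len(examples[0]["subgoals"]))  # easy 2
```

To read a real file, pass an open file handle, for example `parse_examples(open("data/scienceworld/test.jsonl", encoding="utf-8"))`.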
## 4. Citation
```bibtex
@misc{ma2024agentboard,
  title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents},
  author={Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He},
  year={2024},
  eprint={2401.13178},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Provider: maas
Created: 2025-02-17



