agent-data-collection
Source: ModelScope community (魔搭社区). Updated 2026-01-09; indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/neulab/agent-data-collection
Resource description:
# Agent Data Collection
A comprehensive collection of agent interaction datasets for training and evaluating AI agents across diverse domains and tasks.
This dataset aggregates high-quality agent trajectories from various environments including web browsing, code generation, household tasks, knowledge base querying, and software engineering.
The dataset is collected through methods described in [Agent Data Protocol](https://arxiv.org/abs/2510.24702).
## Dataset Splits
Each dataset configuration provides up to five splits, depending on availability:
### Split Types
| Split | Description | File Path |
|-------|-------------|-----------|
| **`raw`** | Original unprocessed agent trajectories | `{dataset}/full_raw.jsonl` |
| **`std`** | Standardized format with consistent structure | `{dataset}/full_std.jsonl` |
| **`sft_openhands`** | Converted to OpenHands agent finetuning format | `{dataset}/full_sft/full_sft_openhands.jsonl` |
| **`sft_sweagent`** | Converted to SWE-agent finetuning format | `{dataset}/full_sft/full_sft_sweagent.jsonl` |
| **`sft_agentlab`** | Converted to AgentLab finetuning format | `{dataset}/full_sft/full_sft_agentlab.jsonl` |
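The path convention in the table can be captured in a small lookup. This is only a minimal sketch: `SPLIT_FILES` and `split_path` are hypothetical names introduced for illustration and are not part of the collection's tooling; `swe-smith` is used because it is the dataset name that appears in this README's download example.

```python
# Hypothetical helper: maps a split name from the table to its file path.
SPLIT_FILES = {
    "raw": "full_raw.jsonl",
    "std": "full_std.jsonl",
    "sft_openhands": "full_sft/full_sft_openhands.jsonl",
    "sft_sweagent": "full_sft/full_sft_sweagent.jsonl",
    "sft_agentlab": "full_sft/full_sft_agentlab.jsonl",
}


def split_path(dataset: str, split: str) -> str:
    """Return the repository-relative path for one split of a dataset."""
    if split not in SPLIT_FILES:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(SPLIT_FILES)}"
        )
    return f"{dataset}/{SPLIT_FILES[split]}"


print(split_path("swe-smith", "std"))
```

Because not every dataset ships every split, a path built this way may still point at a file that does not exist in the repository.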
## Repository Structure
Each dataset in the collection follows a consistent structure:
```
dataset_name/
├── README.md # Dataset-specific documentation
├── LICENSE # Dataset-specific license information
├── full_raw.jsonl # Original raw data format
├── full_std.jsonl # ADP standardized format
└── full_sft/ # Agent-specific SFT formats
    ├── full_sft_openhands.jsonl # OpenHands agent format
    ├── full_sft_sweagent.jsonl # SWE-agent format
    └── full_sft_agentlab.jsonl # AgentLab format
```
### File Descriptions
- **`full_raw.jsonl`**: Contains the original dataset in its native format before any processing
- **`full_std.jsonl`**: Standardized format following ADP schema with unified action/observation structure
- **`full_sft/`**: Directory containing agent-specific training formats:
  - **`full_sft_openhands.jsonl`**: Formatted for [OpenHands](https://github.com/OpenHands/OpenHands) agent training
  - **`full_sft_sweagent.jsonl`**: Formatted for [SWE-agent](https://github.com/SWE-agent/SWE-agent) training
  - **`full_sft_agentlab.jsonl`**: Formatted for [AgentLab](https://github.com/ServiceNow/AgentLab) training
### Standardized Format (ADP Schema)
The standardized format (`full_std.jsonl`) follows the Agent Data Protocol schema. Each example contains:
```json
{
"id": "unique_identifier",
"content": [
{
"class_": "text_observation",
"content": "observation_text",
"name": null,
"source": "user"
},
{
"class_": "message_action",
"content": "agent_message",
"description": "optional_reasoning"
},
{
"class_": "api_action",
"function": "function_name",
"kwargs": {"param": "value"},
"description": "reasoning_for_action"
},
......
],
"details": {}
}
```
**Key Components:**
- **`id`**: Unique identifier for the interaction session
- **`content`**: Sequential list of actions and observations in the agent trajectory
- **`details`**: Additional metadata (typically empty)
Please check out the paper for more details.
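As a rough illustration of how the `class_` field can drive processing, the sketch below walks one standardized record and renders each step as a line of text. The record values and the `summarize` helper are made up for this example; only the field names come from the schema shown above, and the schema may define step classes beyond the three handled here.

```python
import json

# One record in the standardized layout (values are illustrative only).
record = json.loads("""
{
  "id": "example_0",
  "content": [
    {"class_": "text_observation", "content": "You see a desk.", "name": null, "source": "user"},
    {"class_": "api_action", "function": "go", "kwargs": {"location": "desk 1"}, "description": "Check the desk first."},
    {"class_": "message_action", "content": "Done.", "description": null}
  ],
  "details": {}
}
""")


def summarize(step: dict) -> str:
    """Render one trajectory step as a single line, keyed on its `class_`."""
    kind = step["class_"]
    if kind == "text_observation":
        return f"OBS[{step['source']}]: {step['content']}"
    if kind == "message_action":
        return f"MSG: {step['content']}"
    if kind == "api_action":
        args = ", ".join(f"{k}={v!r}" for k, v in step["kwargs"].items())
        return f"CALL: {step['function']}({args})"
    # Any step class not handled above is reported by name only.
    return f"OTHER: {kind}"


lines = [summarize(step) for step in record["content"]]
print("\n".join(lines))
```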
### SFT Format
The SFT (Supervised Fine-Tuning) format (`full_sft/*.jsonl`) is optimized for training and follows a conversational structure:
```json
{
"id": "unique_identifier",
"system": "system_prompt_defining_agent_behavior_and_available_functions",
"conversations": [
{
"from": "human",
"value": "user_request_or_environment_observation"
},
{
"from": "gpt",
"value": "agent_response_with_function_calls_in_xml_format"
},
......
]
}
```
**Key Components:**
- **`id`**: Same identifier as in standardized format
- **`system`**: Comprehensive system prompt
- **`conversations`**: Alternating human/gpt turns representing the full interaction
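Many chat-style training pipelines expect `role`/`content` messages rather than `from`/`value` turns. The sketch below shows one possible conversion, assuming the common mapping of `human` to `user` and `gpt` to `assistant`; that mapping, the helper names, and the sample record are assumptions for illustration, not something this dataset prescribes.

```python
# Assumed role mapping: the "from" values come from the SFT format above,
# the target role names are a common chat convention.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_messages(record: dict) -> list[dict]:
    """Flatten an SFT record into a list of {"role", "content"} messages."""
    messages = [{"role": "system", "content": record["system"]}]
    for turn in record["conversations"]:
        messages.append({"role": ROLE_MAP[turn["from"]], "content": turn["value"]})
    return messages


def alternates(record: dict) -> bool:
    """Return True if no two consecutive turns share the same speaker."""
    speakers = [turn["from"] for turn in record["conversations"]]
    return all(a != b for a, b in zip(speakers, speakers[1:]))


# Illustrative record, not taken from the dataset.
sample = {
    "id": "example_0",
    "system": "You are a helpful assistant.",
    "conversations": [
        {"from": "human", "value": "Find the laptop."},
        {"from": "gpt", "value": "Going to the desk."},
        {"from": "human", "value": "OBSERVATION:\nYou see a laptop 1."},
    ],
}

messages = to_messages(sample)
print([m["role"] for m in messages])
print(alternates(sample))
```

The `alternates` check is a cheap sanity test to run before training, since the format describes the turns as alternating.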
### Real Examples
**Standardized Format Example:**
```json
{
"id": "alfworld_0",
"content": [
{
"class_": "text_observation",
"content": "Here is your task. You are in the middle of a room. Looking quickly around you, you see a armchair 2, a armchair 1, a bed 1, a countertop 1, a diningtable 1, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a laundryhamper 1, and a sidetable 1.\nYour task is to: find two laptop and put them in bed.",
"name": null,
"source": "user"
},
{
"class_": "api_action",
"function": "go",
"kwargs": {"location": "\"diningtable 1\""},
"description": "I need to find two laptops and put them in bed. I'll begin by searching the areas that are most likely to have laptops."
},
{
"class_": "text_observation",
"content": "On the diningtable 1, you see a alarmclock 2, a bowl 2, a cd 2, a creditcard 2, a creditcard 1, a laptop 1, a pen 1, a pencil 2, and a remotecontrol 1.",
"name": null,
"source": "user"
}
],
"details": {}
}
```
**SFT Format Example:**
```json
{
"id": "alfworld_0",
"system": "You are a helpful assistant that can interact with a computer to solve tasks...",
"conversations": [
{
"from": "human",
"value": "Here is your task. You are in the middle of a room. Looking quickly around you, you see a armchair 2, a armchair 1, a bed 1, a countertop 1, a diningtable 1, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a laundryhamper 1, and a sidetable 1.\nYour task is to: find two laptop and put them in bed."
},
{
"from": "gpt",
"value": "<think>\nI need to find two laptops and put them in bed. I'll begin by searching the areas that are most likely to have laptops.\n</think>\n\n<function=bash>\n<parameter=command>\ngo(\"diningtable 1\")\n</parameter>\n</function>"
},
{
"from": "human",
"value": "OBSERVATION:\nOn the diningtable 1, you see a alarmclock 2, a bowl 2, a cd 2, a creditcard 2, a creditcard 1, a laptop 1, a pen 1, a pencil 2, and a remotecontrol 1."
}
]
}
```
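The `gpt` turn in the SFT example embeds the reasoning in `<think>` tags and the call in `<function=...>` / `<parameter=...>` tags. A minimal sketch of pulling those parts back out with regular expressions follows; the patterns are inferred from this one example only, `parse_turn` is a hypothetical helper, and other records or agent formats may need different handling.

```python
import re

# Patterns inferred from the single SFT example above; other records may differ.
THINK_RE = re.compile(r"<think>\n?(.*?)\n?</think>", re.DOTALL)
FUNC_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_turn(value: str) -> dict:
    """Split a gpt turn into its reasoning text and its function calls."""
    think = THINK_RE.search(value)
    calls = []
    for name, body in FUNC_RE.findall(value):
        params = {key: val for key, val in PARAM_RE.findall(body)}
        calls.append({"function": name, "parameters": params})
    return {"thought": think.group(1) if think else None, "calls": calls}


# The gpt turn from the alfworld_0 example.
turn = (
    "<think>\nI need to find two laptops and put them in bed. "
    "I'll begin by searching the areas that are most likely to have laptops."
    "\n</think>\n\n"
    "<function=bash>\n<parameter=command>\n"
    "go(\"diningtable 1\")\n</parameter>\n</function>"
)

parsed = parse_turn(turn)
print(parsed["thought"])
print(parsed["calls"])
```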
## Usage Examples
### Loading Supervised Finetuning (SFT) Files with `data_files`
Use the `data_files` parameter to load individual SFT files efficiently (downloads only the specified file):
```python
from datasets import load_dataset

# Placeholder values: replace with a dataset directory from the collection
# and the agent format you need (the split must exist for that dataset).
name = "swe-smith"
agent = "openhands"  # one of: openhands, sweagent, agentlab

# Load the agent-specific SFT format for the chosen dataset
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"{name}/full_sft/full_sft_{agent}.jsonl"
)

# e.g. load the OpenHands SFT format for the chosen dataset
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"{name}/full_sft/full_sft_openhands.jsonl"
)

# e.g. load the SWE-agent SFT format for the chosen dataset
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"{name}/full_sft/full_sft_sweagent.jsonl"
)

# e.g. load the AgentLab SFT format for the chosen dataset
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"{name}/full_sft/full_sft_agentlab.jsonl"
)
```
#### Loading Multiple SFT Files
You can also load multiple files at once:
```python
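from datasets import load_dataset

agent = "openhands"  # placeholder; one of: openhands, sweagent, agentlab

# Load the SFT files for this agent across all datasets
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"*/full_sft/full_sft_{agent}.jsonl"  # Glob pattern
)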
```
### Downloading and Loading RAW / STD / SFT Files
```python
import json
from huggingface_hub import hf_hub_download

FILES = [
    "full_raw.jsonl",
    "full_std.jsonl",
    "full_sft/full_sft_openhands.jsonl",
    "full_sft/full_sft_sweagent.jsonl",
    "full_sft/full_sft_agentlab.jsonl",
]

def download(dataset, local_dir=None):
    """Manually download the raw, std, and SFT files for `dataset`."""
    for f in FILES:
        try:
            hf_hub_download(
                "neulab/agent-data-collection",
                filename=f"{dataset}/{f}",
                repo_type="dataset",
                local_dir=local_dir,
            )
        except Exception:
            # Not every dataset provides every split; skip files that fail to download.
            continue

def load(file_path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(file_path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage
download("swe-smith", local_dir=".")
print(load("./swe-smith/full_std.jsonl")[0])
```
## Data Curation
The datasets in this collection were curated through a systematic three-stage pipeline:
1. **Raw Data Extraction**: Original datasets are extracted from various sources (research papers, existing repositories, synthetic generation) and saved in `{dataset}/full_raw.jsonl`.
2. **Standardization**: The raw data is converted to ADP's unified schema with standardized actions and observations and saved in `{dataset}/full_std.jsonl`.
3. **Agent-Specific Formatting**: The standardized data is transformed into training-ready formats for specific agent frameworks and saved in `{dataset}/full_sft/*`.
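To make the step from stage 2 to stage 3 concrete, here is a toy conversion from a standardized record to a conversational one. It is a simplified sketch, not the collection's actual converter: `std_to_conversations` is a hypothetical name, the way an `api_action` is rendered as text is an assumption, and real SFT files also carry an agent-specific `system` prompt that is omitted here.

```python
def std_to_conversations(record: dict) -> dict:
    """Toy stage-2 -> stage-3 conversion; NOT the collection's real converter."""
    turns = []
    for step in record["content"]:
        kind = step["class_"]
        if kind == "text_observation":
            turns.append({"from": "human", "value": step["content"]})
        elif kind == "message_action":
            turns.append({"from": "gpt", "value": step["content"]})
        elif kind == "api_action":
            args = ", ".join(f"{k}={v!r}" for k, v in step["kwargs"].items())
            thought = step.get("description") or ""
            # Assumed rendering: reasoning in <think> tags, then a plain call string.
            value = f"<think>\n{thought}\n</think>\n\n{step['function']}({args})"
            turns.append({"from": "gpt", "value": value})
        # Steps of any other class are skipped in this sketch.
    return {"id": record["id"], "conversations": turns}


# Illustrative standardized record, not taken from the dataset.
std = {
    "id": "toy_0",
    "content": [
        {"class_": "text_observation", "content": "Find the laptop.",
         "name": None, "source": "user"},
        {"class_": "api_action", "function": "go",
         "kwargs": {"location": "desk 1"}, "description": "Start at the desk."},
    ],
    "details": {},
}

sft = std_to_conversations(std)
print(sft["conversations"][1]["value"])
```

Keeping the `id` unchanged mirrors the format description, which states that SFT records reuse the identifier from the standardized format.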
## Licensing & Attribution
This dataset collection aggregates data from multiple sources. Each subdataset retains its original license.
Please refer to the `LICENSE` file in each dataset directory for specific licensing information.
The sources of the datasets are documented in `README.md` under each dataset's directory.
## Contact and Support
For questions, issues, or contributions:
- **GitHub Issues**: [agent-data-protocol/issues](https://github.com/neulab/agent-data-protocol/issues)
- **GitHub Discussions**: [agent-data-protocol/discussions](https://github.com/neulab/agent-data-protocol/discussions)
- **Paper Authors**: Contact information available in the paper
### Contributing
We welcome contributions to expand this collection! If you have high-quality agent interaction data that follows our format, please:
1. Ensure data quality and privacy compliance
2. Follow the standardized format
3. Include proper documentation and licensing
4. Submit a pull request with your dataset
## Citation
If you use this dataset collection in your research, please cite:
```bibtex
@article{song2025agent,
title={Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents},
author={Song, Yueqi and Ramaneti, Ketan and Sheikh, Zaid and Chen, Ziru and Gou, Boyu and Xie, Tianbao and Xu, Yiheng and Zhang, Danyang and Gandhi, Apurva and Yang, Fan and others},
journal={arXiv preprint arXiv:2510.24702},
year={2025}
}
```
---
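Provider: maas
Created: 2025-10-10

Background overview
This is a collection of high-quality agent interaction datasets covering web browsing, code generation, household tasks, and other domains. It provides raw, standardized, and agent-specific training splits, suitable for training and evaluating agents with frameworks such as OpenHands, SWE-agent, and AgentLab.
Summary compiled and generated by 遇见数据集.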



