agent-data-collection

Name: agent-data-collection
Creator: maas
Published: 2025-10-10T22:01:46+08:00

Source: ModelScope community; Updated: 2026-06-29; Indexed: 2025-11-08

Human-computer dialogue

Task automation

Data link:

https://modelscope.cn/datasets/neulab/agent-data-collection

Resource overview:

# Agent Data Collection

A comprehensive collection of agent interaction datasets for training and evaluating AI agents across diverse domains and tasks. This dataset aggregates high-quality agent trajectories from a range of environments, including web browsing, code generation, household tasks, knowledge base querying, and software engineering.

The data were collected using the methods described in [Agent Data Protocol](https://arxiv.org/abs/2510.24702).

## Dataset Splits

Each dataset configuration provides some or all of the following splits, depending on availability:

### Split Types

| Split | Description | File Path |
|-------|-------------|-----------|
| **`raw`** | Original unprocessed agent trajectories | `{dataset}/full_raw.jsonl` |
| **`std`** | Standardized format with consistent structure | `{dataset}/full_std.jsonl` |
| **`sft_openhands`** | Converted to OpenHands agent finetuning format | `{dataset}/full_sft/full_sft_openhands.jsonl` |
| **`sft_sweagent`** | Converted to SWE-agent finetuning format | `{dataset}/full_sft/full_sft_sweagent.jsonl` |
| **`sft_agentlab`** | Converted to AgentLab finetuning format | `{dataset}/full_sft/full_sft_agentlab.jsonl` |

## Repository Structure

Each dataset in the collection follows a consistent structure:

```
dataset_name/
├── README.md                      # Dataset-specific documentation
├── LICENSE                        # Dataset-specific license information
├── full_raw.jsonl                 # Original raw data format
├── full_std.jsonl                 # ADP standardized format
└── full_sft/                      # Agent-specific SFT formats
    ├── full_sft_openhands.jsonl   # OpenHands agent format
    ├── full_sft_sweagent.jsonl    # SWE-agent format
    └── full_sft_agentlab.jsonl    # AgentLab format
```

### File Descriptions

- **`full_raw.jsonl`**: Contains the original dataset in its native format, before any processing
- **`full_std.jsonl`**: Standardized format following the ADP schema, with a unified action/observation structure
- **`full_sft/`**: Directory containing agent-specific training formats:
  - **`full_sft_openhands.jsonl`**: Formatted for [OpenHands](https://github.com/OpenHands/OpenHands) agent training
  - **`full_sft_sweagent.jsonl`**: Formatted for [SWE-agent](https://github.com/SWE-agent/SWE-agent) training
  - **`full_sft_agentlab.jsonl`**: Formatted for [AgentLab](https://github.com/ServiceNow/AgentLab) training

### Standardized Format (ADP Schema)

The standardized format (`full_std.jsonl`) follows the Agent Data Protocol schema. Each example contains:

```json
{
  "id": "unique_identifier",
  "content": [
    {
      "class_": "text_observation",
      "content": "observation_text",
      "name": null,
      "source": "user"
    },
    {
      "class_": "message_action",
      "content": "agent_message",
      "description": "optional_reasoning"
    },
    {
      "class_": "api_action",
      "function": "function_name",
      "kwargs": {"param": "value"},
      "description": "reasoning_for_action"
    },
    ......
  ],
  "details": {}
}
```

**Key Components:**

- **`id`**: Unique identifier for the interaction session
- **`content`**: Sequential list of actions and observations in the agent trajectory
- **`details`**: Additional metadata (typically empty)

See the paper for more details.

### SFT Format

The SFT (Supervised Fine-Tuning) format (`full_sft/*.jsonl`) is optimized for training and follows a conversational structure:

```json
{
  "id": "unique_identifier",
  "system": "system_prompt_defining_agent_behavior_and_available_functions",
  "conversations": [
    {
      "from": "human",
      "value": "user_request_or_environment_observation"
    },
    {
      "from": "gpt",
      "value": "agent_response_with_function_calls_in_xml_format"
    },
    ......
  ]
}
```

**Key Components:**

- **`id`**: Same identifier as in the standardized format
- **`system`**: Comprehensive system prompt
- **`conversations`**: Alternating human/gpt turns representing the full interaction

### Real Examples

**Standardized Format Example:**

```json
{
  "id": "alfworld_0",
  "content": [
    {
      "class_": "text_observation",
      "content": "Here is your task. You are in the middle of a room. Looking quickly around you, you see a armchair 2, a armchair 1, a bed 1, a countertop 1, a diningtable 1, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a laundryhamper 1, and a sidetable 1.\nYour task is to: find two laptop and put them in bed.",
      "name": null,
      "source": "user"
    },
    {
      "class_": "api_action",
      "function": "go",
      "kwargs": {"location": "\"diningtable 1\""},
      "description": "I need to find two laptops and put them in bed. I'll begin by searching the areas that are most likely to have laptops."
    },
    {
      "class_": "text_observation",
      "content": "On the diningtable 1, you see a alarmclock 2, a bowl 2, a cd 2, a creditcard 2, a creditcard 1, a laptop 1, a pen 1, a pencil 2, and a remotecontrol 1.",
      "name": null,
      "source": "user"
    }
  ],
  "details": {}
}
```

**SFT Format Example:**

```json
{
  "id": "alfworld_0",
  "system": "You are a helpful assistant that can interact with a computer to solve tasks...",
  "conversations": [
    {
      "from": "human",
      "value": "Here is your task. You are in the middle of a room. Looking quickly around you, you see a armchair 2, a armchair 1, a bed 1, a countertop 1, a diningtable 1, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a laundryhamper 1, and a sidetable 1.\nYour task is to: find two laptop and put them in bed."
    },
    {
      "from": "gpt",
      "value": "<think>\nI need to find two laptops and put them in bed. I'll begin by searching the areas that are most likely to have laptops.\n</think>\n\n<function=bash>\n<parameter=command>\ngo(\"diningtable 1\")\n</parameter>\n</function>"
    },
    {
      "from": "human",
      "value": "OBSERVATION:\nOn the diningtable 1, you see a alarmclock 2, a bowl 2, a cd 2, a creditcard 2, a creditcard 1, a laptop 1, a pen 1, a pencil 2, and a remotecontrol 1."
    }
  ]
}
```

## Usage Examples

### Loading Supervised Finetuning (SFT) Files with `data_files`

Use the `data_files` parameter to load individual SFT files efficiently (only the specified file is downloaded):

```python
from datasets import load_dataset

# Substitute any dataset directory and agent format from the collection.
# Not every dataset provides every split, so a given combination may not exist.
dataset_name = "swe-smith"  # the dataset directory also used in the download example
agent = "openhands"         # one of: openhands, sweagent, agentlab

# Load the agent-specific SFT format for one dataset
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"{dataset_name}/full_sft/full_sft_{agent}.jsonl",
)

# e.g. load the OpenHands SFT format
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"{dataset_name}/full_sft/full_sft_openhands.jsonl",
)

# e.g. load the SWE-agent SFT format
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"{dataset_name}/full_sft/full_sft_sweagent.jsonl",
)

# e.g. load the AgentLab SFT format
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"{dataset_name}/full_sft/full_sft_agentlab.jsonl",
)
```

#### Loading Multiple SFT Files

You can also load multiple files at once:

```python
from datasets import load_dataset

agent = "openhands"  # one of: openhands, sweagent, agentlab

# Load the SFT files of every dataset for one agent format
dataset = load_dataset(
    "neulab/agent-data-collection",
    data_files=f"*/full_sft/full_sft_{agent}.jsonl",  # glob pattern
)
```

### Downloading and Loading RAW / STD / SFT Files

```python
import json
from huggingface_hub import hf_hub_download

FILES = [
    "full_raw.jsonl",
    "full_std.jsonl",
    "full_sft/full_sft_openhands.jsonl",
    "full_sft/full_sft_sweagent.jsonl",
    "full_sft/full_sft_agentlab.jsonl",
]

def download(dataset, local_dir=None):
    """Download the raw, std, and SFT files available for `dataset`."""
    for f in FILES:
        try:
            hf_hub_download(
                "neulab/agent-data-collection",
                filename=f"{dataset}/{f}",
                repo_type="dataset",
                local_dir=local_dir,
            )
        except Exception:
            # Skip any file that cannot be downloaded,
            # e.g. a split that this dataset does not provide.
            continue

def load(file_path):
    """Read a JSON Lines file into a list of dicts."""
    with open(file_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage
download("swe-smith", local_dir=".")
print(load("./swe-smith/full_std.jsonl")[0])
```

## Data Curation

The datasets in this collection were curated through a systematic three-stage pipeline:

1. **Raw Data Extraction**: Original datasets are extracted from various sources (research papers, existing repositories, synthetic generation) and saved in `{dataset}/full_raw.jsonl`.
2. **Standardization**: The raw data are converted to ADP's unified schema with standardized actions and observations, and saved in `{dataset}/full_std.jsonl`.
3. **Agent-Specific Formatting**: The standardized data are transformed into training-ready formats for specific agent frameworks, and saved in `{dataset}/full_sft/*`.

## Licensing & Attribution

This dataset collection aggregates data from multiple sources. Each subdataset retains its original license. Please refer to the `LICENSE` file in each dataset directory for specific licensing information. The sources of the datasets are documented in the `README.md` under each dataset's directory.

## Contact and Support

For questions, issues, or contributions:

- **GitHub Issues**: [agent-data-protocol/issues](https://github.com/neulab/agent-data-protocol/issues)
- **GitHub Discussions**: [agent-data-protocol/discussions](https://github.com/neulab/agent-data-protocol/discussions)
- **Paper Authors**: Contact information is available in the paper

### Contributing

We welcome contributions to expand this collection! If you have high-quality agent interaction data that follows our format, please:

1. Ensure data quality and privacy compliance
2. Follow the standardized format
3. Include proper documentation and licensing
4. Submit a pull request with your dataset

## Citation

If you use this dataset collection in your research, please cite:

```bibtex
@article{song2025agent,
  title={Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents},
  author={Song, Yueqi and Ramaneti, Ketan and Sheikh, Zaid and Chen, Ziru and Gou, Boyu and Xie, Tianbao and Xu, Yiheng and Zhang, Danyang and Gandhi, Apurva and Yang, Fan and others},
  journal={arXiv preprint arXiv:2510.24702},
  year={2025}
}
```

---
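### Sketch: Inspecting a Standardized Trajectory

As a rough illustration of the standardized (ADP) format described in this README, the sketch below walks the `content` list of a single record and tallies its steps by `class_`. The record is an abridged copy of the `alfworld_0` example; the helper name `summarize_trajectory` is hypothetical and is not part of any ADP tooling. It assumes only the fields shown in the schema (`id`, `content`, `class_`, `function`).

```python
from collections import Counter

# Abridged copy of the alfworld_0 example in standardized (ADP) format.
record = {
    "id": "alfworld_0",
    "content": [
        {
            "class_": "text_observation",
            "content": "Here is your task. ... Your task is to: find two laptop and put them in bed.",
            "name": None,
            "source": "user",
        },
        {
            "class_": "api_action",
            "function": "go",
            "kwargs": {"location": "\"diningtable 1\""},
            "description": "I need to find two laptops and put them in bed.",
        },
        {
            "class_": "text_observation",
            "content": "On the diningtable 1, you see a alarmclock 2, ...",
            "name": None,
            "source": "user",
        },
    ],
    "details": {},
}


def summarize_trajectory(rec):
    """Hypothetical helper: count steps per class_ and list called functions."""
    steps = rec["content"]
    counts = Counter(step["class_"] for step in steps)
    functions = [s["function"] for s in steps if s["class_"] == "api_action"]
    return {"id": rec["id"], "counts": dict(counts), "functions": functions}


summary = summarize_trajectory(record)
print(summary)
```

The same helper could be applied to each dict returned by a JSON Lines reader over `full_std.jsonl`, assuming every record follows the schema shown in this README.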

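### Sketch: Extracting Function Calls from an SFT Turn

The SFT example in this README encodes the agent's reasoning in a `<think>` block and its action as `<function=...>` / `<parameter=...>` tags inside the `gpt` turn. The sketch below pulls these apart with regular expressions. The tag layout is inferred from that single example, so other files or agent formats in the collection may differ; `parse_gpt_turn` is a hypothetical helper, not an official parser.

```python
import re

# Patterns inferred from the alfworld_0 SFT example; treat them as assumptions.
THINK_RE = re.compile(r"<think>\n?(.*?)\n?</think>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_gpt_turn(value):
    """Hypothetical helper: split a gpt turn into reasoning and function calls."""
    think = THINK_RE.search(value)
    calls = []
    for name, body in FUNCTION_RE.findall(value):
        params = dict(PARAM_RE.findall(body))
        calls.append({"function": name, "parameters": params})
    return {"reasoning": think.group(1) if think else None, "calls": calls}


# The gpt turn from the SFT format example.
sample = (
    "<think>\nI need to find two laptops and put them in bed. "
    "I'll begin by searching the areas that are most likely to have laptops.\n</think>\n\n"
    "<function=bash>\n<parameter=command>\ngo(\"diningtable 1\")\n</parameter>\n</function>"
)

parsed = parse_gpt_turn(sample)
print(parsed["calls"])
```

Regular expressions keep the sketch dependency-free; a stricter parser would be preferable if parameter values can themselves contain tag-like text.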

Provider:

maas

Created:

2025-10-10

Collection summary

Dataset introduction