APIGen-MT-5k

Name: APIGen-MT-5k
Creator: maas
Published: 2026-05-20 15:22:05
License: No description available

ModelScope Community · Updated 2026-05-20 · Indexed 2025-05-17

Download link:

https://modelscope.cn/datasets/AI-ModelScope/APIGen-MT-5k

Resource description:

## Summary

- [APIGen-MT](https://apigen-mt.github.io/) is an automated agentic data generation pipeline designed to synthesize *verifiable, high-quality, realistic datasets* for agentic applications.
- This dataset was released as part of [APIGen-MT: Agentic PIpeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay](https://arxiv.org/abs/2504.03601).
- Code: [https://github.com/apigen-mt/apigen-mt.github.io](https://github.com/apigen-mt/apigen-mt.github.io)
- The repo contains **5000** multi-turn trajectories collected by APIGen-MT.
- This dataset is a subset of the data used to train the [xLAM-2](https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4) model series.

## Overview

Agentic data consists of realistic trajectories in which an AI agent gradually comprehends the human's intent and interacts with tools and the environment step by step to complete the task. APIGen-MT builds on [APIGen](https://apigen-pipeline.github.io/), which focuses on generating single-turn function-calling data. APIGen-MT addresses two gaps: the lack of high-quality multi-turn agent interaction data in public datasets, and the high cost of manually collecting such data for domain-specific applications.

Each task in our dataset is verified through three hierarchical stages to ensure its reliability and correctness: format checking, function execution and domain policy checks, and semantic verification. In a human evaluation of 200 sampled trajectories, the success rate was 99%.

The overall framework for the dataset collection procedure is shown below. See more details at our project [website](https://apigen-mt.github.io/).
<div style="text-align: center;">
<img src="https://github.com/apigen-mt/apigen-mt.github.io/blob/main/img/pipeline.png?raw=true" alt="APIGen-MT Overview" width="620" style="margin: auto;">
</div>

## Dataset Details

- **Models Used**: [GPT-4o](https://platform.openai.com/docs/models/gpt-4o), [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3)
- **Domains**: Retail and Airline (via [τ-bench](https://github.com/sierra-research/tau-bench))
- **Size**: 5000 multi-turn dialogues
- **Format**: ShareGPT-like JSON, with structured conversation turns

The dataset is in `apigen-mt_5k.json`. After accepting the usage terms and logging in to your HuggingFace account, you can load the dataset with:

```python
from datasets import load_dataset

# Requires accepted usage terms and an authenticated HuggingFace session
datasets = load_dataset("Salesforce/APIGen-MT-5k")
```

The data is released in *ShareGPT* format, shown below:

```json
[
  {
    "conversations": [
      { "from": "human", "value": "human query" },
      { "from": "function_call", "value": "tool arguments" },
      { "from": "observation", "value": "tool result" },
      { "from": "gpt", "value": "agent response" }
    ],
    "system": "system prompt (having domain policy)",
    "tools": "tool description"
  }
]
```

## Benchmark Results

### Berkeley Function-Calling Leaderboard (BFCL v3)

<img width="80%" alt="BFCL Results" src="https://github.com/apigen-mt/apigen-mt.github.io/blob/main/img/bfcl-result.png?raw=true">

Performance comparison of different models on the [BFCL leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html). The rank is based on overall accuracy, which is a weighted average of the different evaluation categories. "FC" stands for function-calling mode, in contrast to using a customized "prompt" to extract the function calls.

### τ-bench Benchmark

<img width="80%" alt="Tau-bench Results" src="https://github.com/apigen-mt/apigen-mt.github.io/blob/main/img/taubench-result.png?raw=true">

Success Rate (pass@1) on the τ-bench benchmark, averaged across at least 5 trials.
Our xLAM-2-70b-fc-r model achieves an overall success rate of 56.2% on τ-bench, significantly outperforming the base Llama 3.1 70B Instruct model (38.2%) and other open-source models such as DeepSeek v3 (40.6%). Notably, our best model even outperforms proprietary models such as GPT-4o (52.9%) and approaches the performance of more recent models such as Claude 3.5 Sonnet (new) (60.1%).

<img width="80%" alt="Pass^k curves" src="https://github.com/apigen-mt/apigen-mt.github.io/blob/main/img/pass_k_curves_retail_airline.png?raw=true">

Pass^k curves measuring the probability that all 5 independent trials succeed for a given task, averaged across all tasks for the τ-retail (left) and τ-airline (right) domains. Higher values indicate better consistency of the models.

## Ethical Considerations

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

### Data Licenses

A part of this dataset was generated using GPT-4 and should not be used to develop models that compete with OpenAI.
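The pass^k consistency metric reported in the τ-bench results above can be made concrete with a small calculation. The sketch below assumes the combinatorial estimator described in the τ-bench paper: for a task with `c` successes in `n` trials, the chance that `k` sampled trials all succeed is estimated as C(c, k) / C(n, k), then averaged over tasks. The function name `pass_hat_k` and the outcome counts are hypothetical and are not taken from the reported results.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k: the chance that k trials of a task all succeed,
    averaged over tasks. Assumes every task was run num_trials times."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0
    per_task = [comb(c, k) / comb(num_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical data: number of successful runs out of 5 for four tasks
    successes = [5, 4, 0, 5]
    for k in range(1, 6):
        print(f"pass^{k} = {pass_hat_k(successes, 5, k):.3f}")
```

With `k = 1` this reduces to the ordinary mean success rate (pass@1). As `k` grows, the value can only stay flat or fall, which is why the curves read as a consistency measure.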
## Citation

If you use our model or dataset in your work, please cite our paper:

```bibtex
@article{prabhakar2025apigen,
  title={APIGen-MT: Agentic PIpeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay},
  author={Prabhakar, Akshara and Liu, Zuxin and Zhu, Ming and Zhang, Jianguo and Awalgaonkar, Tulika and Wang, Shiyu and Liu, Zhiwei and Chen, Haolin and Hoang, Thai and others},
  journal={arXiv preprint arXiv:2504.03601},
  year={2025}
}
```
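To make the ShareGPT-style record layout described under Dataset Details easier to work with, here is a minimal sketch that walks one record and runs a basic structural check. The sample record is invented and only mirrors the placeholder schema shown in the dataset card. `ALLOWED_ROLES`, `check_record`, and `count_roles` are hypothetical helpers, not part of the dataset tooling, and this is not the format-checking stage of the APIGen-MT pipeline. The sketch also assumes every `value` field is a string, as in the schema.

```python
import json
from collections import Counter

# Roles taken from the schema in the dataset card
ALLOWED_ROLES = {"human", "function_call", "observation", "gpt"}

# Invented sample that mirrors the documented layout (not real dataset content)
SAMPLE = json.loads("""
[
  {
    "conversations": [
      {"from": "human", "value": "human query"},
      {"from": "function_call", "value": "tool arguments"},
      {"from": "observation", "value": "tool result"},
      {"from": "gpt", "value": "agent response"}
    ],
    "system": "system prompt (having domain policy)",
    "tools": "tool description"
  }
]
""")


def check_record(record):
    """Return a list of structural problems in one record (empty if none)."""
    problems = []
    for key in ("conversations", "system", "tools"):
        if key not in record:
            problems.append(f"missing key: {key}")
    for i, turn in enumerate(record.get("conversations", [])):
        role = turn.get("from")
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unknown role {role!r}")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: value is not a string")
    return problems


def count_roles(records):
    """Count how many turns each role contributes across all records."""
    return Counter(
        turn["from"] for rec in records for turn in rec["conversations"]
    )


if __name__ == "__main__":
    for idx, rec in enumerate(SAMPLE):
        print(idx, check_record(rec) or "ok")
    print(dict(count_roles(SAMPLE)))
```

If `apigen-mt_5k.json` is a top-level JSON array as the schema suggests, loading it with `json.load` and passing the result to the same helpers should work unchanged.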


Provider:

maas

Created:

2025-05-16

Collected summary

Dataset introduction