Toucan-1.5M

Name: Toucan-1.5M
Creator: maas
Published: 2025-12-10 16:52:20
License: (no description provided)

ModelScope community · Updated 2025-12-10 · Indexed 2025-10-11

Download link:

https://modelscope.cn/datasets/Agent-Ark/Toucan-1.5M

Description:

# 🦤 Toucan-1.5M

Toucan-1.5M is the largest fully synthetic tool-agent dataset to date, designed to advance tool use in agentic LLMs. It comprises over 1.5 million trajectories synthesized from 495 real-world Model Context Protocols (MCPs) spanning 2,000+ tools. By leveraging authentic MCP environments, Toucan-1.5M generates diverse, realistic, and challenging tasks that require multiple tools, with trajectories involving real tool executions across multi-round, multi-turn, sequential, and parallel tool calls. Models fine-tuned on Toucan-1.5M outperform much larger closed-source counterparts on the BFCL V3 benchmark and extend the Pareto frontier on the MCP-Universe benchmark.

- 📄 [Technical Report](https://arxiv.org/abs/2510.01179) - Discover the methodology and technical details behind Toucan-1.5M
- 💾 [Github Repo](https://github.com/TheAgentArk/Toucan) - Access the complete pipeline used to produce Toucan-1.5M
- 🤗 [HF Dataset](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) - Full dataset (You are here!)
- 🤖 Model Checkpoints - [Qwen2.5-7B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-7B-Instruct-v0.1) | [Qwen2.5-14B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-7B-Instruct-v0.1) | [Qwen2.5-32B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-32B-Instruct-v0.1)

![Toucan-Pipeline](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/Dcz-NP1tfcJriku8FP2OT.jpeg)

## 📄 Dataset Schema

An instance of Toucan-1.5M contains the following columns:

- **uuid:** Unique data instance identifier.
- **subset:** Annotation specifying which pipeline was used to generate the trajectory. Options:
  1. *single-turn-original:* only the core synthetic data generation pipeline (Stages 1 to 5) is applied.
  2. *irrelevant:* a server shuffle process applied on top of the *single-turn-original* pipeline.
  3. *single-turn-diversify:* a question diversification process applied on top of the *single-turn-original* pipeline.
  4. *multi-turn:* a multi-turn extension of the *single-turn-original* and *single-turn-diversify* subsets.
- **messages:** The trajectory, formatted with the chat template of the LLM agent originally used for generation. The system prompt includes the associated list of tools in Hermes format.
- **question:** The user task crafted to generate the trajectory.
- **target_tools:** The MCP tools used as seeds for question generation. If multiple MCP servers are involved, we use the format `Server_Name::Tool_Name`; otherwise, we present only `Tool_Name`.
- **question_quality_assessment:** Task evaluation by an LLM-as-judge, covering quality, difficulty, realism, and uniqueness.
- **response_quality_assessment:** Response evaluation by an LLM-as-judge, covering completeness and conciseness.
- **metadata:** Original MCP server data collected and used as seed for generation, as well as the respective LLM annotations.

We include trajectories generated by Qwen3-32B, Kimi-K2, and GPT-OSS-120B, each stored under a separate configuration. In addition, we provide a carefully curated SFT subset that is ready for model fine-tuning in [Swift format](https://github.com/modelscope/ms-swift/blob/7bd6b014bbf6ced2f248800e5abb681618f2a6bd/docs/source_en/Instruction/Agent-support.md), with its performance demonstrated below.

## 📊 Dataset Stats and Performance

The histograms below illustrate the Toucan dataset analysis. Subfigures (a) and (b) provide statistics on the number of servers and required tools per instance, highlighting Toucan's comprehensive coverage of multi-server and multi-tool tasks. Subfigures (c) and (d) reveal that most tasks include more tools in the context than the targeted tools, underscoring the non-trivial tool selection challenge. Subfigure (e) displays the length of user messages in tokens. Subfigures (f) and (h) demonstrate the multi-turn nature of the tasks, characterized by extended and diverse interactions among users, agents, and tools. Subfigure (g) demonstrates that Toucan encompasses both single and parallel tool calls, which enhances the dataset's versatility in capturing diverse agent-tool interaction patterns.

![hf_histo](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/6fblRgoORB0OHNNJWMOpK.jpeg)

The figure below shows the subset distribution and dataset performance with SFT. We observe that Toucan remarkably improves baseline model performance through supervised fine-tuning (SFT) and enables smaller models to outperform larger models across different evaluation aspects.

![HF_perf](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/_O6VK5ij2gVfJL79edCUT.jpeg)

## 🧐 Other Information

**License**: This dataset is released under Apache 2.0.

**PII Notice**: We have made a best-effort attempt to scan our datasets and remove PII using rule-based string replacements.

**Caution**: The data were collected between June and September 2025; therefore, tool responses may reflect events restricted to this period, potentially introducing biases into training. Since we primarily use community MCP servers, the data are subject to stability issues such as frequent connection failures. We filter out only those trajectories in which all tool calls fail to yield meaningful responses, in order to preserve examples for training error-handling capabilities.

**Contact**: For questions, please contact [Zhangchen](mailto:zxu9@uw.edu) by email.

## 📚 Citation

If you find the data or code useful, please cite:

```
@misc{xu2025toucan,
  title={TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments},
  author={Zhangchen Xu and Adriana Meza Soria and Shawn Tan and Anurag Roy and Ashish Sunil Agrawal and Radha Poovendran and Rameswar Panda},
  year={2025},
  eprint={2510.01179},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.01179},
}
```
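
As a usage note for the `target_tools` column described in the schema, the sketch below splits one entry into its server and tool parts. It assumes each entry is a plain string in either the `Server_Name::Tool_Name` or the bare `Tool_Name` form; how multiple entries are stored in a row (list, delimited string, or otherwise) is not stated here, so the example works on single entries only. The names `ToolRef` and `parse_target_tool`, and the sample entries, are hypothetical and not part of the dataset or its tooling.

```python
from typing import NamedTuple, Optional


class ToolRef(NamedTuple):
    """One parsed target_tools entry (hypothetical helper type)."""
    server: Optional[str]  # None when the entry names only the tool
    tool: str


def parse_target_tool(entry: str) -> ToolRef:
    """Split 'Server_Name::Tool_Name' or 'Tool_Name' into its parts."""
    entry = entry.strip()
    # partition() splits on the first "::" only; sep is "" if it is absent.
    server, sep, tool = entry.partition("::")
    if sep:
        return ToolRef(server=server, tool=tool)
    return ToolRef(server=None, tool=entry)


# Made-up entries for illustration; real values come from the dataset rows.
examples = ["get_weather", "Weather_Server::get_forecast"]
parsed = [parse_target_tool(e) for e in examples]
```

Keeping the server as `None` for bare tool names, rather than an empty string, makes it easy to tell single-server rows apart from multi-server ones when grouping or counting.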


Provider: maas

Created: 2025-10-09
