five

kunato/Toucan-1.5M

收藏
Hugging Face2026-02-27 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/kunato/Toucan-1.5M
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: - config_name: Kimi-K2 features: - name: uuid dtype: string - name: subset_name dtype: string - name: messages dtype: string - name: question dtype: string - name: available_tools dtype: string - name: target_tools dtype: string - name: question_quality_assessment dtype: string - name: response_quality_assessment dtype: string - name: metadata dtype: string splits: - name: train num_bytes: 19540301213 num_examples: 518516 download_size: 6392602476 dataset_size: 19540301213 - config_name: OSS features: - name: uuid dtype: string - name: subset_name dtype: string - name: messages dtype: string - name: question dtype: string - name: available_tools dtype: string - name: target_tools dtype: string - name: question_quality_assessment dtype: string - name: response_quality_assessment dtype: string - name: metadata dtype: string splits: - name: train num_bytes: 23321900170 num_examples: 457130 download_size: 8158074700 dataset_size: 23321900170 - config_name: Qwen3 features: - name: uuid dtype: string - name: subset_name dtype: string - name: messages dtype: string - name: question dtype: string - name: available_tools dtype: string - name: target_tools dtype: string - name: question_quality_assessment dtype: string - name: response_quality_assessment dtype: string - name: metadata dtype: string splits: - name: train num_bytes: 21763561944 num_examples: 551613 download_size: 6837495729 dataset_size: 21763561944 - config_name: SFT features: - name: uuid dtype: string - name: subset_name dtype: string - name: question dtype: string - name: target_tools dtype: string - name: tools dtype: string - name: messages dtype: string splits: - name: train num_bytes: 1346302110 num_examples: 119287 download_size: 425496735 dataset_size: 1346302110 configs: - config_name: Kimi-K2 data_files: - split: train path: Kimi-K2/train-* - config_name: OSS data_files: - split: train path: OSS/train-* - config_name: Qwen3 data_files: - split: train path: Qwen3/train-* - config_name: SFT data_files: - split: train path: SFT/train-* license: apache-2.0 size_categories: - 1M<n<10M --- # 🦤 Toucan-1.5M: Toucan-1.5M is the largest fully synthetic tool-agent dataset to date, designed to advance tool use in agentic LLMs. It comprises over 1.5 million trajectories synthesized from 495 real-world Model Context Protocols (MCPs) spanning 2,000+ tools. By leveraging authentic MCP environments, Toucan-1.5M generates diverse, realistic, and challenging tasks requires using multiple tools, with trajectories involving real tool executions across multi-round, multi-turn, sequential, and parallel tool calls. Models fine-tuned on Toucan-1.5M outperform much larger closed-source counterparts on the BFCL V3 benchmark and extend the Pareto frontier on the MCP-Universe benchmark. - 📄 [Technical Report](https://arxiv.org/abs/2510.01179) - Discover the methodology and technical details behind Toucan-1.5M - 💾 [Github Repo](https://github.com/TheAgentArk/Toucan) - Access the complete pipeline used to produce Toucan-1.5M - 🤗 [HF Dataset](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) - Full dataset (You are here!) - 🤖 Model Checkpoints - [Qwen2.5-7B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-7B-Instruct-v0.1) | [Qwen2.5-14B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-7B-Instruct-v0.1) | [Qwen2.5-32B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-32B-Instruct-v0.1) ![Toucan-Pipeline](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/Dcz-NP1tfcJriku8FP2OT.jpeg) ## 📄 Dataset Schema An instance of Toucan-1.5M contains the following columns: - **uuid:** Unique data instance identifier. - **subset:** Annotation specifying which pipeline was used to generate the trajectory. Options: 1. *single-turn-original:* only the core synthetic data generation pipeline (Stage 1 to 5) are applied. 2. *irrelevant:* a server shuffle process applied on top of the *single-turn-original* pipeline. 3. *single-turn-diversify:* a question diversification process applied on top of the *single-turn-original* pipeline. 4. *multi-turn:* a multi-turn extension of the *single-turn-original* and *single-turn-diversify* subsets. - **messages:** The trajectory formatted with the chat template from the original LLM-agent used for generation. The system prompt includes the associated list of tools with Hermes format. - **question:** The user task crafted to generate the trajectory. - **target_tools:** The MCP tools used as seeds for question generation. If multiple MCP servers are involved, we use the format `Server_Name::Tool_Name`; otherwise, we present only `Tool_Name`. - **question_quality_assessment:** Task evaluation by an LLM-as-judge, covering quality, difficulty, realism, and uniqueness. - **response_quality_assessment:** Response evaluation by an LLM-as-judge, covering completeness and conciseness. - **metadata:** Original MCP server data collected and used as seed for generation, as well as respective LLM annotations. We include trajectories generated by Qwen3-32B, Kimi-K2, and GPT-OSS-120B, each stored under separate configurations. In addition, we provide a carefully curated SFT subset that is readily available for model fine-tuning in [Swift format](https://github.com/modelscope/ms-swift/blob/7bd6b014bbf6ced2f248800e5abb681618f2a6bd/docs/source_en/Instruction/Agent-support.md), with its performance demonstrated below. ## 📊 Dataset Stats and Performance The below histogram illustrates the Toucan dataset analysis. Subfigure (a) and (b) provide statistics on the number of servers and required tools per instance, highlighting Toucan's comprehensive coverage of multi-server and multi-tool tasks. Subfigures (c) and (d) reveal that most tasks include more tools in the context than the targeted tools, underscoring the non-trivial tool selection challenges. Subfigure (e) displays the length of user messages in tokens. Subfigures (f) and (h) demonstrate the multi-turn nature of the tasks, characterized by extended and diverse interactions among users, agents, and tools. Subfigure (g) demonstrates that Toucan encompasses both single and parallel tool calls, which enhance the dataset's versatility in capturing diverse agent-tool interaction patterns. ![hf_histo](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/6fblRgoORB0OHNNJWMOpK.jpeg) The below figure shows subset distribution and dataset performance with SFT. We observe that Toucan remarkably improves baseline model performance through supervised fine-tuning (SFT) and enables smaller models to outperform larger models across different evaluation aspects. ![HF_perf](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/_O6VK5ij2gVfJL79edCUT.jpeg) ## 🧐 Other Information **License**: This dataset is released under Apache 2.0. **PII Notice**: We have made a best-effort attempt to scan our datasets and remove PII using rule-based string replacements. **Caution**: The data were collected between June and September 2025; therefore, tool responses may reflect events restricted to this period, potentially introducing biases into training. Since we primarily use community MCP servers, the data are subject to stability issues such as frequent connection failures. We only filter out trajectories where all tool calls fail to yield meaningful responses, in order to preserve examples for training error-handling capabilities. **Contact**: For questions, please contact [Zhangchen](mailto:zxu9@uw.edu) by email. ## 📚 Citation If you find the data or code useful, please cite: ``` @misc{xu2025toucan, title={TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments}, author={Zhangchen Xu and Adriana Meza Soria and Shawn Tan and Anurag Roy and Ashish Sunil Agrawal and Radha Poovendran and Rameswar Panda}, year={2025}, eprint={2510.01179}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2510.01179}, } ```
提供机构:
kunato
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作