FunReason-MT

Source: ModelScope community. Updated: 2026-01-09. Indexed: 2025-11-08.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/FunReason-MT
Resource description:
# From Failure to Mastery: Generating Hard Samples for Tool-use Agents

[![arXiv](https://img.shields.io/badge/arXiv-2601.01498-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2601.01498) [![arXiv](https://img.shields.io/badge/arXiv-2510.24645-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2510.24645) [![Model](https://img.shields.io/badge/Hugging%20Face-Model-yellow?logo=huggingface)](https://huggingface.co/Bingguang/FunReason-MT) [![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/inclusionAI/AWorld-RL) [![Project Page](https://img.shields.io/badge/Project-AWorld-green)](https://github.com/inclusionAI/AWorld)

***

> [!IMPORTANT]
> - This work extends the technical report **FunReason-MT Technical Report: Advanced Data Synthesis Solution for Real-world Multi-Turn Tool-use**.
> - To let the model learn from errors, we deliberately construct erroneous environmental responses. To exclude this data, remove the trajectories where `error_tool_response` is true.
> - This is an initial version of the data; we will release the up-to-date version after our paper is accepted.

## Dataset

The training set comprises **16,000 high-quality multi-turn samples**. It was generated with the three-phase HardGen (FunReason-MT) data synthesis framework, which focuses on generating complex trajectories.

## 📊 HardGen Evaluation Results

The model built upon HardGen is evaluated on the Berkeley Function-Calling Leaderboard (BFCL).

### BFCLv3 Multi-Turn and Single-Turn Performance

| Model (4B - 235B)           | Multi-Turn (Overall) | Single-Turn (Overall) |
| :-------------------------- | :------------------: | :-------------------: |
| Qwen3-4B-Instruct (Base)    | 22.13                | 82.14                 |
| **Qwen3-4B + HardGen (RL)** | **63.13**            | **87.14**             |
| Gemini-3-Pro-Preview        | 60.75                | 86.89                 |
| DeepSeek-V3.2-Exp           | 44.88                | 80.77                 |
| GPT-5.2-2025-12-11          | 28.13                | 76.12                 |

### BFCL Agentic Evaluation (BFCLv4 OOD)

Performance of models trained from Llama-3.1-8B-Instruct on agentic tasks (Web Search and Memory).

| Model               | BFCLv4 Overall Score |
| :------------------ | :------------------: |
| **HardGen-8B (RL)** | **20.42**            |
| CoALM-8B            | 1.40                 |
| ToolACE-2-8B        | 13.50                |
| BitAgent-8B         | 8.24                 |
| xLAM-2-8b-fc-r      | 10.24                |

-----

### Training Details

- **Training Libraries:** LLaMA-Factory and verl.
- **Methodology:** Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL).
- **Hardware:** Training was conducted on 8 NVIDIA A100 GPUs.

-----

## 🔗 Related Projects and Citation

This work is part of the open-source project **[AWorld, InclusionAI](https://github.com/inclusionAI/AWorld/)**.

If you use this dataset in your research, please cite the following papers:

```
@article{hao2026failure,
  title={From Failure to Mastery: Generating Hard Samples for Tool-use Agents},
  author={Hao, Bingguang and Xu, Zengzhuang and Wen, Yuntao and Xu, Xinyi and Liu, Yang and Zhao, Tong and Wang, Maolin and Chen, Long and Wang, Dong and Chen, Yicheng and others},
  journal={arXiv preprint arXiv:2601.01498},
  year={2026}
}

@article{xu2025funreason,
  title={FunReason-MT Technical Report: Advanced Data Synthesis Solution for Real-world Multi-Turn Tool-use},
  author={Zengzhuang Xu and Bingguang Hao and Zechuan Wang and Yuntao Wen and Xinyi Xu and Yang Liu and Long Chen and Dong Wang and Maolin Wang and Tong Zhao and Yicheng Chen and Cunyin Peng and Jinjie Gu and Leilei Gan and Xiangyu Zhao and Chenyi Zhuang and Shi Gu},
  journal={arXiv preprint arXiv:2510.24645},
  year={2025}
}
```

### Contact

For inquiries, please contact:

* `bingguanghao7@gmail.com`
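
The Important Hint near the top says that trajectories where `error_tool_response` is true can be removed if the constructed error responses are not wanted. A minimal sketch of that filter follows. It assumes each trajectory is a dictionary carrying `error_tool_response` as a top-level field, and that the flag may be stored either as a boolean or as a string such as `"true"`. The dataset card does not document the record schema, so both points are assumptions, and the helper names `has_error_response` and `drop_error_trajectories` are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def has_error_response(trajectory: Dict[str, Any]) -> bool:
    """Return True if the trajectory is flagged via `error_tool_response`.

    Assumption: the flag is a top-level key. A missing key is treated
    as "not an error trajectory".
    """
    flag = trajectory.get("error_tool_response", False)
    # Assumption: the flag may be serialized as a string like "true".
    if isinstance(flag, str):
        return flag.strip().lower() == "true"
    return bool(flag)


def drop_error_trajectories(
    trajectories: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Keep only trajectories that are not flagged as error trajectories."""
    return [t for t in trajectories if not has_error_response(t)]


if __name__ == "__main__":
    # Toy records standing in for real trajectories (illustrative only).
    sample = [
        {"id": 0, "error_tool_response": False},
        {"id": 1, "error_tool_response": True},
        {"id": 2},
        {"id": 3, "error_tool_response": "true"},
    ]
    kept = drop_error_trajectories(sample)
    print([t["id"] for t in kept])  # prints [0, 2]
```

If the flag turns out to be stored per turn rather than per trajectory, only `has_error_response` needs to change. The filtering function can stay as it is.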

Provider: maas
Created: 2025-11-03
Background overview
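FunReason-MT is a dataset for tool-use agents. It contains 16,000 high-quality multi-turn samples generated with the HardGen data synthesis framework, and is designed to improve agent performance by learning from errors. Models trained on it show marked gains on the Berkeley Function-Calling Leaderboard (BFCL): on the BFCLv3 multi-turn tasks, Qwen3-4B improves from 22.13 to 63.13, and on the BFCLv4 agentic evaluation the trained model reaches an overall score of 20.42.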
The summary above was collected and generated by 遇见数据集.