
VitaBench

ModelScope community | Updated 2026-05-16 | Indexed 2025-10-11
Download link:
https://modelscope.cn/datasets/meituan/VitaBench
Resource overview:
<div align=center><h1> 🌱VitaBench: Benchmarking LLM Agents<br> with Versatile Interactive Tasks </h1></div>

<p align="center">
📃 <a href="https://arxiv.org/abs/2509.26490" target="_blank">Paper</a> •
🌐 <a href="https://vitabench.github.io/" target="_blank">Website</a> •
🏆 <a href="https://vitabench.github.io/#Leaderboard" target="_blank">Leaderboard</a> •
🛠️ <a href="https://github.com/meituan-longcat/vitabench" target="_blank">Code</a> •
🤗 <a href="https://huggingface.co/datasets/meituan-longcat/VitaBench" target="_blank">Dataset</a><br>
</p>

## 🔔 News

- [2026-01] **[Qwen3-Max-Thinking](https://qwen.ai/blog?id=qwen3-max-thinking)** reported results on VitaBench to evaluate and demonstrate its tool-use capabilities (the average score across the 4 domains)! We invite the community to adopt VitaBench as the definitive touchstone for assessing tool-use performance, and **we appreciate diverse utilization & interpretation of our benchmark results**. Also, feel free to check out our recently updated version!
- [2026-01] VitaBench has been accepted to **[ICLR 2026](https://openreview.net/forum?id=rtcX9qOBaz)**! 🎉
- [2026-01] An updated version of our benchmark has been released, with rectified datasets and tools, upgraded evaluation models, and updated metrics for proprietary and open language models based on the new evaluator.
- [2025-11] The English version of the VitaBench dataset is now available! It includes fully translated tasks and databases, enabling broader international use. Try it out!
- [2025-10] Our paper is available on arXiv: [VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications](https://arxiv.org/abs/2509.26490)
- [2025-10] The VitaBench suite is released, including the **codebase, dataset and evaluation pipeline**! If you have any questions, feel free to raise issues and/or submit pull requests for new features or bug fixes.

## 📖 Introduction

In this paper, we introduce **VitaBench**, a challenging benchmark that evaluates agents on **v**ersatile **i**nteractive **ta**sks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising **66 tools**. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding **100 cross-scenario tasks (main results) and 300 single-scenario tasks**.

Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, use complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions.

Our comprehensive evaluation reveals that even the most advanced models achieve only a 32.5% success rate on cross-scenario tasks, and a success rate below 62% on the other tasks. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications.

> *The name “Vita” derives from the Latin word for “Life”, reflecting our focus on life-serving applications.*

![overall_performance](assets/overall_performance.png)

## 🌱 Benchmark Details

VitaBench provides an evaluation framework that supports model evaluation on both single-domain and cross-domain tasks through flexible configuration. For cross-domain evaluation, simply join multiple domain names with commas; the environments of the specified domains are then merged automatically into a unified environment.
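
The comma-separated domain configuration described above can be pictured with a small sketch. The snippet is an illustration only: `DomainEnv`, `REGISTRY`, `parse_domains`, `merge_envs`, and every tool name in it are hypothetical and are not taken from the VitaBench codebase. It simply assumes that merging environments means taking the union of each domain's tools, with tools shared between domains counted once.

```python
from dataclasses import dataclass, field


@dataclass
class DomainEnv:
    """Hypothetical stand-in for one domain's environment (not VitaBench's real class)."""
    name: str
    tools: dict[str, str] = field(default_factory=dict)  # tool name -> description


# Illustrative registry; all tool names here are invented for this sketch.
REGISTRY = {
    "delivery": DomainEnv("delivery", {
        "search_restaurants": "find restaurants",
        "create_delivery_order": "place a delivery order",
        "get_weather": "shared general-purpose tool",
    }),
    "instore": DomainEnv("instore", {
        "search_shops": "find shops",
        "book_table": "reserve a table",
        "get_weather": "shared general-purpose tool",
    }),
    "ota": DomainEnv("ota", {
        "search_hotels": "find hotels",
        "book_hotel": "reserve a room",
        "get_weather": "shared general-purpose tool",
    }),
}


def parse_domains(spec: str) -> list[str]:
    """Split a comma-separated spec into known domain names, keeping order."""
    names = [part.strip() for part in spec.split(",") if part.strip()]
    if not names:
        raise ValueError("no domain given")
    unknown = [n for n in names if n not in REGISTRY]
    if unknown:
        raise ValueError(f"unknown domain(s): {unknown}")
    return list(dict.fromkeys(names))  # drop repeated names, preserve order


def merge_envs(spec: str) -> DomainEnv:
    """Build one environment whose tool set is the union of the listed domains."""
    names = parse_domains(spec)
    merged: dict[str, str] = {}
    for n in names:
        merged.update(REGISTRY[n].tools)  # a shared tool name is stored only once
    return DomainEnv(",".join(names), merged)


if __name__ == "__main__":
    env = merge_envs("delivery,instore,ota")
    print(env.name, sorted(env.tools))
```

Under these assumptions, a single domain name yields that domain's environment unchanged, while a comma-separated list yields one unified environment. The actual configuration options are documented in the GitHub repository.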

Statistics of databases and environments:

| | Cross-Scenarios<br>(All domains) | Delivery | In-store | OTA |
| :----------------------------- | :------------------------------: | :------: | :------: | :---: |
| **Databases** | | | | |
| &nbsp;&nbsp; Service Providers | 1,324 | 409 | 611 | 1,437 |
| &nbsp;&nbsp; Products | 6,942 | 784 | 3,277 | 9,693 |
| &nbsp;&nbsp; Transactions | 334 | 48 | 36 | 154 |
| **API Tools** | | | | |
| &nbsp;&nbsp; Write | 27 | 4 | 9 | 14 |
| &nbsp;&nbsp; Read | 33 | 10 | 10 | 19 |
| &nbsp;&nbsp; General | 6 | 6 | 5 | 5 |
| **Tasks** | 100 | 100 | 100 | 100 |

## 🛠️ Environment

Please visit our GitHub repository [vitabench](https://github.com/meituan-longcat/vitabench) for detailed instructions on setting up and configuring the evaluation environment.

## 🔎 Citation

If you find our work helpful or relevant to your research, please cite our paper:

```
@article{he2025vitabench,
  title={VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications},
  author={He, Wei and Sun, Yueqing and Hao, Hongyan and Hao, Xueyuan and Xia, Zhikang and Gu, Qi and Han, Chengcheng and Zhao, Dengchang and Su, Hui and Zhang, Kefeng and Gao, Man and Su, Xi and Cai, Xiaodong and Cai, Xunliang and Yang, Yu and Zhao, Yunke},
  journal={arXiv preprint arXiv:2509.26490},
  year={2025}
}
```

## 📜 License

This project is licensed under the MIT License. See the [LICENSE](./LICENSE) file for details.

Provider:
maas
Created:
2025-10-09