
TIR-Bench

ModelScope community · Updated 2026-05-15 · Indexed 2026-05-17
Download link:
https://modelscope.cn/datasets/evalscope/TIR-Bench
Resource description:
# TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

## Introduction

TIR-Bench is a comprehensive benchmark designed to evaluate the "thinking-with-images" capabilities of Multimodal Large Language Models (MLLMs), addressing a gap left by existing benchmarks like Visual Search which only test basic operations. As models like OpenAI o3 begin to intelligently create and operate tools to transform images for problem-solving, TIR-Bench provides 13 diverse tasks that each require novel tool use for image processing and manipulation within a chain-of-thought.

Our evaluation of 22 leading MLLMs (including open-sourced, proprietary, and tool-augmented models) shows that TIR-Bench is universally challenging and that strong performance requires genuine agentic thinking-with-images capabilities.

This repository contains the full benchmark, evaluation scripts, and a pilot study comparing direct versus agentic fine-tuning for this advanced reasoning.

Paper Link: [https://arxiv.org/abs/2511.01833](https://arxiv.org/abs/2511.01833)

If you use this benchmark in your research, please consider citing it as follows:

```
@article{li2025tir,
  title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
  author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Chen, Wei and Psounis, Konstantinos and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2511.01833},
  year={2025}
}
```

Provider:
maas
Created:
2026-04-15