TIR-Bench
Source: ModelScope community. Updated 2026-05-15; indexed 2026-05-17.
Download link:
https://modelscope.cn/datasets/evalscope/TIR-Bench
Resource description:
# TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
## Introduction:
TIR-Bench is a comprehensive benchmark designed to evaluate the "thinking-with-images" capabilities of Multimodal Large Language Models (MLLMs). It addresses a gap left by existing benchmarks such as Visual Search, which test only basic operations. As models like OpenAI o3 begin to intelligently create and operate tools to transform images for problem-solving, TIR-Bench provides 13 diverse tasks, each requiring novel tool use for image processing and manipulation within a chain-of-thought. Our evaluation of 22 leading MLLMs (open-source, proprietary, and tool-augmented) shows that TIR-Bench is universally challenging and that strong performance requires genuine agentic thinking-with-images capabilities. This repository contains the full benchmark, evaluation scripts, and a pilot study comparing direct versus agentic fine-tuning for this advanced reasoning.
Paper Link: [https://arxiv.org/abs/2511.01833](https://arxiv.org/abs/2511.01833)
If you use this benchmark in your research, please consider citing it as follows:
```
@article{li2025tir,
title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Chen, Wei and Psounis, Konstantinos and Zhang, Kaipeng},
journal={arXiv preprint arXiv:2511.01833},
year={2025}
}
```
Provider:
maas
Created:
2026-04-15



