TIR-Bench

Name: TIR-Bench
Creator: maas
Published: 2026-05-15 17:07:34
License: No description available

Source: ModelScope community. Updated 2026-05-15. Indexed 2026-05-17.

Download link:

https://modelscope.cn/datasets/evalscope/TIR-Bench
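
As a minimal sketch of fetching the benchmark from the link above, the snippet below derives the dataset identifier from the URL and passes it to the ModelScope SDK. It assumes the `modelscope` Python package is installed. The helper names `dataset_id_from_url` and `load_tir_bench` are hypothetical, and the listing does not state which subsets or splits exist, so none are specified here.

```python
from urllib.parse import urlparse


def dataset_id_from_url(url: str) -> str:
    """Turn a ModelScope dataset URL into a 'namespace/name' identifier."""
    # Expected path shape: /datasets/<namespace>/<name>
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a ModelScope dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_tir_bench(url: str):
    """Load the dataset through the ModelScope SDK (needs network access)."""
    # Imported lazily so the URL helper works without the SDK installed.
    from modelscope.msdatasets import MsDataset

    # Assumption: default subset and split; adjust once the actual
    # layout of the dataset repository is known.
    return MsDataset.load(dataset_id_from_url(url))


if __name__ == "__main__":
    url = "https://modelscope.cn/datasets/evalscope/TIR-Bench"
    print(dataset_id_from_url(url))
```

Calling `load_tir_bench(url)` would then download the data. If the repository defines several subsets or splits, `MsDataset.load` accepts `subset_name` and `split` arguments to select one.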


Resource description:

# TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

## Introduction

TIR-Bench is a comprehensive benchmark designed to evaluate the "thinking-with-images" capabilities of Multimodal Large Language Models (MLLMs). It addresses a gap left by existing benchmarks such as Visual Search, which only test basic operations.

As models like OpenAI o3 begin to intelligently create and operate tools to transform images for problem-solving, TIR-Bench provides 13 diverse tasks that each require novel tool use for image processing and manipulation within a chain-of-thought.

Our evaluation of 22 leading MLLMs (including open-source, proprietary, and tool-augmented models) shows that TIR-Bench is universally challenging and that strong performance requires genuine agentic thinking-with-images capabilities.

This repository contains the full benchmark, evaluation scripts, and a pilot study comparing direct versus agentic fine-tuning for this advanced reasoning.

Paper Link: [https://arxiv.org/abs/2511.01833](https://arxiv.org/abs/2511.01833)

If you use this benchmark in your research, please consider citing it as follows:

```
@article{li2025tir,
  title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
  author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Chen, Wei and Psounis, Konstantinos and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2511.01833},
  year={2025}
}
```


Provider:

maas

Created:

2026-04-15
