ViC-Bench

Name: ViC-Bench
Creator: maas
Published: 2025-12-04 16:41:44
License: Not specified

ModelScope Community. Updated 2025-12-04. Indexed 2025-07-19.

Download link:

https://modelscope.cn/datasets/meituan/ViC-Bench

Overview:

# ViC-Bench

![images](./assert/overview.png)

## About ViC-Bench

Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much as a human would. It has shown impressive success across various tasks, which has in turn driven advances in related benchmarks. Despite this progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly distort the models' original thinking trajectories and so fail to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the impact that IVS has on unconstrained reasoning performance.

To close these gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting. Each task has a dedicated free-style IVS generation pipeline that supports function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. In addition, we establish the Incremental Prompting Information Injection (IPII) strategy to ablate the prompting factors for VI-CoT. We conduct extensive evaluations of 18 advanced MLLMs, revealing key insights into their VI-CoT capability.

## Data Construction

Various benchmarks have emerged to evaluate recent VI-CoT methods. Despite promising advancements, few of them provide free-style IVS representations to MLLMs, as illustrated in Tab. 1. CoMT primarily provides fixed IVS, which might forcibly distort the original planning trajectories. MageBench offers dynamic IVS but imposes attribute constraints on action-observation memory.
More importantly, existing benchmarks neglect to systematically assess the impact (i.e., positive, negative, or null) that IVS has on unconstrained reasoning performance in MLLMs.

![images](./assert/statistics.png)

We adopted structured processing workflows that integrate images and visual states to support model decision-making across four tasks. The diagram below illustrates the data processing method for each task:

1. **Maze Navigation**: Screens mazes that meet the criteria through preprocessing, selects from an image pool, marks target areas, and processes them through three stages. Intermediate visual states are provided during each processing stage to assist in identifying the correct path.
2. **Jigsaw Puzzle**: Chooses suitable puzzle pieces from an image pool, conducts preprocessing, and marks puzzle target areas for processing within stages 2 and 3. Each processing stage provides function calls and intermediate visual states to guide task completion.
3. **Embodied Long-Horizon Planning**: Carries out preprocessing to ensure data quality, followed by manual checks and restructuring operations in stage 1 for data preparation. Models plan step by step towards the provided stage goals throughout the processing stages.
4. **Complex Counting**: Uses image pool selection and complex counting preprocessing to prepare the data. Tasks are processed through three stages, with intermediate visual states provided at each stage to help the model accurately count the human heads in each area.

![images](./assert/pipelines.png)

## Evaluation

Tab. 2 shows that most MLLMs perform competently in Stage 1. Performance drops significantly in Stage 2, indicating that current MLLMs have limitations in open-ended spatial reasoning and perception.
In Stage 3, with the support of free-style IVS, all models consistently achieve gains in global-level ACC and fine-grained R_o, leading to impressive ThinkGain, which indicates the effectiveness of free-style IVS in addressing deficiencies of spatial-aware cognition.

![images](./assert/performance.png)

## Data Samples for Three Stages

### Stage1

```json
{
  "instanceId": 142353922,
  "prompt": "<image_1>You are a complex counting expert. The given input image exist numerous human heads and are divided into four areas named 1, 2, 3, 4 by irregular lines. In this task, you need to correctly count the number of human heads in each area sequentially from 1 to 4 and sum them up to determine the total number of heads in the given input image. Please select the most appropriate option you think from the provided four options. \nA. 44 \nB. 39 \nC. 34 \nD. 29",
  "target": "B",
  "images": {
    "<image_1>": "ViC-Bench/images/counting/2170.png"
  },
  "extra_data": {
    "options": [44, 39, 34, 29],
    "split": "(1, 9), (2, 11), (3, 11), (4, 8)"
  }
}
```

### Stage2

```json
{
  "instanceId": 142354430,
  "prompt": "<image_1>You are a complex counting expert. The given input image exist numerous human heads and are divided into four areas named 1, 2, 3, 4 by irregular lines. In this task, you need to correctly count the number of human heads in each area. The final answer format should be <Begin>(1, x), (2, x), (3, x), (4, x)</End>. For example, <Begin>(1, 10), (2, 14), (3, 21), (4, 23)</End>.",
  "target": "(1, 8), (2, 9), (3, 12), (4, 11)",
  "images": {
    "<image_1>": "ViC-Bench/images/counting/2882.png"
  },
  "extra_data": {
    "total": 40
  }
}
```

### Stage3

```json
{
  "instanceId": 142354469,
  "prompt": "<image_1>You are a complex counting expert. The given input image exist numerous human heads and are divided into four areas named {1, 2, 3, 4} by irregular lines. In this task, you need to correctly count the number of human heads in each area. Before making decision for each area, you can think, plan, and even reflect step by step, and then output your final judgement. The output decision format at each step should be <Begin> (x, y),</End>, where x denotes the area name (1, 2, 3, or 4) and y refers to head number. In addition, to assist you in making the final correct judgement, we will provide the intermediate visual state image after you make each decision. In the provided intermediate visual state image, the faces within specific areas are correctly removed by bounding box masks, which can help you verify the correctness of your previous judgment as well as offer a foundation for executing subsequent judgments. Note that you must make the final judgment only after we input at least one intermedicate visual state image. The final output format should be <Begin> (1, x), (2, x), (3, x), (4, x) </End>. For example, <Begin> (1, 10), (2, 14), (3, 21), (4, 23) </End>.",
  "target": "(1, 7), (2, 6), (3, 9), (4, 6)",
  "images": {
    "<image_1>": "ViC-Bench/images/counting/1631.png"
  },
  "extra_data": {
    "step_images": [
      "ViC-Bench/images/counting/1631-mask-1.png",
      "ViC-Bench/images/counting/1631-mask-2.png",
      "ViC-Bench/images/counting/1631-mask-3.png",
      "ViC-Bench/images/counting/1631-mask-4.png"
    ],
    "total": 28
  }
}
```

* **instanceId**: A unique identifier for this task instance.
* **prompt**: The input prompt for the model, with `<image_xx>` serving as a placeholder for images.
* **target**: The correct answer or expected result.
* **images**: A mapping from each image placeholder to the path of the image file to be analyzed.
* **extra_data**: Supplementary data for the instance that can be used for metric calculations.

## Incremental Prompting Information Injection (IPII)

```python
# System prompts for the maze navigation task at three IPII levels.
# Each level repeats the previous one and injects one more piece of information.
SYS_PROMPTs = {
    # level1: task description and output format only.
    "level1": (
        "You are a maze navigation expert. "
        "I will provide you with a 4 x 4 maze diagram, where the red lines represent maze boundaries or walls, indicating impassable areas, while the dark grey lines represent passable areas. "
        "In this maze, you can only move once at each step, and you can only go left, right, up, or down. "
        "Additionally, the diagram includes a starting point 'S' and an ending point 'E'. "
        "In this task, you should carry out your own navigation planning and provide me with a final sequence of moves that can successfully reach the endpoint 'E' from the starting point 'S'. "
        "Moreover, to assist you in making better judgments, I will provide you with the intermediate maze state diagram obtained after each move is executed. "
        "For each step, please reply with only one specific move using the format <Begin>Go XX</End>, where XX can only be selected from Left, Right, Up, Down."
    ),
    # level2: adds the instruction to update the internal intermediate visual state.
    "level2": (
        "You are a maze navigation expert. "
        "I will provide you with a 4 x 4 maze diagram, where the red lines represent maze boundaries or walls, indicating impassable areas, while the dark grey lines represent passable areas. "
        "In this maze, you can only move once at each step, and you can only go left, right, up, or down. "
        "Additionally, the diagram includes a starting point 'S' and an ending point 'E'. "
        "In this task, you should carry out your own navigation planning and provide me with a final sequence of moves that can successfully reach the endpoint 'E' from the starting point 'S'. "
        "Please make sure that after executing the move at each step, you should envision your current position in the maze and update your internal intermediate visual state, rather than remaining in the initial input visual state. "
        "Moreover, to assist you in making better judgments, I will provide you with the intermediate maze state diagram obtained after each move is executed. "
        "For each step, please reply with only one specific move using the format <Begin>Go XX</End>, where XX can only be selected from Left, Right, Up, Down."
    ),
    # level3: additionally injects the coordinates of 'S' and 'E' via {origin} and {target}.
    "level3": (
        "You are a maze navigation expert. "
        "I will provide you with a 4 x 4 maze diagram, where the red lines represent maze boundaries or walls, indicating impassable areas, while the dark grey lines represent passable areas. "
        "In this maze, you can only move once at each step, and you can only go left, right, up, or down. "
        "Additionally, the diagram includes a starting point 'S' and an ending point 'E'. "
        "The coordinates of 'S' and 'E' are {origin} and {target}, where the first value represents the row index (0-3) and the second value represents the column index (0-3)."
        "In this task, you should carry out your own navigation planning and provide me with a final sequence of moves that can successfully reach the endpoint 'E' from the starting point 'S'. "
        "Please make sure that after executing the move at each step, you should envision your current position in the maze and update your internal intermediate visual state, rather than remaining in the initial input visual state. "
        "Moreover, to assist you in making better judgments, I will provide you with the intermediate maze state diagram obtained after each move is executed. "
        "For each step, please reply with only one specific move using the format <Begin>Go XX</End>, where XX can only be selected from Left, Right, Up, Down."
    ),
}
```

## Citation

```
@misc{wu2025vicbenchbenchmarkingvisualinterleavedchainofthought,
  title={ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations},
  author={Xuecheng Wu and Jiaxing Liu and Danlei Huang and Xiaoyu Li and Yifan Wang and Chen Chen and Liya Ma and Xuezhi Cao and Junxiao Xue},
  year={2025},
  eprint={2505.14404},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.14404},
}
```
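Returning to the counting samples under "Data Samples for Three Stages": the `target`, `split`, and `total` fields all use the same `(area, count)` notation, so they can be parsed and cross-checked with a few lines of Python. The sketch below is an illustration only. The helper names (`parse_counts`, `parse_model_answer`, `per_area_match`) are hypothetical, the example model response is made up, and the per-area match rate is a simple stand-in, not the ACC, R_o, or ThinkGain metrics of the benchmark, whose definitions are not given here.

```python
import re
from typing import Dict

# "(area, count)" pairs as used in the counting prompts and targets.
PAIR_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")
# The <Begin>...</End> wrapper requested by the prompts.
WRAPPER_RE = re.compile(r"<Begin>(.*?)</End>", re.DOTALL)


def parse_counts(text: str) -> Dict[int, int]:
    """Return {area: count} from text such as '(1, 8), (2, 9)'."""
    return {int(area): int(count) for area, count in PAIR_RE.findall(text)}


def parse_model_answer(response: str) -> Dict[int, int]:
    """Parse the last <Begin>...</End> span of a response; empty dict if none."""
    spans = WRAPPER_RE.findall(response)
    return parse_counts(spans[-1]) if spans else {}


def per_area_match(pred: Dict[int, int], target: Dict[int, int]) -> float:
    """Fraction of areas predicted exactly (hypothetical stand-in metric)."""
    if not target:
        return 0.0
    hits = sum(1 for area, count in target.items() if pred.get(area) == count)
    return hits / len(target)


# Stage 1 sample: the per-area split should sum to the option marked as target.
stage1_split = parse_counts("(1, 9), (2, 11), (3, 11), (4, 8)")
stage1_options = [44, 39, 34, 29]
stage1_answer = "ABCD"[stage1_options.index(sum(stage1_split.values()))]

# Stage 2 sample: the per-area target should sum to extra_data["total"].
stage2_target = parse_counts("(1, 8), (2, 9), (3, 12), (4, 11)")
stage2_total = sum(stage2_target.values())

# A made-up model response, wrong only in area 3.
response = "Counting area by area. <Begin>(1, 8), (2, 9), (3, 10), (4, 11)</End>"
score = per_area_match(parse_model_answer(response), stage2_target)

print(stage1_answer, stage2_total, score)
```

For the samples shown, the Stage 1 split sums to 39 (option B) and the Stage 2 target sums to the recorded total of 40, which is a quick consistency check worth running on any record before scoring.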

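The IPII prompts can be exercised in a similar way. The following sketch shows how the `{origin}` and `{target}` placeholders of the level 3 prompt might be filled with `str.format`, and how a `<Begin>Go XX</End>` reply might be parsed and applied to a position on the 4 x 4 grid. Everything beyond the prompt text is an assumption: the helper names are hypothetical, rows are assumed to increase downward (so `Up` decreases the row index), walls are not modeled, and moves that would leave the grid are simply ignored.

```python
import re
from typing import Optional, Tuple

Position = Tuple[int, int]  # (row, column), both in 0..3

# The sentence that the level 3 prompt adds, copied from SYS_PROMPTs["level3"].
COORD_SENTENCE = (
    "The coordinates of 'S' and 'E' are {origin} and {target}, where the first "
    "value represents the row index (0-3) and the second value represents the "
    "column index (0-3)."
)

MOVE_RE = re.compile(r"<Begin>\s*Go\s+(Left|Right|Up|Down)\s*</End>")

# Assumption: row 0 is the top row, so moving up decreases the row index.
DELTAS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}


def fill_coordinates(template: str, origin: Position, target: Position) -> str:
    """Substitute the start and end coordinates into a prompt template."""
    return template.format(origin=origin, target=target)


def parse_move(response: str) -> Optional[str]:
    """Return the move named in the first <Begin>Go XX</End> span, if any."""
    match = MOVE_RE.search(response)
    return match.group(1) if match else None


def apply_move(position: Position, move: str, size: int = 4) -> Position:
    """Apply a move on a size x size grid, ignoring moves that leave the grid."""
    d_row, d_col = DELTAS[move]
    row, col = position[0] + d_row, position[1] + d_col
    if 0 <= row < size and 0 <= col < size:
        return (row, col)
    return position


filled = fill_coordinates(COORD_SENTENCE, origin=(3, 0), target=(0, 2))

# Replay two made-up replies starting from the assumed origin (3, 0).
position: Position = (3, 0)
for reply in ["<Begin>Go Up</End>", "I think <Begin>Go Right</End> is best."]:
    move = parse_move(reply)
    if move is not None:
        position = apply_move(position, move)

print(filled)
print(position)
```

A real harness would also need the maze layout to reject moves through walls and to render the intermediate maze state image after each step. Neither is sketched here.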
Provider:

maas

Created:

2025-07-15
