VisualToolBench

Name: VisualToolBench
Creator: maas
Published: 2026-05-10 17:19:03
License: No description available

ModelScope Community · Updated 2026-05-10 · Indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/ScaleAI/VisualToolBench


Resource overview:

# VisToolBench Dataset

A benchmark dataset for evaluating vision-language models on tool-use tasks.

## Dataset Statistics

- **Total samples**: 1204
- **Single-turn**: 603
- **Multi-turn**: 601

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique task identifier |
| `turncase` | string | Either "single-turn" or "multi-turn" |
| `num_turns` | int | Number of conversation turns (1 for single-turn) |
| `prompt_category` | string | Task category (e.g., "medical", "scientific", "general") |
| `eval_focus` | string | What aspect is being evaluated (e.g., "visual_reasoning", "tool_use") |
| `turn_prompts` | List[string] | Per-turn prompts (single-turn → list of length 1) |
| `turn_golden_answers` | List[string] | Per-turn golden answers |
| `turn_tool_trajectories` | List[string] | Per-turn tool trajectories (JSON strings) |
| `rubrics_by_turn` | List[string] | Per-turn rubric dicts as JSON strings (includes weights + metadata) |
| `images` | List[Image] | Flat list of all images (HF viewer shows these) |
| `images_by_turn` | List[List[Image]] | Images grouped by turn (to know which image belongs to which turn) |
| `num_images` | int | Total images in `images` |

## Rubrics Format

Each rubric entry contains:

- `description`: What the rubric evaluates
- `weight`: Importance weight (1-5)
- `objective/subjective`: Whether evaluation is objective or subjective
- `explicit/implicit`: Whether the answer is explicit or implicit in the image
- `category`: List of categories (e.g., "instruction following", "truthfulness")
- `critical`: Whether this is a critical rubric ("yes"/"no")
- `final_answer`: Whether this relates to the final answer ("yes"/"no")

## Usage

```python
from datasets import load_dataset

# Load the dataset
ds = load_dataset("path/to/dataset")

# Access a sample
sample = ds['test'][0]
print(sample['turn_prompts'])       # list[str]
print(sample['images'][0])          # PIL Image (first image overall)
print(sample['images_by_turn'][0])  # list of PIL Images for turn 1

# Parse rubrics for turn 1
import json
turn1_rubrics = json.loads(sample['rubrics_by_turn'][0])
for rubric_id, rubric in turn1_rubrics.items():
    print(f"{rubric['description']} (weight: {rubric['weight']})")
```

## Splits

- `test`: Full dataset (1204 samples)

## Citation

```bibtex
@article{guo2025beyond,
  title={Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning},
  author={Guo, Xingang and Tyagi, Utkarsh and Gosai, Advait and Vergara, Paula and Park, Jayeon and Montoya, Ernesto Gabriel Hern{\'a}ndez and Zhang, Chen Bo Calvin and Hu, Bin and He, Yunzhong and Liu, Bing and others},
  journal={arXiv preprint arXiv:2510.12712},
  year={2025}
}
```
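The rubric format described above pairs each criterion with a `weight`, which suggests a weight-normalized score per turn. The following is a minimal sketch of that idea, not the benchmark's official scoring procedure: the function name `weighted_rubric_score`, the `passed` mapping of per-rubric judgments, and the example rubric contents are all hypothetical, and any special handling of `critical` rubrics is left out.

```python
import json


def weighted_rubric_score(rubrics_json: str, passed: dict) -> float:
    """Return the weight-normalized fraction of rubrics judged as passed.

    rubrics_json: one entry of `rubrics_by_turn` (a JSON string mapping
                  rubric id -> rubric dict with a `weight` field).
    passed:       hypothetical judgments, rubric id -> True/False.
    """
    rubrics = json.loads(rubrics_json)
    total = sum(float(r["weight"]) for r in rubrics.values())
    if total == 0:
        return 0.0
    earned = sum(
        float(r["weight"])
        for rubric_id, r in rubrics.items()
        if passed.get(rubric_id, False)  # missing judgments count as failed
    )
    return earned / total


# Synthetic rubric data for illustration only (not taken from the dataset).
example_rubrics = json.dumps({
    "r1": {"description": "Identifies the object", "weight": 5, "critical": "yes"},
    "r2": {"description": "Follows the requested format", "weight": 3, "critical": "no"},
    "r3": {"description": "Refers to the correct region", "weight": 2, "critical": "no"},
})
score = weighted_rubric_score(example_rubrics, {"r1": True, "r2": False, "r3": True})
print(score)  # 0.7
```

With real data, `rubrics_json` would be an element of `sample['rubrics_by_turn']`, and the judgments would come from whatever grader is applied to the model's answer.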

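VisualToolBench is a challenging benchmark for evaluating multimodal large language models (MLLMs) on tool-enabled visual perception, image transformation, and reasoning. It tests whether a model can both think about images and think with images: beyond understanding image content, the model must actively manipulate visual material (for example by cropping, editing, or enhancing it) and integrate general-purpose tools to solve complex tasks. The dataset covers single-turn and multi-turn tasks across multiple domains, and each task comes with a detailed rubric for systematic evaluation. The Parquet files under `data/` are indexed automatically by the Hub to power the dataset viewer. Paper: [Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning](https://static.scale.com/uploads/654197dc94d34f66c0f5184e/vtb_paper.pdf)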
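To make the kind of image manipulation mentioned above concrete, here is a small sketch of two such operations, cropping and contrast enhancement, written with Pillow. It illustrates the idea only and is not the tool interface used by the benchmark; the helper names `crop_region` and `enhance_contrast` and the synthetic gray test image are assumptions made for this example.

```python
from PIL import Image, ImageEnhance


def crop_region(image: Image.Image, box: tuple) -> Image.Image:
    """Crop to box = (left, upper, right, lower) in pixel coordinates."""
    return image.crop(box)


def enhance_contrast(image: Image.Image, factor: float = 1.5) -> Image.Image:
    """Scale contrast; factor 1.0 returns a copy of the original image."""
    return ImageEnhance.Contrast(image).enhance(factor)


# Synthetic 200x100 gray image standing in for a dataset image.
img = Image.new("RGB", (200, 100), color=(120, 120, 120))

# Zoom in on a region of interest, then boost its contrast.
patch = crop_region(img, (50, 20, 150, 80))
enhanced = enhance_contrast(patch, 2.0)
print(patch.size)  # (100, 60)
```

In a tool-enabled setup, a model would choose the crop box and enhancement factor itself and then reason over the transformed image.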

Provider:

maas

Created:

2025-10-18

Collection summary

Dataset introduction