
VisualToolBench

ModelScope community · Updated 2026-05-10 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/ScaleAI/VisualToolBench
Resource overview:
# VisToolBench Dataset

A benchmark dataset for evaluating vision-language models on tool-use tasks.

## Dataset Statistics

- **Total samples**: 1204
- **Single-turn**: 603
- **Multi-turn**: 601

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique task identifier |
| `turncase` | string | Either "single-turn" or "multi-turn" |
| `num_turns` | int | Number of conversation turns (1 for single-turn) |
| `prompt_category` | string | Task category (e.g., "medical", "scientific", "general") |
| `eval_focus` | string | What aspect is being evaluated (e.g., "visual_reasoning", "tool_use") |
| `turn_prompts` | List[string] | Per-turn prompts (single-turn → list of length 1) |
| `turn_golden_answers` | List[string] | Per-turn golden answers |
| `turn_tool_trajectories` | List[string] | Per-turn tool trajectories (JSON strings) |
| `rubrics_by_turn` | List[string] | Per-turn rubric dicts as JSON strings (includes weights + metadata) |
| `images` | List[Image] | Flat list of all images (HF viewer shows these) |
| `images_by_turn` | List[List[Image]] | Images grouped by turn (to know which image belongs to which turn) |
| `num_images` | int | Total images in `images` |

## Rubrics Format

Each rubric entry contains:

- `description`: What the rubric evaluates
- `weight`: Importance weight (1-5)
- `objective/subjective`: Whether evaluation is objective or subjective
- `explicit/implicit`: Whether the answer is explicit or implicit in the image
- `category`: List of categories (e.g., "instruction following", "truthfulness")
- `critical`: Whether this is a critical rubric ("yes"/"no")
- `final_answer`: Whether this relates to the final answer ("yes"/"no")

## Usage

```python
import json

from datasets import load_dataset

# Load the dataset
ds = load_dataset("path/to/dataset")

# Access a sample
sample = ds['test'][0]
print(sample['turn_prompts'])        # list[str]
print(sample['images'][0])           # PIL Image (first image overall)
print(sample['images_by_turn'][0])   # list of PIL Images for turn 1

# Parse rubrics for turn 1
turn1_rubrics = json.loads(sample['rubrics_by_turn'][0])
for rubric_id, rubric in turn1_rubrics.items():
    print(f"{rubric['description']} (weight: {rubric['weight']})")
```

## Splits

- `test`: Full dataset (1204 samples)

## Citation

```bibtex
@article{guo2025beyond,
  title={Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning},
  author={Guo, Xingang and Tyagi, Utkarsh and Gosai, Advait and Vergara, Paula and Park, Jayeon and Montoya, Ernesto Gabriel Hern{\'a}ndez and Zhang, Chen Bo Calvin and Hu, Bin and He, Yunzhong and Liu, Bing and others},
  journal={arXiv preprint arXiv:2510.12712},
  year={2025}
}
```
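
The rubric fields described in this card (`weight`, `critical`) lend themselves to a weighted aggregate per turn. The following is a minimal sketch, not the benchmark's official scoring: the dataset card does not define a scoring formula, so the weight-normalized fraction used here is an assumption. The function name `score_turn`, the `satisfied` mapping (rubric id to a grader's pass/fail verdict), and the example rubric payload are all hypothetical; only the field names come from the card.

```python
import json


def score_turn(rubrics_json: str, satisfied: dict[str, bool]) -> tuple[float, bool]:
    """Return (weighted score in [0, 1], whether all critical rubrics were met).

    Assumption: this aggregation rule is illustrative only; the dataset
    card does not specify an official scoring formula.
    """
    rubrics = json.loads(rubrics_json)
    total_weight = 0.0
    earned_weight = 0.0
    critical_ok = True
    for rubric_id, rubric in rubrics.items():
        weight = float(rubric["weight"])
        met = satisfied.get(rubric_id, False)  # unjudged rubrics count as unmet
        total_weight += weight
        if met:
            earned_weight += weight
        elif rubric.get("critical") == "yes":
            critical_ok = False
    score = earned_weight / total_weight if total_weight else 0.0
    return score, critical_ok


# Hypothetical rubric payload shaped like one entry of `rubrics_by_turn`.
example_rubrics = json.dumps({
    "r1": {"description": "Reads the label correctly", "weight": 5, "critical": "yes"},
    "r2": {"description": "Explains the reasoning", "weight": 3, "critical": "no"},
    "r3": {"description": "Follows the output format", "weight": 2, "critical": "no"},
})

score, critical_ok = score_turn(example_rubrics, {"r1": True, "r2": False, "r3": True})
print(f"score={score:.2f}, critical_ok={critical_ok}")  # score=0.70, critical_ok=True
```

Reporting the critical flag separately from the score keeps a high weighted score from hiding a failed critical rubric; whether to gate on it is left to the evaluator.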

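VisualToolBench is a challenging benchmark for evaluating multimodal large language models (MLLMs) on tool-enabled visual perception, image transformation, and reasoning. It tests whether a model has two core abilities: understanding and reasoning about image content, and using images as tools for reasoning by actively manipulating visual material (for example cropping, editing, or enhancing it) and integrating general-purpose tools to solve complex tasks. The dataset covers single-turn and multi-turn tasks across multiple domains, and every task comes with detailed rubrics for systematic evaluation. The Parquet files under `data/` are indexed automatically by the Hub to power the dataset viewer. Paper: [Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning](https://static.scale.com/uploads/654197dc94d34f66c0f5184e/vtb_paper.pdf)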
Provider: maas
Created: 2025-10-18
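Background and Challenges
Background Overview
VisualToolBench is a benchmark dataset for evaluating the tool-use capabilities of vision-language models. It contains 1,204 samples spanning single-turn and multi-turn tasks in categories such as medical, scientific, and general, and supports evaluation of aspects such as visual reasoning and tool use.
The summary above was collected and generated by 遇见数据集.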
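
Because each sample is either single-turn or multi-turn and stores its prompts, answers, trajectories, and rubrics as per-turn lists, a loader may want to confirm that these fields agree before evaluation. The following is a minimal sketch under stated assumptions: samples arrive as plain dicts keyed by the schema's column names, and every per-turn list is expected to have exactly `num_turns` entries (the card states this only for single-turn prompts). The helper `check_sample` and the example record are hypothetical.

```python
# Columns that the schema describes as "per-turn" lists.
PER_TURN_COLUMNS = (
    "turn_prompts",
    "turn_golden_answers",
    "turn_tool_trajectories",
    "rubrics_by_turn",
)


def check_sample(sample: dict) -> list[str]:
    """Return human-readable inconsistencies for one sample (empty if none)."""
    problems = []
    num_turns = sample["num_turns"]
    if sample["turncase"] == "single-turn" and num_turns != 1:
        problems.append(f"single-turn sample has num_turns={num_turns}")
    for column in PER_TURN_COLUMNS:
        # Assumption: each per-turn list has one entry per turn.
        if len(sample[column]) != num_turns:
            problems.append(
                f"{column} has {len(sample[column])} entries, expected {num_turns}"
            )
    return problems


# Hypothetical record; real samples also carry image columns and metadata.
example = {
    "turncase": "multi-turn",
    "num_turns": 2,
    "turn_prompts": ["first prompt", "second prompt"],
    "turn_golden_answers": ["first answer", "second answer"],
    "turn_tool_trajectories": ["[]", "[]"],
    "rubrics_by_turn": ["{}"],  # deliberately one entry short
}

for problem in check_sample(example):
    print(problem)
```

Returning a list of messages rather than raising on the first mismatch makes it easier to report every problem in a sample at once.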