VisualToolBench
Source: ModelScope community (魔搭社区) · Updated 2026-05-10 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/ScaleAI/VisualToolBench
Resource description:
# VisualToolBench Dataset
A benchmark dataset for evaluating vision-language models on tool-use tasks.
## Dataset Statistics
- **Total samples**: 1204
- **Single-turn**: 603
- **Multi-turn**: 601
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique task identifier |
| `turncase` | string | Either "single-turn" or "multi-turn" |
| `num_turns` | int | Number of conversation turns (1 for single-turn) |
| `prompt_category` | string | Task category (e.g., "medical", "scientific", "general") |
| `eval_focus` | string | What aspect is being evaluated (e.g., "visual_reasoning", "tool_use") |
| `turn_prompts` | List[string] | Per-turn prompts (single-turn → list of length 1) |
| `turn_golden_answers` | List[string] | Per-turn golden answers |
| `turn_tool_trajectories` | List[string] | Per-turn tool trajectories (JSON strings) |
| `rubrics_by_turn` | List[string] | Per-turn rubric dicts as JSON strings (includes weights + metadata) |
| `images` | List[Image] | Flat list of all images (HF viewer shows these) |
| `images_by_turn` | List[List[Image]] | Images grouped by turn (to know which image belongs to which turn) |
| `num_images` | int | Total images in `images` |
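The per-turn columns in the table above appear to be aligned, with one entry per turn in each list. The sketch below checks that on a single sample represented as a plain dict. `check_sample` is a hypothetical helper, not part of any dataset tooling, and the rule that every per-turn column has exactly `num_turns` entries is an assumption inferred from the column descriptions rather than something the card states.

```python
from typing import Any

# Columns the schema table describes as "per-turn" or "grouped by turn".
PER_TURN_COLUMNS = [
    "turn_prompts",
    "turn_golden_answers",
    "turn_tool_trajectories",
    "rubrics_by_turn",
    "images_by_turn",
]


def check_sample(sample: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in one sample (empty if none).

    Assumption: every per-turn column has exactly `num_turns` entries.
    """
    problems = []
    n = sample["num_turns"]
    for col in PER_TURN_COLUMNS:
        if len(sample[col]) != n:
            problems.append(f"{col}: expected {n} entries, got {len(sample[col])}")
    # The schema says num_turns is 1 for single-turn samples.
    if sample["turncase"] == "single-turn" and n != 1:
        problems.append(f"single-turn sample has num_turns={n}")
    # The schema says num_images is the total number of entries in `images`.
    if sample["num_images"] != len(sample["images"]):
        problems.append("num_images does not match len(images)")
    return problems
```

Running a check like this over the whole split before evaluation is a cheap way to catch misaligned turns early.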
## Rubrics Format
Each rubric entry contains:
- `description`: What the rubric evaluates
- `weight`: Importance weight (1-5)
- `objective/subjective`: Whether evaluation is objective or subjective
- `explicit/implicit`: Whether the answer is explicit or implicit in the image
- `category`: List of categories (e.g., "instruction following", "truthfulness")
- `critical`: Whether this is a critical rubric ("yes"/"no")
- `final_answer`: Whether this relates to the final answer ("yes"/"no")
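One way to turn per-rubric verdicts into a single number is a weight-normalized pass rate. The sketch below is an illustrative aggregation only and is not necessarily the scoring used by the benchmark authors. `weighted_rubric_score`, the `verdicts` mapping (rubric id to pass/fail), and the rule that a failed critical rubric fails the whole turn are all assumptions made for this example.

```python
def weighted_rubric_score(rubrics: dict, verdicts: dict) -> tuple[float, bool]:
    """Aggregate pass/fail verdicts for one turn.

    rubrics:  parsed rubric dict for a turn (rubric id -> rubric entry)
    verdicts: hypothetical judge output (rubric id -> True if satisfied)

    Returns (weighted score in [0, 1], whether all critical rubrics passed).
    Rubrics missing from `verdicts` count as failed.
    """
    total = sum(float(r["weight"]) for r in rubrics.values())
    if total == 0:
        return 0.0, False
    earned = sum(
        float(r["weight"])
        for rid, r in rubrics.items()
        if verdicts.get(rid, False)
    )
    critical_ok = all(
        verdicts.get(rid, False)
        for rid, r in rubrics.items()
        if r.get("critical") == "yes"
    )
    return earned / total, critical_ok
```

Returning the critical-rubric flag separately keeps the choice between a graded score and a strict pass/fail metric up to the caller.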
## Usage
```python
import json

from datasets import load_dataset

# Load the dataset ("path/to/dataset" is a placeholder)
ds = load_dataset("path/to/dataset")

# Access a sample from the `test` split
sample = ds['test'][0]
print(sample['turn_prompts'])       # list[str]
print(sample['images'][0])          # PIL Image (first image overall)
print(sample['images_by_turn'][0])  # list of PIL Images for turn 1

# Parse rubrics for turn 1 (each entry is a JSON string)
turn1_rubrics = json.loads(sample['rubrics_by_turn'][0])
for rubric_id, rubric in turn1_rubrics.items():
    print(f"{rubric['description']} (weight: {rubric['weight']})")
```
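For multi-turn samples it can be convenient to walk the per-turn columns together. The following is a minimal sketch that works on a sample as a plain dict. `iter_turns` is a hypothetical helper, and it assumes that every per-turn list has `num_turns` entries and that each trajectory and rubric string is valid JSON, as the schema describes.

```python
import json
from typing import Any, Iterator


def iter_turns(sample: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one dict per turn, decoding the JSON-string columns.

    Assumption: all per-turn lists are aligned and have `num_turns` entries.
    """
    for i in range(sample["num_turns"]):
        yield {
            "turn": i + 1,  # 1-based turn number
            "prompt": sample["turn_prompts"][i],
            "golden_answer": sample["turn_golden_answers"][i],
            "tool_trajectory": json.loads(sample["turn_tool_trajectories"][i]),
            "rubrics": json.loads(sample["rubrics_by_turn"][i]),
            "images": sample["images_by_turn"][i],
        }
```

With a loaded sample, `for turn in iter_turns(sample): print(turn["turn"], turn["prompt"])` prints each turn's prompt in order.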
## Splits
- `test`: Full dataset (1204 samples)
## Citation
```bibtex
@article{guo2025beyond,
  title={Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning},
  author={Guo, Xingang and Tyagi, Utkarsh and Gosai, Advait and Vergara, Paula and Park, Jayeon and Montoya, Ernesto Gabriel Hern{\'a}ndez and Zhang, Chen Bo Calvin and Hu, Bin and He, Yunzhong and Liu, Bing and others},
  journal={arXiv preprint arXiv:2510.12712},
  year={2025}
}
```
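VisualToolBench is a challenging benchmark for evaluating multimodal large language models (MLLMs) on tool-enabled visual perception, image transformation, and reasoning. It tests whether a model has two core abilities: understanding and reasoning about image content, and reasoning with images as tools by actively manipulating visual material (for example cropping, editing, or enhancing it) and integrating general-purpose tools to solve complex tasks. The dataset covers single-turn and multi-turn tasks across multiple domains, and each task comes with detailed rubrics for systematic evaluation. The Parquet files under `data/` are indexed automatically by the Hub and power the dataset viewer. Paper: [Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning](https://static.scale.com/uploads/654197dc94d34f66c0f5184e/vtb_paper.pdf)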
Provider:
maas
Created:
2025-10-18
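Collected summary
Dataset introduction

Background and challenges
Background overview
VisualToolBench is a benchmark dataset for evaluating the tool-use abilities of vision-language models. It contains 1204 samples covering single-turn and multi-turn tasks across categories such as medical, scientific, and general, and supports combined evaluation of aspects such as visual reasoning and tool use.
The above content was collected and summarized by 遇见数据集.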



