VL-RewardBench

Name: VL-RewardBench
Creator: maas
Published: 2026-01-06 16:22:04
License: No description provided

ModelScope community. Updated 2026-01-06; indexed 2025-02-15.

Download link:

https://modelscope.cn/datasets/MMInstruction/VL-RewardBench

Description:

# Dataset Card for VLRewardBench

Project Page: https://vl-rewardbench.github.io

## Dataset Summary

VLRewardBench is a comprehensive benchmark designed to evaluate vision-language generative reward models (VL-GenRMs) across visual perception, hallucination detection, and reasoning tasks. The benchmark contains 1,250 high-quality examples specifically curated to probe model limitations.

## Dataset Structure

Each instance consists of a multimodal query from one of three key domains:

- General multimodal queries from real users
- Visual hallucination detection tasks
- Multimodal knowledge and mathematical reasoning

### Data Fields

Key fields:

- `id`: instance id
- `query`: text query of the multimodal prompt
- `image`: image input of the multimodal prompt
- `response`: list of two candidate responses generated by models
- `human_ranking`: ranking of the two responses; `[0, 1]` denotes that the first response is preferred, and `[1, 0]` denotes that the second one is better
- `models`: the models that generated the responses; useful for instances from the `wildvision` subset
- `query_source`: source dataset for the instance, one of:
  - WildVision
  - POVID
  - RLAIF-V
  - RLHF-V
  - MMMU-Pro
  - MathVerse

## Annotations

- Small LVLMs were used to filter challenging samples
- Strong commercial models generated responses with explicit reasoning paths
- GPT-4o performed quality assessment
- All preference labels underwent human verification

## Usage

### Intended Uses

The dataset is intended for research use only, specifically for:

- Evaluating and improving vision-language reward models
- Studying model limitations in visual perception and reasoning
- Developing better multimodal AI systems

## License

Research use only. Usage is restricted by the license agreements of GPT-4o and Claude.

## Citation Information

```bibtex
@article{VLRewardBench,
  title={VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models},
  author={Lei Li and Yuancheng Wei and Zhihui Xie and Xuqing Yang and Yifan Song and Peiyi Wang and Chenxin An and Tianyu Liu and Sujian Li and Bill Yuchen Lin and Lingpeng Kong and Qi Liu},
  year={2024},
  journal={arXiv preprint arXiv:2411.17451}
}
```
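The `human_ranking` convention described under Data Fields is easy to invert by accident, so the following minimal sketch shows one way to read it and to score a judge per `query_source`. It assumes each record is a plain dict with the fields listed in the card. The names `preferred_index`, `accuracy_by_source`, `always_first`, and the toy records are hypothetical and are not part of the dataset or any official evaluation code.

```python
from collections import defaultdict


def preferred_index(human_ranking):
    """Return the index of the human-preferred response.

    Per the card, [0, 1] means the first response is preferred and
    [1, 0] means the second one is, so the preferred response is the
    one that holds rank 0.
    """
    if sorted(human_ranking) != [0, 1]:
        raise ValueError(f"unexpected human_ranking: {human_ranking!r}")
    return human_ranking.index(0)


def accuracy_by_source(records, judge):
    """Compare a judge's picks with human_ranking, grouped by query_source.

    `judge` is a hypothetical callable (query, image, responses) -> 0 or 1
    that returns the index of the response it prefers.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        choice = judge(rec["query"], rec.get("image"), rec["response"])
        source = rec["query_source"]
        totals[source] += 1
        hits[source] += int(choice == preferred_index(rec["human_ranking"]))
    return {source: hits[source] / totals[source] for source in totals}


# Toy records invented for illustration; they are not taken from the dataset.
toy_records = [
    {"id": "a", "query": "What animal is shown?", "image": None,
     "response": ["A cat.", "A dog."], "human_ranking": [0, 1],
     "query_source": "WildVision"},
    {"id": "b", "query": "How many chairs are there?", "image": None,
     "response": ["Five.", "Two."], "human_ranking": [1, 0],
     "query_source": "POVID"},
]


def always_first(query, image, responses):
    """Trivial baseline judge that always picks the first response."""
    return 0


scores = accuracy_by_source(toy_records, always_first)
print(scores)
```

Reporting accuracy per source rather than one pooled number keeps the three domains (general queries, hallucination detection, reasoning) separable, since the sources map onto those domains.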


Provider: maas

Created: 2025-02-08
