MM-Vet v2

Name: MM-Vet v2
Creator: National University of Singapore, Microsoft, Advanced Micro Devices
Published: 2024-08-02 01:59:54
License: No description available

arXiv · Updated 2024-08-02 · Indexed 2024-08-05

Download link:

https://github.com/yuweihao/MM-Vet

Resource overview:

The MM-Vet v2 dataset, co-created by the National University of Singapore, Microsoft, and Advanced Micro Devices, is designed to comprehensively evaluate the capabilities of large multimodal models. This dataset includes 517 high-quality evaluation samples covering diverse scenarios ranging from daily life to professional and industrial applications. The dataset's creation process involves researchers designing questions and collecting reference answers, which ensures the high quality and broad applicability of the dataset. Notably, MM-Vet v2 introduces the "image-text sequence understanding" capability to evaluate models' ability to process image and text sequence data, aiming to address the complex task processing challenges faced by multimodal models in real-world applications.

Provided by:

National University of Singapore, Microsoft, Advanced Micro Devices

Created:

2024-08-02

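Summary of original information

MM-Vet dataset overview

Dataset introduction

The MM-Vet dataset evaluates how well large multimodal models integrate capabilities. It covers several core vision-language capabilities, including recognition, OCR, knowledge, language generation, spatial awareness, and math.

Dataset versions

MM-Vet v2: extends MM-Vet with a new capability, "image-text sequence understanding", and enlarges the evaluation set while maintaining high quality.

Dataset download

The dataset can be downloaded from the following link: Download Dataset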

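Dataset evaluation

Evaluation steps

Install dependencies: run pip install "openai>=1" to install the openai package (the quotes keep the shell from treating > as a redirect), and obtain API access to GPT-4/GPT-3.5.
Download the dataset: download it from the link above and unzip it.
Run model inference: use the provided inference scripts and save the results in JSON format.
Evaluate the model: use the provided evaluation script to score the model outputs.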
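
The "save the results in JSON format" step can be sketched as follows. This is a minimal illustration only: the source does not specify the result-file schema, so the mapping from sample id to answer string, the ids `sample_1` and `sample_2`, and the helper names `save_results` and `load_results` are all assumptions. Check the repository's inference scripts for the actual format.

```python
import json
import tempfile
from pathlib import Path


def save_results(answers: dict[str, str], result_file: Path) -> None:
    """Write model answers to a JSON file, one entry per sample id."""
    result_file.parent.mkdir(parents=True, exist_ok=True)
    with result_file.open("w", encoding="utf-8") as f:
        json.dump(answers, f, ensure_ascii=False, indent=2)


def load_results(result_file: Path) -> dict[str, str]:
    """Read the answers back, e.g. to sanity-check the file before evaluation."""
    with result_file.open(encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample ids and answers, for illustration only.
answers = {"sample_1": "5", "sample_2": "conditioner"}

# A temporary directory keeps the sketch from touching a real results folder.
out_path = Path(tempfile.mkdtemp()) / "results" / "my_model.json"
save_results(answers, out_path)
```

The resulting file path is what would be passed to the evaluation script through --result_file.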

Inference script examples

image_detail=high  # or auto, low; see https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding

python inference/gpt4v.py --mmvet_path /path/to/mm-vet --image_detail ${image_detail}

python inference/gemini_vision.py --mmvet_path /path/to/mm-vet

Evaluation script example

python mm-vet_evaluator.py --mmvet_path /path/to/mm-vet --result_file results/llava_llama2_13b_chat.json

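Dataset samples

The dataset contains multiple samples. Each sample consists of a question (Q), a ground-truth answer (GT), and the vision-language capabilities it requires. Some examples follow: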

Sample 1

Q: What occasions would someone use this meme?
GT: This meme, commonly known as "Screaming Panda," is typically used to express shock, surprise, or fear.
Required capabilities: Recognition, knowledge, language generation

Sample 2

Q: How many tomatoes are there?
GT: 5
Required capabilities: Recognition

Sample 3

Q: What is located to the right of the shampoo?
GT: conditioner
Required capabilities: OCR, spatial awareness

Sample 4

Q: Which room is bigger, the double garage or the living room?
GT: double garage
Required capabilities: OCR, spatial awareness, math

Sample 5

Q: On the right desk, what is to the left of the laptop?
GT: table lamp <OR> desk lamp
Required capabilities: Recognition, spatial awareness

Sample 6

Q: What are all the scene text in the image?
GT: 5:30PM<AND>88%<AND>Mario Kart 8 Deluxe<AND>MARIO KART 8 DELUXE<AND>SUPER MARIO ODYSSEY<AND>THE LEGEND OF ZELDA<AND>BREATH OF WILD<AND>Options<AND>Start
Required capabilities: OCR

Sample 7

Q: How many gallons of supreme gasoline can I get with $50?
GT: 13.6 <OR> 13.7
Required capabilities: OCR, math

Sample 8

Q: In which country was this photo taken?
GT: Australia
Required capabilities: Recognition, knowledge

Sample 9

Q: Can you explain this meme?
GT: This meme is a humorous take on procrastination and the tendency to delay tasks until a specific time.
Required capabilities: Recognition, OCR, knowledge, language generation

Sample 10

Q: The graph below shows the long-term international migration, UK, 1999-2008.
GT: The chart gives information about UK immigration, emigration and net migration between 1999 and 2008.
Required capabilities: Recognition, OCR, language generation, spatial awareness

Sample 11

Q: Which car is on the parking spot 33?
GT: no <OR> empty
Required capabilities: Recognition, OCR, spatial awareness

Sample 12

Q: Is this apple organic?
GT: yes
Required capabilities: Recognition, OCR

Sample 13

Q: Which are producers in this food web?
GT: Phytoplankton <AND> Seaweed
Required capabilities: OCR, knowledge, spatial awareness

Sample 14

Q: Is the person bigger than the car?
GT: no
Required capabilities: Recognition, knowledge, spatial awareness

Sample 15

Q: The table below gives information about the underground railway systems in six cities.
GT: The table shows data about the underground rail networks in six major cities.
Required capabilities: OCR, language generation, spatial awareness

Sample 16

Q: What will the girl on the right write on the board?
GT: 14
Required capabilities: Recognition, OCR, spatial awareness, math

For more samples, see: More samples
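
Several ground-truth strings above use the markers <AND> and <OR>. The samples suggest that <AND> joins items that must all appear in an answer and <OR> joins acceptable alternatives. The sketch below splits such strings under that reading. The function names are hypothetical, and the substring check is a crude stand-in, not the GPT-4 scoring the benchmark uses.

```python
def parse_ground_truth(gt: str) -> tuple[str, list[str]]:
    """Split a ground-truth string into (mode, items).

    mode is "and" if the string contains <AND>, "or" if it contains <OR>,
    and "single" otherwise. Strings mixing both markers are not handled;
    <AND> takes precedence in that case.
    """
    for token, mode in (("<AND>", "and"), ("<OR>", "or")):
        if token in gt:
            items = [part.strip() for part in gt.split(token)]
            return mode, [item for item in items if item]
    return "single", [gt.strip()]


def naive_match(prediction: str, gt: str) -> bool:
    """Case-insensitive substring check (assumption: illustrative only)."""
    mode, items = parse_ground_truth(gt)
    pred = prediction.lower()
    hits = [item.lower() in pred for item in items]
    # "and" needs every item present; "or" and "single" need at least one.
    return all(hits) if mode == "and" else any(hits)


lamp = parse_ground_truth("table lamp <OR> desk lamp")
producers = parse_ground_truth("Phytoplankton <AND> Seaweed")
```

Substring matching breaks down quickly for free-form answers (for example, "no" matches inside "know"), which is one reason a model-based grader is used instead.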

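Collected summary

Dataset introduction

Construction

MM-Vet v2 is built around a new capability, "image-text sequence understanding", which evaluates a model's ability to process sequences of images and text. The research team designed and collected 517 questions spanning scenarios from daily life to expert applications. The questions retain MM-Vet's six core capabilities and add the new sequence-understanding capability. For questions that require long-form answers, GPT-4V first generates a draft answer, which experts then correct and rephrase to ensure high quality.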

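Usage

To evaluate a model with MM-Vet v2, the question, reference answer, and model output are first filled into a predefined template, which GPT-4 then scores. Scores range from 0 to 1 and indicate how correct the model output is. To reduce the variance in GPT-4's output, each sample is scored five times and the scores are averaged. Results are reported for each core capability and for capability integrations, giving a fuller picture of the model's multimodal ability.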
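
The scoring procedure described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the template wording, the function names, and the stub grader that replays fixed scores in place of a real GPT-4 call. Only the overall flow (fill a template, score five times, average, then average per capability) comes from the description.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable

# Hypothetical template; the real evaluator's prompt is not reproduced here.
TEMPLATE = (
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Prediction: {prediction}\n"
    "Score (0 to 1):"
)


def build_prompt(question: str, ground_truth: str, prediction: str) -> str:
    """Fill the question, reference answer, and model output into the template."""
    return TEMPLATE.format(
        question=question, ground_truth=ground_truth, prediction=prediction
    )


def score_sample(prompt: str, grader: Callable[[str], float], runs: int = 5) -> float:
    """Call the grader `runs` times and average, clamping each score to [0, 1]."""
    scores = [min(1.0, max(0.0, grader(prompt))) for _ in range(runs)]
    return mean(scores)


def aggregate(
    sample_scores: dict[str, float], capabilities: dict[str, list[str]]
) -> dict[str, float]:
    """Average sample scores per capability, plus an overall 'total'."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for sample_id, score in sample_scores.items():
        buckets["total"].append(score)
        for capability in capabilities[sample_id]:
            buckets[capability].append(score)
    return {name: mean(values) for name, values in buckets.items()}


# Stub grader standing in for a GPT-4 call; it replays fixed scores.
replayed = iter([1.0, 1.0, 0.5, 1.0, 0.5])
prompt = build_prompt("How many tomatoes are there?", "5", "There are five tomatoes.")
tomato_score = score_sample(prompt, lambda _prompt: next(replayed))

# Hypothetical sample ids and a made-up second score, for illustration only.
report = aggregate(
    {"s1": tomato_score, "s2": 0.4},
    {"s1": ["recognition"], "s2": ["recognition", "ocr"]},
)
```

Passing the grader as a callable keeps the averaging logic testable without network access. A real grader would wrap an API call and parse the numeric score from the response.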

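Background and challenges

Background overview

With the rapid development of large multimodal models (LMMs), evaluating their integrated capabilities on complex tasks has become especially important. MM-Vet v2 was developed jointly by research teams at the National University of Singapore and Microsoft to evaluate the integrated capabilities of large multimodal models on vision-language tasks. The dataset was released in 2024. Its main researchers include Weihao Yu and Zhengyuan Yang, and its core research question is how to effectively evaluate a model's understanding of image and text sequence data. MM-Vet v2 retains MM-Vet's six core capabilities and adds "image-text sequence understanding", making it an important benchmark for evaluating LMMs and for advancing research on multimodal models.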

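Current challenges

Building MM-Vet v2 involved several challenges. First, designing high-quality evaluation samples whose questions cover a wide range of real-world scenarios is a major difficulty. Second, the dataset must handle complex interactions between image and text sequences, which places higher demands on a model's multimodal understanding. Third, expanding the dataset makes sample diversity and quality control harder. By introducing a new capability and enlarging the sample set, MM-Vet v2 offers a more comprehensive tool for evaluating advanced LMMs, but it still has to address the evaluation difficulties that sample diversity and complexity bring while keeping quality high.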

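Common scenarios

Classic use cases

The classic use of MM-Vet v2 is to evaluate the integrated capabilities of large multimodal models, in particular how well they understand image and text sequence data. Its questions combine multiple vision and language tasks, so it tests a model's recognition, knowledge reasoning, spatial awareness, language generation, OCR, and math together.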

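Academic problems addressed

MM-Vet v2 addresses a major difficulty in current multimodal model evaluation: how to effectively assess a model's ability to handle complex sequences of visual and language data. Traditional evaluation methods are often limited to a single image-text pair and cannot fully reflect how a model performs in real applications. By introducing image-text sequence understanding as a new capability, MM-Vet v2 fills this gap and gives the research community a more comprehensive and precise evaluation tool.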

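Practical applications

In practice, MM-Vet v2 can be used to develop and optimize multimodal models, especially in scenarios that involve complex interaction between vision and language, such as intelligent assistants, autonomous driving, and medical diagnosis. By using MM-Vet v2 for model training and evaluation, developers can better understand and improve how their models perform in the real world, which in turn supports the practical application and commercialization of these technologies.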

Recent research on the dataset