MM-Vet v2
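MM-Vet Dataset Overview
Dataset Introduction
The MM-Vet dataset evaluates large multimodal models on their integrated capabilities. It covers several core vision-language capabilities, including recognition, OCR, knowledge, language generation, spatial awareness, and math.
Dataset Versions
- MM-Vet v2: Extends MM-Vet with a new capability, "image-text sequence understanding", and enlarges the evaluation set while maintaining high quality.
Dataset Download
The dataset can be downloaded from the following link: Download Dataset
Dataset Evaluation
Evaluation Steps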
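- Install dependencies: Run pip install "openai>=1" to install the openai package (quote the requirement so the shell does not treat > as an output redirection), and obtain GPT-4/GPT-3.5 API access.
- Download the dataset: Download the dataset from the link above and unzip it.
- Run model inference: Use the provided inference scripts to run the model and save the results in JSON format.
- Evaluate the model: Use the provided evaluation script to score the model outputs.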
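The inference step above ends with model outputs saved as JSON. A minimal sketch of that step follows, assuming the results file is a flat mapping from sample ID to the model's answer string. The ID format (`v2_0`, `v2_1`), the `run_model` stub, and the output path are hypothetical placeholders, so check the provided inference scripts for the layout the evaluator actually expects.

```python
import json
import os
import tempfile


def run_model(question: str) -> str:
    """Hypothetical stand-in for a real multimodal model call."""
    return f"(model answer to: {question})"


def save_results(questions: dict[str, str], result_file: str) -> dict[str, str]:
    """Run the model on every question and write {sample_id: answer} as JSON."""
    results = {sample_id: run_model(q) for sample_id, q in questions.items()}
    # Create the output directory if it does not exist yet.
    os.makedirs(os.path.dirname(result_file) or ".", exist_ok=True)
    with open(result_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    return results


# Assumed ID format; the real dataset defines its own sample IDs.
questions = {
    "v2_0": "How many tomatoes are there?",
    "v2_1": "In which country was this photo taken?",
}
result_path = os.path.join(tempfile.mkdtemp(), "results", "my_model.json")
saved = save_results(questions, result_path)
```

Writing one flat JSON object keeps the file easy to pass to a scoring script through a single path argument such as --result_file.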
Inference Script Examples
image_detail=high  # or auto, low; see https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding
python inference/gpt4v.py --mmvet_path /path/to/mm-vet --image_detail ${image_detail}
python inference/gemini_vision.py --mmvet_path /path/to/mm-vet
Evaluation Script Example
python mm-vet_evaluator.py --mmvet_path /path/to/mm-vet --result_file results/llava_llama2_13b_chat.json
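Dataset Samples
The dataset contains multiple samples. Each sample consists of a question (Q), its ground-truth answer (GT), and the vision-language capabilities required to answer it. Some examples follow: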
Sample 1
Q: What occasions would someone use this meme?
GT: This meme, commonly known as "Screaming Panda," is typically used to express shock, surprise, or fear.
Required capabilities: Recognition, knowledge, language generation
Sample 2
Q: How many tomatoes are there?
GT: 5
Required capabilities: Recognition
Sample 3
Q: What is located to the right of the shampoo?
GT: conditioner
Required capabilities: OCR, spatial awareness
Sample 4
Q: Which room is bigger, the double garage or the living room?
GT: double garage
Required capabilities: OCR, spatial awareness, math
Sample 5
Q: On the right desk, what is to the left of the laptop?
GT: table lamp <OR> desk lamp
Required capabilities: Recognition, spatial awareness
Sample 6
Q: What are all the scene text in the image?
GT: 5:30PM<AND>88%<AND>Mario Kart 8 Deluxe<AND>MARIO KART 8 DELUXE<AND>SUPER MARIO ODYSSEY<AND>THE LEGEND OF ZELDA<AND>BREATH OF WILD<AND>Options<AND>Start
Required capabilities: OCR
Sample 7
Q: How many gallons of supreme gasoline can I get with $50?
GT: 13.6 <OR> 13.7
Required capabilities: OCR, math
Sample 8
Q: In which country was this photo taken?
GT: Australia
Required capabilities: Recognition, knowledge
Sample 9
Q: Can you explain this meme?
GT: This meme is a humorous take on procrastination and the tendency to delay tasks until a specific time.
Required capabilities: Recognition, OCR, knowledge, language generation
Sample 10
Q: The graph below shows the long-term international migration, UK, 1999-2008.
GT: The chart gives information about UK immigration, emigration and net migration between 1999 and 2008.
Required capabilities: Recognition, OCR, language generation, spatial awareness
Sample 11
Q: Which car is on the parking spot 33?
GT: no <OR> empty
Required capabilities: Recognition, OCR, spatial awareness
Sample 12
Q: Is this apple organic?
GT: yes
Required capabilities: Recognition, OCR
Sample 13
Q: Which are producers in this food web?
GT: Phytoplankton <AND> Seaweed
Required capabilities: OCR, knowledge, spatial awareness
Sample 14
Q: Is the person bigger than the car?
GT: no
Required capabilities: Recognition, knowledge, spatial awareness
Sample 15
Q: The table below gives information about the underground railway systems in six cities.
GT: The table shows data about the underground rail networks in six major cities.
Required capabilities: OCR, language generation, spatial awareness
Sample 16
Q: What will the girl on the right write on the board?
GT: 14
Required capabilities: Recognition, OCR, spatial awareness, math
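Several ground-truth answers above use the markers `<AND>` and `<OR>`. The samples suggest that `<OR>` separates interchangeable answers and `<AND>` separates parts that must all be present, although the page does not define them explicitly. The sketch below parses a GT string under that reading, with `<AND>` assumed to bind more loosely than `<OR>`. The function names are hypothetical, and the substring matcher only illustrates the annotation convention; since the evaluation steps call for GPT-4/GPT-3.5 API access, the provided evaluation script presumably scores answers differently.

```python
def parse_ground_truth(gt: str) -> list[list[str]]:
    """Split a GT string into required parts, each a list of accepted alternatives."""
    return [
        [alt.strip() for alt in part.split("<OR>") if alt.strip()]
        for part in gt.split("<AND>")
    ]


def naive_match_score(prediction: str, gt: str) -> float:
    """Fraction of required parts for which some alternative occurs in the prediction.

    Illustration only: a case-insensitive substring check, not the benchmark's scorer.
    """
    parts = parse_ground_truth(gt)
    pred = prediction.lower()
    hits = sum(any(alt.lower() in pred for alt in alts) for alts in parts)
    return hits / len(parts)


print(parse_ground_truth("table lamp <OR> desk lamp"))
# [['table lamp', 'desk lamp']]
print(parse_ground_truth("Phytoplankton <AND> Seaweed"))
# [['Phytoplankton'], ['Seaweed']]
print(naive_match_score("The producers are phytoplankton.", "Phytoplankton <AND> Seaweed"))
# 0.5
```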
For more samples, see: More samples
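Because every sample lists the capabilities it requires, results can be broken down per capability. The sketch below shows one plausible way to do this, assuming each sample has already received a score between 0 and 1. The record layout, the sample IDs, and the scores are made-up placeholders, not benchmark results, and the provided evaluation script may aggregate differently.

```python
from collections import defaultdict

# Hypothetical scored records; capability tags are copied from Samples 2, 3, and 7.
scored_samples = [
    {"id": "sample_2", "capabilities": ["Recognition"], "score": 1.0},
    {"id": "sample_3", "capabilities": ["OCR", "spatial awareness"], "score": 0.0},
    {"id": "sample_7", "capabilities": ["OCR", "math"], "score": 1.0},
]


def mean_score_by_capability(samples: list[dict]) -> dict[str, float]:
    """Average the score of every sample that requires a given capability."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for sample in samples:
        # A sample counts once toward each capability it requires.
        for capability in sample["capabilities"]:
            totals[capability] += sample["score"]
            counts[capability] += 1
    return {cap: totals[cap] / counts[cap] for cap in totals}


by_capability = mean_score_by_capability(scored_samples)
overall = sum(s["score"] for s in scored_samples) / len(scored_samples)
```

A sample that requires several capabilities contributes to each of them, so the per-capability averages overlap and do not sum to the overall score.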

- [1] MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities. National University of Singapore, Microsoft, Advanced Micro Devices (AMD), 2024.



