Download link: https://modelscope.cn/datasets/iic/MIBench
# MIBench
This dataset accompanies our EMNLP 2024 (main conference) paper [MIBench: Evaluating Multimodal Large Language Models over Multiple Images](https://arxiv.org/abs/2407.15272).
## Introduction
<div align="center">
<img src="overview.webp" alt="Overview" style="width: 500px; height: auto;">
</div>
**MIBench** covers 13 sub-tasks in three typical multi-image scenarios: Multi-Image Instruction, Multimodal Knowledge-Seeking and Multimodal In-Context Learning.
- **Multi-Image Instruction**: This scenario includes instructions for perception, comparison and reasoning across multiple input images. According to the semantic types of the instructions, it is divided into five sub-tasks: General Comparison, Subtle Difference, Visual Referring, Temporal Reasoning and Logical Reasoning.
- **Multimodal Knowledge-Seeking**: This scenario examines the ability of MLLMs to acquire relevant information from external knowledge, which is provided in an interleaved image-text format. Based on the forms of external knowledge, we categorize this scenario into four sub-tasks: Fine-grained Visual Recognition, Text-Rich Images VQA, Vision-linked Textual Knowledge and Text-linked Visual Knowledge.
- **Multimodal In-Context Learning**: In-context learning is another popular scenario, in which MLLMs answer visual questions after being given a series of multimodal demonstrations. To evaluate a model’s multimodal in-context learning (MIC) ability in a fine-grained manner, we categorize the MIC scenario into four distinct tasks: Close-ended VQA, Open-ended VQA, Hallucination and Demo-based Task Learning.
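
As a quick sanity check on the taxonomy above, the three scenarios and their sub-tasks can be written down as a plain mapping. This is only an illustrative sketch: the names `MIBENCH_TAXONOMY` and `count_subtasks` are hypothetical and are not part of the dataset or its tooling, and the sub-task strings are the display names from the list above, not necessarily the identifiers used in the data files.

```python
# Scenario -> sub-task display names, copied from the description above.
# The dictionary layout and all identifiers here are illustrative assumptions.
MIBENCH_TAXONOMY = {
    "Multi-Image Instruction": [
        "General Comparison",
        "Subtle Difference",
        "Visual Referring",
        "Temporal Reasoning",
        "Logical Reasoning",
    ],
    "Multimodal Knowledge-Seeking": [
        "Fine-grained Visual Recognition",
        "Text-Rich Images VQA",
        "Vision-linked Textual Knowledge",
        "Text-linked Visual Knowledge",
    ],
    "Multimodal In-Context Learning": [
        "Close-ended VQA",
        "Open-ended VQA",
        "Hallucination",
        "Demo-based Task Learning",
    ],
}


def count_subtasks(taxonomy):
    """Return the total number of sub-tasks across all scenarios."""
    return sum(len(subtasks) for subtasks in taxonomy.values())


if __name__ == "__main__":
    for scenario, subtasks in MIBENCH_TAXONOMY.items():
        print(f"{scenario}: {len(subtasks)} sub-tasks")
    print("total:", count_subtasks(MIBENCH_TAXONOMY))
```

The per-scenario counts (five, four and four) add up to the 13 sub-tasks stated in the introduction.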
## Examples
The following image shows examples from the three multi-image scenarios, covering all 13 sub-tasks. The correct answers are marked in blue.

## Data format
The following is an example record from the dataset. Each `<image>` placeholder in the `question` field marks the position of an image. To ensure better reproducibility, records in the Multimodal In-Context Learning scenario store the context information of the different shots in the `context` field.
```
{
"id": "general_comparison_1",
"image": [
"image/general_comparison/test1-902-0-img0.png",
"image/general_comparison/test1-902-0-img1.png"
],
"question": "Left image is <image>. Right image is <image>. Question: Is the subsequent sentence an accurate portrayal of the two images? One lemon is cut in half and has both halves facing outward.",
"options": [
"Yes",
"No"
],
"answer": "Yes",
"task": "general_comparison",
"type": "multiple-choice",
"context": null
},
```
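
The record above can be turned into an interleaved image-text sequence with a few lines of Python. The sketch below is a minimal illustration, not the authors' loader: it assumes that the i-th `<image>` placeholder corresponds to the i-th path in the `image` list, it ignores the `context` field, and the names `RECORD_JSON`, `IMAGE_TOKEN` and `interleave` are hypothetical.

```python
import json

IMAGE_TOKEN = "<image>"

# The sample record shown above, without the trailing comma so it parses on its own.
RECORD_JSON = """
{
  "id": "general_comparison_1",
  "image": [
    "image/general_comparison/test1-902-0-img0.png",
    "image/general_comparison/test1-902-0-img1.png"
  ],
  "question": "Left image is <image>. Right image is <image>. Question: Is the subsequent sentence an accurate portrayal of the two images? One lemon is cut in half and has both halves facing outward.",
  "options": ["Yes", "No"],
  "answer": "Yes",
  "task": "general_comparison",
  "type": "multiple-choice",
  "context": null
}
"""


def interleave(record):
    """Split the question at each <image> placeholder and pair the
    placeholders with the image paths in order (an assumption)."""
    parts = record["question"].split(IMAGE_TOKEN)
    images = record["image"]
    if len(parts) - 1 != len(images):
        raise ValueError("number of <image> placeholders and image paths differ")
    segments = []
    for i, text in enumerate(parts):
        if text:
            segments.append(("text", text))
        if i < len(images):
            segments.append(("image", images[i]))
    return segments


if __name__ == "__main__":
    record = json.loads(RECORD_JSON)
    for kind, value in interleave(record):
        print(f"{kind:5s} | {value}")
```

Raising an error when the placeholder count and the image count disagree makes malformed records visible early rather than silently misaligning images and text.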
## Citation
If you find this dataset useful for your work, please consider citing our paper:
```
@article{liu2024mibench,
title={Mibench: Evaluating multimodal large language models over multiple images},
author={Liu, Haowei and Zhang, Xi and Xu, Haiyang and Shi, Yaya and Jiang, Chaoya and Yan, Ming and Zhang, Ji and Huang, Fei and Yuan, Chunfeng and Li, Bing and others},
journal={arXiv preprint arXiv:2407.15272},
year={2024}
}
```
Provider: maas
Created: 2024-10-14



