# HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models
You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
[Tianrui Guan*](https://tianruiguan.phd), [Fuxiao Liu*](https://fuxiaoliu.github.io/), Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
🔥🔥🔥
## We welcome everyone to contribute failure cases of Large Multimodal Models (e.g., GPT-4V) to our community!
🔥🔥🔥
Large language models (LLMs), once aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks, as shown by the recently released GPT-4V(ision), LLaVA-1.5, and others. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: the models may ignore the image context and rely solely on the language prior for reasoning, even when it contradicts the image. Conversely, the vision modules in VLMs are weaker than the LLMs and may produce misleading visual representations, which the LLMs then turn into confident mistakes. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which offers new insights into the illusions and hallucinations of VLMs and how to improve these models in the future.
If you find our paper useful, please cite it:
```bibtex
@misc{guan2023hallusionbench,
title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models},
author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
year={2023},
eprint={2310.14566},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{liu2023mitigating,
title={Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning},
author={Fuxiao Liu and Kevin Lin and Linjie Li and Jianfeng Wang and Yaser Yacoob and Lijuan Wang},
year={2023},
eprint={2306.14565},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## Updates
- [11/28] 🔥 The full paper has been uploaded and can be accessed [here](https://arxiv.org/abs/2310.14566). The dataset has been expanded and the leaderboard updated.
- [11/13] 🔥 Evaluation results for LLaVA-1.5 have been updated. More model results to come!
- [10/27] 🔥 The [leaderboard](https://paperswithcode.com/sota/visual-question-answering-vqa-on-3) and evaluation code are released! **You are welcome to add your model's results to our leaderboard!**
- [10/24] 🔥 The early report with case analysis and insights is available [here](https://arxiv.org/abs/2310.14566).
- [10/23] 🔥 Please check our previous work on mitigating hallucinations of LMMs ["Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning"](https://github.com/FuxiaoLiu/LRV-Instruction).
## Dataset Download
To keep evaluation simple, we provide all questions in yes/no form.
| Updated on | Questions and Annotations | Figures | Question Count | Figure Count |
| ----------- | :----: | :----: | :----: | :----: |
| Oct 27, 2023 | [HallusionBench.json](./HallusionBench.json) | [hallusion_bench.zip](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) | 254 | 69 |
### Evaluation
1. Clone the repo.
```
git clone https://github.com/tianyi-lab/HallusionBench.git
cd ./HallusionBench
```
2. Download the images [hallusion_bench.zip](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) and unzip the folder in the same directory.
3. The questions and image locations are saved in `./HallusionBench.json`. A sample entry looks like this:
```
{'category': 'VD', 'subcategory': 'illusion', 'visual_input': '1', 'set_id': '0', 'figure_id': '0', 'sample_note': 'circle', 'question_id': '0', 'question': 'Is the right orange circle the same size as the left orange circle?', 'gt_answer_details': 'The right orange circle is the same size as the left orange circle.', 'gt_answer': '1', 'filename': './hallusion_bench/VD/illusion/0_0.png'}
```
The key `visual_input` indicates whether the question requires visual input such as an image. `visual_input=1` means the question requires visual input; `visual_input=0` means it is a text-only question that does not.
4. Run your model on `./HallusionBench.json` and save the output file as `./HallusionBench_result.json`. Add your model's output under the key `'model_prediction'`. We provide a sample result [here](./HallusionBench_result_sample.json).
5. Finally, run the following code for evaluation:
```
python evaluation.py
```
You can use your own API key for GPT4 evaluation by editing the code [here](./utils.py#L10).
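Steps 3 and 4 can be sketched in Python as follows. This is a minimal sketch, not the repository's own script: it assumes `./HallusionBench.json` is a JSON list of objects shaped like the sample entry above, and `predict` is a hypothetical callable standing in for your model. The helper names `add_predictions` and `run_benchmark` are also illustrative.

```python
import json


def add_predictions(samples, predict):
    """Return copies of the samples with a 'model_prediction' key added.

    `predict` is a hypothetical callable (question, image_path) -> str
    that wraps your own model; it is not part of this repository.
    """
    results = []
    for sample in samples:
        # The sample entry stores visual_input as a string; '0' marks a
        # text-only question, so no image path is passed in that case.
        needs_image = str(sample["visual_input"]) != "0"
        image_path = sample["filename"] if needs_image else None
        result = dict(sample)  # copy so the input list is left untouched
        result["model_prediction"] = predict(sample["question"], image_path)
        results.append(result)
    return results


def run_benchmark(predict,
                  in_path="./HallusionBench.json",
                  out_path="./HallusionBench_result.json"):
    """Load the questions, run the model, and write the result file."""
    # Assumption: the input file is a JSON list of question objects.
    with open(in_path, encoding="utf-8") as f:
        samples = json.load(f)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(add_predictions(samples, predict), f, indent=2)
```

Calling `run_benchmark(my_predict)` from the repository root, where `my_predict` is your own wrapper, should then produce the file that `evaluation.py` expects in step 5.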
## Leaderboard
### Definition
* **Visual Dependent (VD) Questions**: questions that do not have a definite answer without the visual context.
* **Easy**: Original images obtained from the Internet.
* **Hard**: Edited images from the original images.
* **Visual Supplement (VS) Questions**: questions that can be answered without the visual input; the visual component merely provides supplemental information.
* **Easy**: No visual input. An uncertain answer without hallucination is also considered a correct response.
* **Hard**: With visual input. The answer must follow the provided figure and visual context.
### Metric
* **Accuracy per Figure (Consistency Test)**: Accuracy computed per figure. To make sure the model truly understands the image, we ask several variants of a question about the same knowledge on the same figure, and count the figure as correct only if the model answers all of them correctly. For example, the model should not give inconsistent responses to the questions "Is A bigger than B?" and "Is B smaller than A?".
* **Accuracy per Question**: Accuracy of all questions, including easy and hard questions.
* **Accuracy per Question Pair**: We ask the same question on similar images (or with and without images). We treat the same question text asked in different visual contexts as a **question pair** (usually an *easy* question and its corresponding *hard* question). This metric is the accuracy over all question pairs.
| Model | Question Pair Acc | Figure Acc | Easy Question Acc | Hard Question Acc | Question Acc | Json |
| ----- | :----: | :----: | :----: | :----: | :----: | :----: |
| **GPT4V** <br />Sep 25, 2023 Version <br />(Human Eval) | 31.42 | 44.22 | 79.56 | 38.37 | 67.58 | [VD](), [VS]() |
| **GPT4V** <br />Sep 25, 2023 Version <br />(GPT Eval) | 28.79 | 39.88 | 75.60 | 37.67 | 65.28 | [VD](), [VS]() |
| **LLaVA-1.5** <br />(Human Eval) | 9.45 | 25.43 | 50.77 | 29.07 | 47.12 | [VD](), [VS]() |
| **LLaVA-1.5** <br />(GPT Eval) | 10.55 | 24.86 | 49.67 | 29.77 | 46.94 | [VD](), [VS]() |
| **BLIP2-T5** <br />(GPT Eval) | 15.16 | 20.52 | 45.49 | 43.49 | 48.09 | [VD](), [VS]() |
| **InstructBLIP** <br />(GPT Eval) | 9.45 | 10.11 | 35.60 | 45.12 | 45.26 | [VD](), [VS]() |
| **Qwen-VL** <br />(GPT Eval) | 5.93 | 6.65 | 31.43 | 24.88 | 39.15 | [VD](), [VS]() |
| **Open-Flamingo** <br />(GPT Eval) | 6.37 | 11.27 | 39.56 | 27.21 | 38.44 | [VD](), [VS]() |
| **MiniGPT5** <br />(GPT Eval) |10.55 | 9.83 | 36.04| 28.37 | 40.30 | [VD](), [VS]() |
| **MiniGPT4** <br />(GPT Eval) |8.79 | 10.12 | 31.87| 27.67 | 35.78 | [VD](), [VS]() |
| **mPLUG_Owl-v2** <br />(GPT Eval) |13.85 | 19.94 | 44.84| 39.07 | 47.30 | [VD](), [VS]() |
| **mPLUG_Owl-v1** <br />(GPT Eval) |9.45 | 10.40 | 39.34| 29.77 | 43.93 | [VD](), [VS]() |
| **GiT** <br />(GPT Eval) |5.27 | 6.36 | 26.81| 31.86 | 34.37 | [VD](), [VS]() |
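The three metrics defined above can be sketched as a few grouping operations. This is an illustrative sketch, not the code in `evaluation.py`, and it rests on several assumptions: each record carries a hypothetical boolean `correct` field (the actual judgment comes from human or GPT evaluation); a figure is identified by `category`, `subcategory`, `set_id` and `figure_id`; a question pair is identified by `category`, `subcategory`, `set_id` and `question_id`; a pair counts as correct only when every question in it is correct; and text-only questions are skipped for the per-figure metric because they have no figure.

```python
from collections import defaultdict


def accuracy_per_question(records):
    """Fraction of individual questions answered correctly."""
    if not records:
        return 0.0
    return sum(bool(r["correct"]) for r in records) / len(records)


def _all_correct_rate(records, key_fields):
    """Group records by key_fields; a group counts only if all are correct."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in key_fields)].append(bool(r["correct"]))
    if not groups:
        return 0.0
    return sum(all(flags) for flags in groups.values()) / len(groups)


def accuracy_per_figure(records):
    """Consistency test: every question on a figure must be correct."""
    # Assumption: text-only questions (visual_input '0') have no figure.
    with_figure = [r for r in records if str(r["visual_input"]) != "0"]
    return _all_correct_rate(
        with_figure, ("category", "subcategory", "set_id", "figure_id"))


def accuracy_per_question_pair(records):
    """Same question across visual contexts: all copies must be correct."""
    return _all_correct_rate(
        records, ("category", "subcategory", "set_id", "question_id"))


# Hypothetical toy records: two figures, two questions asked on each.
base = {"category": "VD", "subcategory": "illusion",
        "set_id": "0", "visual_input": "1"}
demo = [
    dict(base, figure_id="0", question_id="0", correct=True),
    dict(base, figure_id="0", question_id="1", correct=True),
    dict(base, figure_id="1", question_id="0", correct=False),
    dict(base, figure_id="1", question_id="1", correct=True),
]
print(accuracy_per_question(demo))       # 0.75
print(accuracy_per_figure(demo))         # 0.5
print(accuracy_per_question_pair(demo))  # 0.5
```

In the toy data, three of four questions are correct, but only one of the two figures and one of the two question pairs is fully correct, which illustrates why the figure and pair scores in the table are much lower than the per-question accuracy.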
### Reproduce GPT4V results on leaderboard
1. We saved the output of GPT4V with our annotation. Put `HallusionBench.tsv` in the root directory of this repo, or set `input_file_name` in [gpt4v_benchmark.py](./gpt4v_benchmark.py) to the location of the [HallusionBench.tsv](https://drive.google.com/file/d/1q8db7-7IlA4WLZ_5Jt-TpLDyAWg8Ybx4/view?usp=sharing) file.
2. (Optional) If you don't have access to the GPT API, you don't need to run it, since we have saved the evaluation results. They can be downloaded for [Visual Dependent]() and [Visual Supplement](). Put the json files in the root directory of this repo, or set `save_json_path_vd` and `save_json_path_vs` in [gpt4v_benchmark.py](./gpt4v_benchmark.py) to their respective locations.
3. Run `python gpt4v_benchmark.py`.
## Examples and Analysis
<p align="center" >
<img src="./examples/f-01.png" alt="Example 1" class="center" width="800"/>
<img src="./examples/f-02.png" alt="Example 2" class="center" width="800"/>
<img src="./examples/f-04.png" alt="Example 3" class="center" width="800"/>
<img src="./examples/f-05.png" alt="Example 4" class="center" width="800"/>
<img src="./examples/f-08.png" alt="Example 5" class="center" width="800"/>
<img src="./examples/f-15.png" alt="Example 6" class="center" width="800"/>
<img src="./examples/f-10.png" alt="Example 7" class="center" width="800"/>
<img src="./examples/f-12.png" alt="Example 8" class="center" width="800"/>
<img src="./examples/f-17.png" alt="Example 9" class="center" width="800"/>
</p>
---
license: bsd-3-clause
---