# HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models
You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
[Tianrui Guan*](https://tianruiguan.phd), [Fuxiao Liu*](https://fuxiaoliu.github.io/), Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
🔥🔥🔥
## We welcome everyone to contribute failure cases of Large Multimodal Models (e.g., GPT-4V) to our community!
🔥🔥🔥
Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks, as shown by the recently released GPT-4V(ision), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and rely solely on a (possibly contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than the LLMs and may produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which offers novel insights into the illusions and hallucinations of VLMs and how to improve them in the future.
If you find our paper useful, please cite our paper:
```bibtex
@misc{guan2023hallusionbench,
title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models},
author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
year={2023},
eprint={2310.14566},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{liu2023mitigating,
title={Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning},
author={Fuxiao Liu and Kevin Lin and Linjie Li and Jianfeng Wang and Yaser Yacoob and Lijuan Wang},
year={2023},
eprint={2306.14565},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## Updates
- [11/28] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2310.14566). The dataset has been expanded and the leaderboard updated.
- [11/13] 🔥 Evaluation results on LLaVA-1.5 are updated. More model results to come!
- [10/27] 🔥 The [leaderboard](https://paperswithcode.com/sota/visual-question-answering-vqa-on-3) and evaluation code are released! **Welcome to update your model on our leaderboard!**
- [10/24] 🔥 The early report with case analysis and insights is available [here](https://arxiv.org/abs/2310.14566).
- [10/23] 🔥 Please check our previous work on mitigating hallucinations of LMMs ["Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning"](https://github.com/FuxiaoLiu/LRV-Instruction).
## Dataset Download
To keep evaluation simple, we provide all questions in the form of yes/no questions.
| Updated on | Questions and Annotations | Figures | Question Count | Figure Count |
| ----------- | :----: | :----: | :----: | :----: |
| Oct 27, 2023 | [HallusionBench.json](./HallusionBench.json) | [hallusion_bench.zip](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) | 254 | 69 |
### Evaluation
1. Clone the repo.
```
git clone https://github.com/tianyi-lab/HallusionBench.git
cd ./HallusionBench
```
2. Download the images [hallusion_bench.zip](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) and unzip the folder in the same directory.
3. The questions and image locations are saved in `./HallusionBench.json`. A data sample is shown below:
```
{'category': 'VD', 'subcategory': 'illusion', 'visual_input': '1', 'set_id': '0', 'figure_id': '0', 'sample_note': 'circle', 'question_id': '0', 'question': 'Is the right orange circle the same size as the left orange circle?', 'gt_answer_details': 'The right orange circle is the same size as the left orange circle.', 'gt_answer': '1', 'filename': './hallusion_bench/VD/illusion/0_0.png'}
```
The key `visual_input` indicates whether the question requires visual input such as an image. If `visual_input=1`, the question requires visual input; if `visual_input=0`, the question is text-only and needs no visual input.
4. Run your model on `./HallusionBench.json` and save the output file as `./HallusionBench_result.json`. Add your model's output under the key `'model_prediction'`. We provide a sample result [here](./HallusionBench_result_sample.json).
5. Finally, run the following code for evaluation:
```
python evaluation.py
```
You can use your own API key for GPT-4 evaluation by editing the code [here](./utils.py#L10).
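Step 4 above can be sketched as follows. This is a minimal sketch, not the official evaluation code: `run_model` is a hypothetical placeholder for your own VLM inference call, and the record fields follow the data sample shown in step 3.

```python
import json

def run_model(question, image_path=None):
    # Hypothetical placeholder: replace with your VLM's inference call.
    return "Yes"

def build_results(questions):
    """Attach a 'model_prediction' to every record, passing the image path
    only when visual_input indicates the question needs visual input."""
    results = []
    for q in questions:
        rec = dict(q)  # copy so the input records stay untouched
        image = rec["filename"] if rec.get("visual_input") != "0" else None
        rec["model_prediction"] = run_model(rec["question"], image)
        results.append(rec)
    return results

# Tiny inline example in the HallusionBench.json record format:
questions = [{
    "visual_input": "1",
    "question": "Is the right orange circle the same size as the left orange circle?",
    "filename": "./hallusion_bench/VD/illusion/0_0.png",
}]
results = build_results(questions)

# To produce the real result file:
# with open("HallusionBench.json") as f:
#     questions = json.load(f)
# with open("HallusionBench_result.json", "w") as f:
#     json.dump(build_results(questions), f)
```

Each record keeps all of its original keys, so `evaluation.py` can still match questions by their IDs.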
## Leaderboard
### Definition
* **Visual Dependent (VD) Questions**: questions that do not have an affirmative answer without the visual context.
* **Easy**: Original images obtained from the Internet.
* **Hard**: Images edited from the originals.
* **Visual Supplement (VS) Questions**: questions that can be answered without the visual input; the visual component merely provides supplemental information.
* **Easy**: No visual input. An uncertain answer, without hallucination, is also considered a correct response.
* **Hard**: With visual input. The answer must follow the provided figure and visual context.
### Metric
* **Accuracy per Figure (Consistency Test)**: Accuracy computed per figure. To make sure the model truly understands the image, we ask variants of questions based on the same knowledge about the same figure, and count the figure as correct only if the model answers all of them correctly. For example, the model should not give inconsistent responses to the questions "Is A bigger than B?" and "Is B smaller than A?".
* **Accuracy per Question**: Accuracy of all questions, including easy and hard questions.
* **Accuracy per Question Pair**: We ask the same question on similar images (or with and without the image). We consider the same question text under different visual contexts a **question pair** (usually an *easy* question and a corresponding *hard* question). This metric calculates the accuracy over all question pairs.
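The per-figure consistency test can be sketched as below. This is an illustrative sketch under assumptions, not the repo's evaluation code: grouping figures by `(set_id, figure_id)` (the identifiers seen in the data sample) and a boolean `correct` field are assumptions made for the example.

```python
from collections import defaultdict

def per_figure_accuracy(records):
    """Consistency test: a figure counts as correct only if *all*
    questions asked about that figure were answered correctly."""
    by_figure = defaultdict(list)
    for r in records:
        # Assumed grouping key: a figure is identified by set_id + figure_id.
        by_figure[(r["set_id"], r["figure_id"])].append(r["correct"])
    figures = list(by_figure.values())
    return sum(all(answers) for answers in figures) / len(figures)

records = [
    {"set_id": "0", "figure_id": "0", "correct": True},
    {"set_id": "0", "figure_id": "0", "correct": False},  # one miss fails the whole figure
    {"set_id": "0", "figure_id": "1", "correct": True},
]
acc = per_figure_accuracy(records)  # 1 of 2 figures fully correct -> 0.5
```

This all-or-nothing grouping is what makes the metric stricter than per-question accuracy: a model that answers "Is A bigger than B?" and "Is B smaller than A?" inconsistently loses credit for the entire figure.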
| Model | Question Pair Acc | Figure Acc | Easy Question Acc | Hard Question Acc | Question Acc | Json |
| ----- | :----: | :----: | :----: | :----: | :----: | :----: |
| **GPT4V** <br />Sep 25, 2023 Version <br />(Human Eval) | 31.42 | 44.22 | 79.56 | 38.37 | 67.58 | [VD](), [VS]() |
| **GPT4V** <br />Sep 25, 2023 Version <br />(GPT Eval) | 28.79 | 39.88 | 75.60 | 37.67 | 65.28 | [VD](), [VS]() |
| **LLaVA-1.5** <br />(Human Eval) | 9.45 | 25.43 | 50.77 | 29.07 | 47.12 | [VD](), [VS]() |
| **LLaVA-1.5** <br />(GPT Eval) | 10.55 | 24.86 | 49.67 | 29.77 | 46.94 | [VD](), [VS]() |
| **BLIP2-T5** <br />(GPT Eval) | 15.16 | 20.52 | 45.49 | 43.49 | 48.09 | [VD](), [VS]() |
| **InstructBLIP** <br />(GPT Eval) | 9.45 | 10.11 | 35.60 | 45.12 | 45.26 | [VD](), [VS]() |
| **Qwen-VL** <br />(GPT Eval) | 5.93 | 6.65 | 31.43 | 24.88 | 39.15 | [VD](), [VS]() |
| **Open-Flamingo** <br />(GPT Eval) | 6.37 | 11.27 | 39.56 | 27.21 | 38.44 | [VD](), [VS]() |
| **MiniGPT5** <br />(GPT Eval) |10.55 | 9.83 | 36.04| 28.37 | 40.30 | [VD](), [VS]() |
| **MiniGPT4** <br />(GPT Eval) |8.79 | 10.12 | 31.87| 27.67 | 35.78 | [VD](), [VS]() |
| **mPLUG_Owl-v2** <br />(GPT Eval) |13.85 | 19.94 | 44.84| 39.07 | 47.30 | [VD](), [VS]() |
| **mPLUG_Owl-v1** <br />(GPT Eval) |9.45 | 10.40 | 39.34| 29.77 | 43.93 | [VD](), [VS]() |
| **GiT** <br />(GPT Eval) |5.27 | 6.36 | 26.81| 31.86 | 34.37 | [VD](), [VS]() |
### Reproduce GPT4V results on leaderboard
1. We saved the output of GPT-4V with our annotations. Put `HallusionBench.tsv` in the root directory of this repo, or set `input_file_name` in [gpt4v_benchmark.py](./gpt4v_benchmark.py) to the location of the [HallusionBench.tsv](https://drive.google.com/file/d/1q8db7-7IlA4WLZ_5Jt-TpLDyAWg8Ybx4/view?usp=sharing) file.
2. (Optional) If you don't have access to the GPT API, you don't need to run it, since we have saved the evaluation results. They can be downloaded for [Visual Dependent]() and [Visual Supplement](). Put the json files in the root directory of this repo, or set `save_json_path_vd` and `save_json_path_vs` in [gpt4v_benchmark.py](./gpt4v_benchmark.py) to their respective locations.
3. Run `python gpt4v_benchmark.py`.
## Examples and Analysis
<p align="center" >
<img src="./examples/f-01.png" alt="Example 1" class="center" width="800"/>
<img src="./examples/f-02.png" alt="Example 2" class="center" width="800"/>
<img src="./examples/f-04.png" alt="Example 3" class="center" width="800"/>
<img src="./examples/f-05.png" alt="Example 4" class="center" width="800"/>
<img src="./examples/f-08.png" alt="Example 5" class="center" width="800"/>
<img src="./examples/f-15.png" alt="Example 6" class="center" width="800"/>
<img src="./examples/f-10.png" alt="Example 7" class="center" width="800"/>
<img src="./examples/f-12.png" alt="Example 8" class="center" width="800"/>
<img src="./examples/f-17.png" alt="Example 9" class="center" width="800"/>
</p>
---
license: bsd-3-clause
---