# HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models
You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
[Tianrui Guan*](https://tianruiguan.phd), [Fuxiao Liu*](https://fuxiaoliu.github.io/), Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
🔥🔥🔥
## We welcome everyone to contribute failure cases of Large Multimodal Models (e.g., GPT-4V) to our community!
🔥🔥🔥
Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks, as shown by the recently released GPT-4V(ision), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and rely solely on a (possibly contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than the LLMs and may produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which offers novel insights into the illusions and hallucinations of VLMs and how to improve them in the future.
If you find our paper useful, please cite our paper:
```bibtex
@misc{guan2023hallusionbench,
title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models},
author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
year={2023},
eprint={2310.14566},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{liu2023mitigating,
title={Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning},
author={Fuxiao Liu and Kevin Lin and Linjie Li and Jianfeng Wang and Yaser Yacoob and Lijuan Wang},
year={2023},
eprint={2306.14565},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## Updates
- [11/28] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2310.14566). The dataset has been expanded and the leaderboard updated.
- [11/13] 🔥 Evaluation results on LLaVA-1.5 are updated. More model results to come!
- [10/27] 🔥 The [leaderboard](https://paperswithcode.com/sota/visual-question-answering-vqa-on-3) and evaluation code are released! **Welcome to update your model on our leaderboard!**
- [10/24] 🔥 The early report with case analysis and insights is available [here](https://arxiv.org/abs/2310.14566).
- [10/23] 🔥 Please check our previous work on mitigating hallucinations of LMMs ["Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning"](https://github.com/FuxiaoLiu/LRV-Instruction).
## Dataset Download
To keep evaluation simple, we provide all questions in the form of yes/no questions.
| Updated on | Questions and Annotations | Figures | Question Count | Figure Count |
| ----------- | :----: | :----: | :----: | :----: |
| Oct 27, 2023 | [HallusionBench.json](./HallusionBench.json) | [hallusion_bench.zip](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) | 254 | 69 |
### Evaluation
1. Clone the repo.
```
git clone https://github.com/tianyi-lab/HallusionBench.git
cd ./HallusionBench
```
2. Download the images [hallusion_bench.zip](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) and unzip the folder in the same directory.
3. The questions and image locations are saved in `./HallusionBench.json`. A data sample is shown below:
```
{'category': 'VD', 'subcategory': 'illusion', 'visual_input': '1', 'set_id': '0', 'figure_id': '0', 'sample_note': 'circle', 'question_id': '0', 'question': 'Is the right orange circle the same size as the left orange circle?', 'gt_answer_details': 'The right orange circle is the same size as the left orange circle.', 'gt_answer': '1', 'filename': './hallusion_bench/VD/illusion/0_0.png'}
```
The key `visual_input` indicates whether the question requires visual input such as an image. If `visual_input=1`, the question requires visual input; if `visual_input=0`, the question is text-only and needs no visual input.
4. Run your model on `./HallusionBench.json` and save the output file as `./HallusionBench_result.json`. Add your model's output under the key `'model_prediction'`. We provide a sample result [here](./HallusionBench_result_sample.json).
5. Finally, run the following code for evaluation:
```
python evaluation.py
```
You can use your own API key for GPT-4 evaluation by editing the code [here](./utils.py#L10).
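Step 4 above can be sketched as follows. This is a minimal sketch, not the official evaluation code: `run_model` is a hypothetical placeholder for your own VLM inference call, and the record fields follow the data sample shown in step 3.

```python
import json

def run_model(question, image_path=None):
    # Hypothetical placeholder: replace with your VLM's inference call.
    return "Yes"

def build_results(questions):
    """Attach a 'model_prediction' to every record, passing the image path
    only when visual_input indicates the question needs visual input."""
    results = []
    for q in questions:
        rec = dict(q)  # copy so the input records stay untouched
        image = rec["filename"] if rec.get("visual_input") != "0" else None
        rec["model_prediction"] = run_model(rec["question"], image)
        results.append(rec)
    return results

# Tiny inline example in the HallusionBench.json record format:
questions = [{
    "visual_input": "1",
    "question": "Is the right orange circle the same size as the left orange circle?",
    "filename": "./hallusion_bench/VD/illusion/0_0.png",
}]
results = build_results(questions)

# To produce the real result file:
# with open("HallusionBench.json") as f:
#     questions = json.load(f)
# with open("HallusionBench_result.json", "w") as f:
#     json.dump(build_results(questions), f)
```

Each record keeps all of its original keys, so `evaluation.py` can still match questions by their IDs.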
## Leaderboard
### Definition
* **Visual Dependent (VD) Questions**: questions that do not have an affirmative answer without the visual context.
* **Easy**: Original images obtained from the Internet.
* **Hard**: Images edited from the originals.
* **Visual Supplement (VS) Questions**: questions that can be answered without the visual input; the visual component merely provides supplemental information.
* **Easy**: No visual input. An uncertain answer, without hallucination, is also considered a correct response.
* **Hard**: With visual input. The answer must follow the provided figure and visual context.
### Metric
* **Accuracy per Figure (Consistency Test)**: Accuracy computed per figure. To make sure the model truly understands the image, we ask variants of questions based on the same knowledge about the same figure, and count the figure as correct only if the model answers all of them correctly. For example, the model should not give inconsistent responses to the questions "Is A bigger than B?" and "Is B smaller than A?".
* **Accuracy per Question**: Accuracy of all questions, including easy and hard questions.
* **Accuracy per Question Pair**: We ask the same question on similar images (or with and without the image). We consider the same question text under different visual contexts a **question pair** (usually an *easy* question and a corresponding *hard* question). This metric calculates the accuracy over all question pairs.
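The per-figure consistency test can be sketched as below. This is an illustrative sketch under assumptions, not the repo's evaluation code: grouping figures by `(set_id, figure_id)` (the identifiers seen in the data sample) and a boolean `correct` field are assumptions made for the example.

```python
from collections import defaultdict

def per_figure_accuracy(records):
    """Consistency test: a figure counts as correct only if *all*
    questions asked about that figure were answered correctly."""
    by_figure = defaultdict(list)
    for r in records:
        # Assumed grouping key: a figure is identified by set_id + figure_id.
        by_figure[(r["set_id"], r["figure_id"])].append(r["correct"])
    figures = list(by_figure.values())
    return sum(all(answers) for answers in figures) / len(figures)

records = [
    {"set_id": "0", "figure_id": "0", "correct": True},
    {"set_id": "0", "figure_id": "0", "correct": False},  # one miss fails the whole figure
    {"set_id": "0", "figure_id": "1", "correct": True},
]
acc = per_figure_accuracy(records)  # 1 of 2 figures fully correct -> 0.5
```

This all-or-nothing grouping is what makes the metric stricter than per-question accuracy: a model that answers "Is A bigger than B?" and "Is B smaller than A?" inconsistently loses credit for the entire figure.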
| Model | Question Pair Acc | Figure Acc | Easy Question Acc | Hard Question Acc | Question Acc | Json |
| ----- | :----: | :----: | :----: | :----: | :----: | :----: |
| **GPT4V** <br />Sep 25, 2023 Version <br />(Human Eval) | 31.42 | 44.22 | 79.56 | 38.37 | 67.58 | [VD](), [VS]() |
| **GPT4V** <br />Sep 25, 2023 Version <br />(GPT Eval) | 28.79 | 39.88 | 75.60 | 37.67 | 65.28 | [VD](), [VS]() |
| **LLaVA-1.5** <br />(Human Eval) | 9.45 | 25.43 | 50.77 | 29.07 | 47.12 | [VD](), [VS]() |
| **LLaVA-1.5** <br />(GPT Eval) | 10.55 | 24.86 | 49.67 | 29.77 | 46.94 | [VD](), [VS]() |
| **BLIP2-T5** <br />(GPT Eval) | 15.16 | 20.52 | 45.49 | 43.49 | 48.09 | [VD](), [VS]() |
| **InstructBLIP** <br />(GPT Eval) | 9.45 | 10.11 | 35.60 | 45.12 | 45.26 | [VD](), [VS]() |
| **Qwen-VL** <br />(GPT Eval) | 5.93 | 6.65 | 31.43 | 24.88 | 39.15 | [VD](), [VS]() |
| **Open-Flamingo** <br />(GPT Eval) | 6.37 | 11.27 | 39.56 | 27.21 | 38.44 | [VD](), [VS]() |
| **MiniGPT5** <br />(GPT Eval) |10.55 | 9.83 | 36.04| 28.37 | 40.30 | [VD](), [VS]() |
| **MiniGPT4** <br />(GPT Eval) |8.79 | 10.12 | 31.87| 27.67 | 35.78 | [VD](), [VS]() |
| **mPLUG_Owl-v2** <br />(GPT Eval) |13.85 | 19.94 | 44.84| 39.07 | 47.30 | [VD](), [VS]() |
| **mPLUG_Owl-v1** <br />(GPT Eval) |9.45 | 10.40 | 39.34| 29.77 | 43.93 | [VD](), [VS]() |
| **GiT** <br />(GPT Eval) |5.27 | 6.36 | 26.81| 31.86 | 34.37 | [VD](), [VS]() |
### Reproduce GPT4V results on leaderboard
1. We saved the output of GPT-4V with our annotations. Put `HallusionBench.tsv` in the root directory of this repo, or set `input_file_name` in [gpt4v_benchmark.py](./gpt4v_benchmark.py) to the location of the [HallusionBench.tsv](https://drive.google.com/file/d/1q8db7-7IlA4WLZ_5Jt-TpLDyAWg8Ybx4/view?usp=sharing) file.
2. (Optional) If you don't have access to the GPT API, you don't need to run it, since we have saved the evaluation results. They can be downloaded for [Visual Dependent]() and [Visual Supplement](). Put the json files in the root directory of this repo, or set `save_json_path_vd` and `save_json_path_vs` in [gpt4v_benchmark.py](./gpt4v_benchmark.py) to their respective locations.
3. Run `python gpt4v_benchmark.py`.
## Examples and Analysis
<p align="center" >
<img src="./examples/f-01.png" alt="Example 1" class="center" width="800"/>
<img src="./examples/f-02.png" alt="Example 2" class="center" width="800"/>
<img src="./examples/f-04.png" alt="Example 3" class="center" width="800"/>
<img src="./examples/f-05.png" alt="Example 4" class="center" width="800"/>
<img src="./examples/f-08.png" alt="Example 5" class="center" width="800"/>
<img src="./examples/f-15.png" alt="Example 6" class="center" width="800"/>
<img src="./examples/f-10.png" alt="Example 7" class="center" width="800"/>
<img src="./examples/f-12.png" alt="Example 8" class="center" width="800"/>
<img src="./examples/f-17.png" alt="Example 9" class="center" width="800"/>
</p>
---
license: bsd-3-clause
---