rayguan/HallusionBench

Name: rayguan/HallusionBench
Creator: rayguan
Published: 2023-12-10 18:14:47
License: no description provided

Hugging Face, updated 2023-12-10, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/rayguan/HallusionBench


Resource description:

# HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models

You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

[Tianrui Guan*](https://tianruiguan.phd), [Fuxiao Liu*](https://fuxiaoliu.github.io/), Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou

🔥🔥🔥

## We welcome everyone to contribute the failure cases of Large Multimodal Models (GPT-4V) to our community!

🔥🔥🔥

Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks, as shown by the recently released GPT-4V(ision), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and rely solely on the (even contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than LLMs and may produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which offers new insights into the illusion and hallucination of VLMs and how to improve them in the future.
If you find our paper useful, please cite it:

```bibtex
@misc{guan2023hallusionbench,
  title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models},
  author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
  year={2023},
  eprint={2310.14566},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{liu2023mitigating,
  title={Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning},
  author={Fuxiao Liu and Kevin Lin and Linjie Li and Jianfeng Wang and Yaser Yacoob and Lijuan Wang},
  year={2023},
  eprint={2306.14565},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

## Updates

- [11/28] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2310.14566). The dataset has been expanded and the leaderboard updated.
- [11/13] 🔥 Evaluation results for LLaVA-1.5 are updated. More model results to come!
- [10/27] 🔥 The [leaderboard](https://paperswithcode.com/sota/visual-question-answering-vqa-on-3) and evaluation code are released! **Welcome to update your model on our leaderboard!**
- [10/24] 🔥 The early report with case analysis and insights is available [here](https://arxiv.org/abs/2310.14566).
- [10/23] 🔥 Please check our previous work on mitigating hallucinations of LMMs: ["Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning"](https://github.com/FuxiaoLiu/LRV-Instruction).

## Dataset Download

To keep evaluation simple, we provide only yes/no questions.
| Updated on | Questions and Annotations | Figures | Question Count | Figure Count |
| ----------- | :----: | :----: | :----: | :----: |
| Oct 27, 2023 | [HallusionBench.json](./HallusionBench.json) | [hallusion_bench.zip](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) | 254 | 69 |

### Evaluation

1. Clone the repo.

   ```
   git clone https://github.com/tianyi-lab/HallusionBench.git
   cd ./HallusionBench
   ```

2. Download the images [hallusion_bench.zip](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) and unzip the folder in the same directory.

3. The questions and image locations are saved in `./HallusionBench.json`. A data sample looks like this:

   ```
   {'category': 'VD', 'subcategory': 'illusion', 'visual_input': '1',
    'set_id': '0', 'figure_id': '0', 'sample_note': 'circle', 'question_id': '0',
    'question': 'Is the right orange circle the same size as the left orange circle?',
    'gt_answer_details': 'The right orange circle is the same size as the left orange circle.',
    'gt_answer': '1',
    'filename': './hallusion_bench/VD/illusion/0_0.png'}
   ```

   The key `visual_input` indicates whether the question needs visual input such as an image. `visual_input=1` means the question needs visual input. `visual_input=0` means it does not; it is a text-only question.

4. Run your model on `./HallusionBench.json` and save the output file as `./HallusionBench_result.json`. Add the output of your model under the key `'model_prediction'`. We provide a sample result [here](./HallusionBench_result_sample.json).

5. Finally, run the following code for evaluation:

   ```
   python evaluation.py
   ```

   You can use your own API key for GPT4 evaluation by editing the code [here](./utils.py#L10).

## Leaderboard

### Definition

* **Visual Dependent (VD) Questions**: questions that do not have an affirmative answer without the visual context.
  * **Easy**: Original images obtained from the Internet.
  * **Hard**: Images edited from the originals.
* **Visual Supplement (VS) Questions**: questions that can be answered without the visual input; the visual component merely provides supplemental information.
  * **Easy**: No visual input. An uncertain answer without hallucination is also considered a correct response.
  * **Hard**: With visual input. The answer must follow the provided figure and visual context.

### Metric

* **Accuracy per Figure (Consistency Test)**: Accuracy based on each figure. To make sure the model truly understands the image, we ask variants of questions based on the same knowledge about the same figure, and count the figure as correct only if the model answers all of them correctly. For example, the model should not give inconsistent responses to the questions "Is A bigger than B?" and "Is B smaller than A?".
* **Accuracy per Question**: Accuracy over all questions, including easy and hard questions.
* **Accuracy per Question Pair**: We ask the same question on similar images (or with and without images). We consider the same question text on different visual contexts a **question pair** (usually an *easy* question and its corresponding *hard* question). This metric calculates the accuracy over all question pairs.
| Model | Question Pair Acc | Figure Acc | Easy Question Acc | Hard Question Acc | Question Acc | Json |
| ----- | :----: | :----: | :----: | :----: | :----: | :----: |
| **GPT4V** Sep 25, 2023 Version (Human Eval) | 31.42 | 44.22 | 79.56 | 38.37 | 67.58 | [VD](), [VS]() |
| **GPT4V** Sep 25, 2023 Version (GPT Eval) | 28.79 | 39.88 | 75.60 | 37.67 | 65.28 | [VD](), [VS]() |
| **LLaVA-1.5** (Human Eval) | 9.45 | 25.43 | 50.77 | 29.07 | 47.12 | [VD](), [VS]() |
| **LLaVA-1.5** (GPT Eval) | 10.55 | 24.86 | 49.67 | 29.77 | 46.94 | [VD](), [VS]() |
| **BLIP2-T5** (GPT Eval) | 15.16 | 20.52 | 45.49 | 43.49 | 48.09 | [VD](), [VS]() |
| **InstructBLIP** (GPT Eval) | 9.45 | 10.11 | 35.60 | 45.12 | 45.26 | [VD](), [VS]() |
| **Qwen-VL** (GPT Eval) | 5.93 | 6.65 | 31.43 | 24.88 | 39.15 | [VD](), [VS]() |
| **Open-Flamingo** (GPT Eval) | 6.37 | 11.27 | 39.56 | 27.21 | 38.44 | [VD](), [VS]() |
| **MiniGPT5** (GPT Eval) | 10.55 | 9.83 | 36.04 | 28.37 | 40.30 | [VD](), [VS]() |
| **MiniGPT4** (GPT Eval) | 8.79 | 10.12 | 31.87 | 27.67 | 35.78 | [VD](), [VS]() |
| **mPLUG_Owl-v2** (GPT Eval) | 13.85 | 19.94 | 44.84 | 39.07 | 47.30 | [VD](), [VS]() |
| **mPLUG_Owl-v1** (GPT Eval) | 9.45 | 10.40 | 39.34 | 29.77 | 43.93 | [VD](), [VS]() |
| **GiT** (GPT Eval) | 5.27 | 6.36 | 26.81 | 31.86 | 34.37 | [VD](), [VS]() |

### Reproduce GPT4V results on leaderboard

1. We saved the output of GPT4V with our annotation. Put `HallusionBench.tsv` in the root directory of this repo, or set `input_file_name` in [gpt4v_benchmark.py](./gpt4v_benchmark.py) to the location of the [HallusionBench.tsv](https://drive.google.com/file/d/1q8db7-7IlA4WLZ_5Jt-TpLDyAWg8Ybx4/view?usp=sharing) file.

2. (Optional) If you don't have access to the GPT API, you don't need to run it, since we have saved the evaluation results. They can be downloaded for [Visual Dependent]() and [Visual Supplement]().
   Put the JSON files in the root directory of this repo, or set `save_json_path_vd` and `save_json_path_vs` in [gpt4v_benchmark.py](./gpt4v_benchmark.py) to their respective locations.

3. Run `python gpt4v_benchmark.py`.

## Examples and Analysis

<img src="./examples/f-01.png" alt="Example 1" class="center" width="800"/>
<img src="./examples/f-02.png" alt="Example 2" class="center" width="800"/>
<img src="./examples/f-04.png" alt="Example 3" class="center" width="800"/>
<img src="./examples/f-05.png" alt="Example 4" class="center" width="800"/>
<img src="./examples/f-08.png" alt="Example 5" class="center" width="800"/>
<img src="./examples/f-15.png" alt="Example 6" class="center" width="800"/>
<img src="./examples/f-10.png" alt="Example 7" class="center" width="800"/>
<img src="./examples/f-12.png" alt="Example 8" class="center" width="800"/>
<img src="./examples/f-17.png" alt="Example 9" class="center" width="800"/>

---

license: bsd-3-clause

---


Provided by:

rayguan

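Summary of original information

HallusionBench dataset overview

Dataset description

HallusionBench is an advanced diagnostic suite for evaluating language hallucination and visual illusion in large vision-language models (VLMs) on image reasoning tasks. It contains a set of challenging image-context reasoning questions designed to expose and analyze how VLMs fail when processing image information.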

Dataset update

Last updated: October 27, 2023
Questions and annotations file: HallusionBench.json
Image archive: hallusion_bench.zip
Question count: 254
Figure count: 69

Dataset structure

All questions in the dataset are yes/no questions. A sample record:

    {
      "category": "VD",
      "subcategory": "illusion",
      "visual_input": "1",
      "set_id": "0",
      "figure_id": "0",
      "sample_note": "circle",
      "question_id": "0",
      "question": "Is the right orange circle the same size as the left orange circle?",
      "gt_answer_details": "The right orange circle is the same size as the left orange circle.",
      "gt_answer": "1",
      "filename": "./hallusion_bench/VD/illusion/0_0.png"
    }

visual_input indicates whether the question needs visual input: 1 means it does, 0 means it does not (a text-only question).
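A minimal sketch of how the visual_input flag might be used to split the questions, assuming HallusionBench.json is a JSON list of records shaped like the sample above. The helper names `load_samples` and `split_by_visual_input` are illustrative and not part of the repository, and treating any value other than 0 as "needs an image" is an assumption.

```python
import json
from pathlib import Path


def load_samples(path="./HallusionBench.json"):
    # Assumption: the file holds a JSON list of question records.
    return json.loads(Path(path).read_text(encoding="utf-8"))


def split_by_visual_input(samples):
    """Return (needs_image, text_only) lists based on the visual_input field."""
    # Assumption: any value other than "0" means the question needs an image.
    needs_image = [s for s in samples if str(s["visual_input"]) != "0"]
    text_only = [s for s in samples if str(s["visual_input"]) == "0"]
    return needs_image, text_only
```

Calling `split_by_visual_input(load_samples())` from the repository root would then give the two subsets to feed to a multimodal and a text-only inference path, respectively.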

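Evaluation method

Clone the repository:

    git clone https://github.com/tianyi-lab/HallusionBench.git
    cd ./HallusionBench

Download the image archive hallusion_bench.zip from https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing and unzip it in the repository directory. The link opens a Google Drive viewer page, so fetching it with plain wget saves that HTML page rather than the archive; download the file through the browser, then run:

    unzip hallusion_bench.zip

Run the model and save the results:
- Run the model on ./HallusionBench.json and save the output as ./HallusionBench_result.json.
- Add a model_prediction key holding the model's output to each record.
Evaluate the model:

    python evaluation.py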
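The "run the model and save the results" step could be sketched as follows. This is an illustration under stated assumptions, not the repository's own script: `predict` is a hypothetical callable standing in for your model, and the function names `add_predictions` and `run` are made up for this example. Only the file names and the model_prediction key come from the instructions above.

```python
import json


def add_predictions(samples, predict):
    """Return copies of the samples with a 'model_prediction' key added.

    `predict` is a hypothetical callable (question, image_path_or_None) -> str
    that wraps your own model.
    """
    results = []
    for sample in samples:
        # Assumption: any visual_input value other than "0" needs the image file.
        needs_image = str(sample.get("visual_input", "0")) != "0"
        image_path = sample.get("filename") if needs_image else None
        answer = predict(sample["question"], image_path)
        results.append({**sample, "model_prediction": answer})
    return results


def run(predict,
        input_path="./HallusionBench.json",
        output_path="./HallusionBench_result.json"):
    # Read the questions, attach predictions, and write the result file
    # that evaluation.py is expected to pick up.
    with open(input_path, encoding="utf-8") as f:
        samples = json.load(f)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(add_predictions(samples, predict), f, indent=2)
```

Keeping every original field and only adding model_prediction means the result file stays aligned with the question file record by record.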

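Leaderboard definitions

Visual Dependent (VD) questions: questions that have no affirmative answer without the visual context.
- Easy: original images obtained from the Internet.
- Hard: images edited from the originals.
Visual Supplement (VS) questions: questions that can be answered without visual input; the visual component only provides supplemental information.
- Easy: no visual input. An uncertain answer without hallucination also counts as correct.
- Hard: with visual input. The answer must follow the provided figure and visual context.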
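For VS questions, the easy/hard split above follows directly from the visual_input field, which a short sketch can make concrete. The function name `difficulty` is hypothetical. For VD questions the split depends on whether the image is original or edited, which the fields described in this summary do not encode, so the sketch deliberately returns None there rather than guessing.

```python
def difficulty(sample):
    """Return 'easy', 'hard', or None when the documented fields are not enough."""
    if sample["category"] == "VS":
        # VS easy: no visual input; VS hard: visual input is provided.
        return "easy" if str(sample["visual_input"]) == "0" else "hard"
    # VD: original vs. edited image is not derivable from the fields shown here.
    return None
```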

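Evaluation metrics

Accuracy per figure (consistency test): accuracy computed per figure. A figure counts as correct only if the model answers every question about it correctly, which checks that the model truly understands the image.
Accuracy per question: accuracy over all questions, easy and hard.
Accuracy per question pair: the same question is asked on similar images (or with and without an image), and accuracy is computed over all such question pairs.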
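The three metrics can be illustrated with a small sketch. It rests on several assumptions that are not confirmed by this page: each record carries a boolean `correct` field (hypothetical; the repository's evaluation.py decides correctness itself), a figure is identified by (category, subcategory, set_id, figure_id), a question pair by (category, subcategory, set_id, question_id), and a pair counts as correct only when every question in it is correct.

```python
from collections import defaultdict


def grouped_accuracy(records, key_fields):
    """Fraction of groups in which every record is marked correct."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in key_fields)].append(bool(r["correct"]))
    if not groups:
        return 0.0
    return sum(all(flags) for flags in groups.values()) / len(groups)


def summarize(records):
    # Assumed grouping keys; see the lead-in for what is hypothetical.
    base = ("category", "subcategory", "set_id")
    question_acc = (
        sum(bool(r["correct"]) for r in records) / len(records) if records else 0.0
    )
    return {
        "question_acc": question_acc,
        "figure_acc": grouped_accuracy(records, base + ("figure_id",)),
        "question_pair_acc": grouped_accuracy(records, base + ("question_id",)),
    }
```

The all-or-nothing grouping is why figure and pair accuracy on the leaderboard sit well below per-question accuracy: one wrong answer invalidates the whole group.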

Leaderboard

| Model | Question Pair Acc | Figure Acc | Easy Question Acc | Hard Question Acc | Question Acc |
| ----- | :----: | :----: | :----: | :----: | :----: |
| GPT4V (Human Eval) | 31.42 | 44.22 | 79.56 | 38.37 | 67.58 |
| GPT4V (GPT Eval) | 28.79 | 39.88 | 75.60 | 37.67 | 65.28 |
| LLaVA-1.5 (Human Eval) | 9.45 | 25.43 | 50.77 | 29.07 | 47.12 |
| LLaVA-1.5 (GPT Eval) | 10.55 | 24.86 | 49.67 | 29.77 | 46.94 |
| BLIP2-T5 (GPT Eval) | 15.16 | 20.52 | 45.49 | 43.49 | 48.09 |
| InstructBLIP (GPT Eval) | 9.45 | 10.11 | 35.60 | 45.12 | 45.26 |
| Qwen-VL (GPT Eval) | 5.93 | 6.65 | 31.43 | 24.88 | 39.15 |
| Open-Flamingo (GPT Eval) | 6.37 | 11.27 | 39.56 | 27.21 | 38.44 |
| MiniGPT5 (GPT Eval) | 10.55 | 9.83 | 36.04 | 28.37 | 40.30 |
| MiniGPT4 (GPT Eval) | 8.79 | 10.12 | 31.87 | 27.67 | 35.78 |
| mPLUG_Owl-v2 (GPT Eval) | 13.85 | 19.94 | 44.84 | 39.07 | 47.30 |
| mPLUG_Owl-v1 (GPT Eval) | 9.45 | 10.40 | 39.34 | 29.77 | 43.93 |
| GiT (GPT Eval) | 5.27 | 6.36 | 26.81 | 31.86 | 34.37 |

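Collected summary

Dataset introduction

Construction

HallusionBench is designed to probe language hallucination and visual illusion in large vision-language models (VLMs) on image reasoning tasks. It contains two types of questions: Visual Dependent (VD) and Visual Supplement (VS). VD questions have no affirmative answer without the visual input, whereas VS questions can be answered without it. The dataset includes original images obtained from the Internet, edited versions of those images, and the corresponding yes/no questions. To test consistency, it also includes variant questions based on the same knowledge about the same figure, and the model is expected to answer all of them consistently.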

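Features

The main feature of HallusionBench is that it exposes the weaknesses of large vision-language models on image reasoning tasks. Its questions are designed to challenge state-of-the-art models such as GPT-4V and LLaVA-1.5. The authors also provide a detailed case analysis that gives researchers new insights into hallucination and illusion in vision-language models and points to directions for future improvement.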

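Usage

To use HallusionBench: 1. Clone the dataset's GitHub repository. 2. Download and unzip the image archive. 3. Run your model to produce predictions and save the results as a JSON file. 4. Evaluate the results with the provided evaluation.py script. The metrics are accuracy per figure, accuracy per question, and accuracy per question pair. Users can also supply their own API key in the code for the GPT-4 based evaluation.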

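Background and challenges

Background overview

In the field of vision-language models (VLMs), combining large language models (LLMs) with vision models has led to notable progress, as seen in models such as GPT-4V and LLaVA-1.5. However, these models often rely too heavily on the language prior during reasoning and ignore the image context, or they produce misleading visual representations; the results are language hallucination and visual illusion. To study these problems, Tianrui Guan and colleagues created HallusionBench, an image-context reasoning benchmark for VLMs that challenges and evaluates how these models handle image-context questions. Through detailed case analysis, HallusionBench offers a new perspective for understanding and improving VLMs.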

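Current challenges

HallusionBench addresses two main challenges. The first is the domain problem: how to accurately evaluate VLMs on image-context reasoning. The dataset contains Visual Dependent (VD) and Visual Supplement (VS) questions, which examine, respectively, how much the model depends on image information and how visual information affects the answer. The second lies in construction: how to design sound evaluation metrics that check whether the model's answers are consistent with the image context, and how to build a dataset that is both fair and challenging enough to advance VLM research.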

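Common scenarios

Typical use cases

HallusionBench is designed for large vision-language models (VLMs) and aims to expose language hallucination and visual illusion in image-context reasoning. Its challenging questions test whether a model relies too heavily on the language prior or is misled by the representations produced by its vision module. With its detailed case analysis, HallusionBench helps researchers understand hallucination and illusion in VLMs and explore ways to improve future models.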

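Derived and related work

The release of HallusionBench has prompted follow-up research. Some studies use HallusionBench to evaluate and improve VLMs and to explore ways of mitigating language hallucination and visual illusion. Others build on it to propose new VLM architectures and training strategies aimed at better image-context reasoning. This line of work contributes to VLM development and offers new approaches to practical problems such as image understanding and visual question answering.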

Recent research on the dataset