MM-NIAH

Name: MM-NIAH
Creator: maas
Published: 2025-12-04 16:19:36
License: No description provided

ModelScope Community: updated 2025-12-04, indexed 2024-12-28

Download link:

https://modelscope.cn/datasets/OpenGVLab/MM-NIAH

Resource description:

# <img width="60" alt="image" src="https://github.com/OpenGVLab/MM-NIAH/blob/main/assets/logo.png?raw=true"> Needle In A Multimodal Haystack

[[Project Page](https://mm-niah.github.io/)] [[arXiv Paper](http://arxiv.org/abs/2406.07230)] [[Dataset](https://huggingface.co/datasets/OpenGVLab/MM-NIAH)] [[Leaderboard](https://mm-niah.github.io/#leaderboard_test)] [[Github](https://github.com/OpenGVLab/MM-NIAH)]

## News🚀🚀🚀

- `2024/06/13`: 🚀We release Needle In A Multimodal Haystack ([MM-NIAH](https://huggingface.co/OpenGVLab/MM-NIAH)), the first benchmark designed to systematically evaluate the capability of existing multimodal large language models (MLLMs) to comprehend long multimodal documents. **Experimental results show that the performance of Gemini-1.5 on tasks with image needles is no better than a random guess.**

## Introduction

Needle In A Multimodal Haystack (MM-NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. The benchmark requires the model to answer specific questions according to key information scattered throughout a multimodal document. The evaluation data in MM-NIAH consists of three tasks: `retrieval`, `counting`, and `reasoning`. The needles are inserted into either the text or the images of the documents. Those inserted into text are termed `text needles`, whereas those inserted into images are referred to as `image needles`. Please see [our paper](http://arxiv.org/abs/2406.07230) for more details.

<img width="800" alt="image" src="https://github.com/OpenGVLab/MM-NIAH/blob/main/assets/data_examples.jpg?raw=true">

## Main Findings

Based on our benchmark, we conducted a series of experiments. The main findings are summarized as follows:

- The most advanced MLLMs (e.g. Gemini-1.5) still struggle to comprehend multimodal documents.
- **All MLLMs exhibit poor performance on image needles.**
- MLLMs fail to recognize the exact number of images in the document.
- Models pre-trained on image-text interleaved data do not exhibit superior performance.
- Training on background documents does not boost performance on MM-NIAH.
- The "Lost in the Middle" problem also exists in MLLMs.
- Long context capability of LLMs is NOT retained in MLLMs.
- RAG boosts Text Needle Retrieval but not Image Needle Retrieval.
- Placing questions before context does NOT improve model performance.
- Humans achieve near-perfect performance on MM-NIAH.

Please see [our paper](http://arxiv.org/abs/2406.07230) for more detailed analyses.

## Experimental Results

For the retrieval and reasoning tasks, we use Accuracy as the evaluation metric.

For the counting task, we use Soft Accuracy, defined as $\frac{1}{N} \sum_{i=1}^{N} \frac{m_i}{M_i}$, where $N$ is the number of samples, $m_i$ is the number of elements that match at corresponding positions between the predicted and ground-truth lists, and $M_i$ is the number of elements in the ground-truth list for the $i$-th sample. Note that the required output for this task is a list.

<img width="800" alt="image" src="https://github.com/OpenGVLab/MM-NIAH/blob/main/assets/main_table.jpg?raw=true">

<img width="800" alt="image" src="https://github.com/OpenGVLab/MM-NIAH/blob/main/assets/main_heatmap.jpg?raw=true">

<img width="800" alt="image" src="https://github.com/OpenGVLab/MM-NIAH/blob/main/assets/subtasks_table.jpg?raw=true">

## Evaluation

To calculate the scores, please prepare the model responses in jsonl format, following this [example](https://github.com/OpenGVLab/MM-NIAH/outputs_example/example-retrieval-text.jsonl). Then place all jsonl files in a single folder and execute our script [calculate_scores.py](https://github.com/OpenGVLab/MM-NIAH/calculate_scores.py) to get the heatmaps and scores.

```shell
python calculate_scores.py --outputs-dir /path/to/your/responses
```

For example, to reproduce the experimental results of [InternVL-1.5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5), first set up the environment following [the document](https://github.com/OpenGVLab/InternVL/blob/main/INSTALLATION.md) and download [the checkpoints](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5).

Then execute the evaluation script [eval_internvl.py](https://github.com/OpenGVLab/MM-NIAH/eval_internvl.py) for InternVL to obtain the results, using the following commands:

```shell
sh shells/eval_internvl.sh
python calculate_scores.py --outputs-dir ./outputs/
```

To reproduce the results of InternVL-1.5-RAG, first prepare the retrieved segments using the following commands:

```shell
sh shells/prepare_rag.sh
```

Then run these commands to obtain the results of InternVL-1.5-RAG:

```shell
sh shells/eval_internvl_rag.sh
python calculate_scores.py --outputs-dir ./outputs/
```

`NOTE`: Make sure that [flash-attention](https://github.com/Dao-AILab/flash-attention) is installed successfully; otherwise you will encounter `torch.cuda.OutOfMemoryError`.

## Leaderboard

🚨🚨 The leaderboard is continuously being updated.

To submit your results to the MM-NIAH leaderboard, please send your result jsonl files for each task to [this email](mailto:wangweiyun@pjlab.org.cn), following the template file [example-retrieval-text.jsonl](https://github.com/OpenGVLab/MM-NIAH/outputs_example/example-retrieval-text.jsonl).
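
Before sending results, it can help to check that no file is missing. The following is a minimal sketch, assuming the twelve-file naming scheme used in this section (`${model_name}_{task}-{needle}-{split}.jsonl`); the helper names `expected_result_files` and `missing_result_files` are hypothetical and are not part of the repository.

```python
from itertools import product
from pathlib import Path

# Task, needle, and split names as they appear in the file-naming scheme.
TASKS = ("retrieval", "counting", "reasoning")
NEEDLES = ("text", "image")
SPLITS = ("val", "test")


def expected_result_files(model_name: str) -> list[str]:
    """List the twelve result file names for one model (val files first)."""
    return [
        f"{model_name}_{task}-{needle}-{split}.jsonl"
        for split, task, needle in product(SPLITS, TASKS, NEEDLES)
    ]


def missing_result_files(outputs_dir: str, model_name: str) -> list[str]:
    """Return the expected file names that do not exist in outputs_dir."""
    folder = Path(outputs_dir)
    return [
        name
        for name in expected_result_files(model_name)
        if not (folder / name).is_file()
    ]
```

For example, `missing_result_files("./outputs", "my-model")` returns the names still absent from `./outputs`; an empty list means all twelve files are present.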

Please organize the result jsonl files as follows:

```
├── ${model_name}_retrieval-text-val.jsonl
├── ${model_name}_retrieval-image-val.jsonl
├── ${model_name}_counting-text-val.jsonl
├── ${model_name}_counting-image-val.jsonl
├── ${model_name}_reasoning-text-val.jsonl
├── ${model_name}_reasoning-image-val.jsonl
├──
├── ${model_name}_retrieval-text-test.jsonl
├── ${model_name}_retrieval-image-test.jsonl
├── ${model_name}_counting-text-test.jsonl
├── ${model_name}_counting-image-test.jsonl
├── ${model_name}_reasoning-text-test.jsonl
└── ${model_name}_reasoning-image-test.jsonl
```

## Visualization

If you want to visualize samples in MM-NIAH, please install `gradio==3.43.2` and run the script [visualization.py](https://github.com/OpenGVLab/MM-NIAH/visualization.py).

## Data Format

```python
{
    # int, starting from 0; each task type has independent ids.
    "id": xxx,
    # List of length N, where N is the number of images. Each element is a string
    # giving the relative path of an image. Images contained in "choices" are not
    # included here; only the images in "context" and "question" are recorded.
    "images_list": [
        "xxx",
        "xxx",
        "xxx"
    ],
    # str, multimodal haystack; "<image>" is used as the image placeholder.
    "context": "xxx",
    # str, question
    "question": "xxx",
    # Union[str, int, List], records the standard answer. Open-ended questions are
    # str or List (counting task); multiple-choice questions are int.
    "answer": "xxx",
    # meta_info, records various statistics
    "meta": {
        # Union[float, List[float]], range [0,1], position of the needle.
        # If multiple needles are inserted, it is List[float].
        "placed_depth": xxx,
        # int, number of text and visual tokens
        "context_length": xxx,
        # int, number of text tokens
        "context_length_text": xxx,
        # int, number of image tokens
        "context_length_image": xxx,
        # int, number of images
        "num_images": xxx,
        # List[str], inserted needles. For a text needle, record the text;
        # for an image needle, record the relative path of the image.
        "needles": [xxx, ..., xxx],
        # List[str], candidate text answers. If it is not a multiple-choice
        # question or there are no text candidates, write None.
        "choices": [xxx, ..., xxx],
        # List[str], candidate image answers, given as relative image paths. If it is
        # not a multiple-choice question or there are no image candidates, write None.
        "choices_image_path": [xxx, ..., xxx],
    }
}
```

`NOTE 1`: The number of `<image>` placeholders in the context and question equals the length of `images_list`.

`NOTE 2`: Save as a jsonl file, where each line is a `Dict`.

## Contact

- Weiyun Wang: wangweiyun@pjlab.org.cn
- Wenhai Wang: wangwenhai@pjlab.org.cn
- Wenqi Shao: shaowenqi@pjlab.org.cn

## Acknowledgement

The multimodal haystack of MM-NIAH is built upon the documents from [OBELICS](https://github.com/huggingface/OBELICS). In addition, our project page is adapted from [Nerfies](https://github.com/nerfies/nerfies.github.io) and [MathVista](https://github.com/lupantech/MathVista).

Thanks for their awesome work!

## Citation

```BibTex
@article{wang2024needle,
  title={Needle In A Multimodal Haystack},
  author={Wang, Weiyun and Zhang, Shuibo and Ren, Yiming and Duan, Yuchen and Li, Tiantong and Liu, Shuo and Hu, Mengkang and Chen, Zhe and Zhang, Kaipeng and Lu, Lewei and others},
  journal={arXiv preprint arXiv:2406.07230},
  year={2024}
}
```

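Two definitions in the README lend themselves to short sketches: the Soft Accuracy metric from Experimental Results and the placeholder rule in `NOTE 1` of Data Format. The code below is an illustration written from those definitions only; it is not the logic of `calculate_scores.py`, and the function names `soft_accuracy` and `placeholders_match` are hypothetical. It assumes every ground-truth list is non-empty.

```python
from typing import Any, Sequence

IMAGE_PLACEHOLDER = "<image>"


def soft_accuracy(
    predictions: Sequence[Sequence[Any]],
    references: Sequence[Sequence[Any]],
) -> float:
    """Mean over samples of m_i / M_i.

    m_i counts positions where prediction and ground truth agree;
    M_i is the length of the ground-truth list (assumed non-empty).
    """
    if len(predictions) != len(references) or not references:
        raise ValueError("need the same, non-zero number of samples")
    total = 0.0
    for pred, ref in zip(predictions, references):
        # zip stops at the shorter list, so extra predictions earn no credit
        # and missing predictions count as mismatches.
        matched = sum(p == r for p, r in zip(pred, ref))
        total += matched / len(ref)
    return total / len(references)


def placeholders_match(sample: dict) -> bool:
    """Check NOTE 1: number of "<image>" placeholders == len(images_list)."""
    count = sample["context"].count(IMAGE_PLACEHOLDER)
    count += sample["question"].count(IMAGE_PLACEHOLDER)
    return count == len(sample["images_list"])
```

As a worked case, predictions `[[1, 2, 3], [0, 1]]` against ground truth `[[1, 2, 4], [0, 1]]` give per-sample scores of 2/3 and 1, so the Soft Accuracy is 5/6.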

Provider: maas

Created: 2024-12-26
