
MM-NIAH | Multimodal Understanding Dataset | Language Model Evaluation Dataset

ModelScope community · updated 2025-10-09 · indexed 2024-12-28
Multimodal Understanding
Language Model Evaluation
Download link:
https://modelscope.cn/datasets/OpenGVLab/MM-NIAH
Resource description:
# Needle In A Multimodal Haystack

[[Project Page](https://mm-niah.github.io/)] [[arXiv Paper](http://arxiv.org/abs/2406.07230)] [[Dataset](https://huggingface.co/datasets/OpenGVLab/MM-NIAH)] [[Leaderboard](https://mm-niah.github.io/#leaderboard_test)] [[Github](https://github.com/OpenGVLab/MM-NIAH)]

## News🚀🚀🚀

- `2024/06/13`: 🚀We release Needle In A Multimodal Haystack ([MM-NIAH](https://huggingface.co/OpenGVLab/MM-NIAH)), the first benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. **Experimental results show that the performance of Gemini-1.5 on tasks with image needles is no better than a random guess.**

## Introduction

Needle In A Multimodal Haystack (MM-NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. The benchmark requires the model to answer specific questions according to key information scattered throughout the multimodal document. The evaluation data in MM-NIAH covers three tasks: `retrieval`, `counting`, and `reasoning`. The needles are inserted into either the text or the images of the documents. Needles inserted into text are termed `text needles`, whereas those inserted into images are referred to as `image needles`. Please see [our paper](http://arxiv.org/abs/2406.07230) for more details.

## Main Findings

Based on our benchmark, we conducted a series of experiments. The main findings are summarized as follows:

- The most advanced MLLMs (e.g., Gemini-1.5) still struggle to comprehend multimodal documents.
- **All MLLMs exhibit poor performance on image needles.**
- MLLMs fail to recognize the exact number of images in the document.
- Models pre-trained on image-text interleaved data do not exhibit superior performance.
- Training on background documents does not boost performance on MM-NIAH.
- The "Lost in the Middle" problem also exists in MLLMs.
- The long-context capability of LLMs is NOT retained in MLLMs.
- RAG boosts Text Needle Retrieval but not Image Needle Retrieval.
- Placing questions before the context does NOT improve model performance.
- Humans achieve near-perfect performance on MM-NIAH.

Please see [our paper](http://arxiv.org/abs/2406.07230) for more detailed analyses.

## Experimental Results

For the retrieval and reasoning tasks, we use Accuracy as the evaluation metric. For the counting task, we use Soft Accuracy, defined as $\frac{1}{N} \sum_{i=1}^{N} \frac{m_i}{M_i}$, where $m_i$ is the number of matched elements in the corresponding positions between the predicted and ground-truth lists and $M_i$ is the number of elements in the ground-truth list for the $i$-th sample. Note that the required output for this task is a list.

*Heatmaps and result tables with the detailed scores are provided as collapsible sections in the original README and are omitted here.*
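To make the Soft Accuracy definition concrete, the following is a minimal sketch of the metric computed from predicted and ground-truth lists for the counting task. The function name and input format are illustrative; this is not the implementation in `calculate_scores.py`.

```python
from typing import Sequence

def soft_accuracy(preds: Sequence[Sequence], gts: Sequence[Sequence]) -> float:
    """Mean of m_i / M_i, where m_i counts position-wise matches between the
    predicted and ground-truth lists and M_i is the ground-truth length."""
    assert len(preds) == len(gts)
    scores = []
    for pred, gt in zip(preds, gts):
        matched = sum(1 for p, g in zip(pred, gt) if p == g)  # m_i
        scores.append(matched / len(gt))                      # m_i / M_i
    return sum(scores) / len(scores)

# Two toy counting samples: (2/3 + 1/2) / 2 ≈ 0.583
print(soft_accuracy([[2, 1, 3], [0, 4]], [[2, 1, 4], [1, 4]]))
```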
## Evaluation

To calculate the scores, please prepare the model responses in jsonl format, like this [example](https://github.com/OpenGVLab/MM-NIAH/outputs_example/example-retrieval-text.jsonl). Then place all jsonl files in a single folder and execute our script [calculate_scores.py](https://github.com/OpenGVLab/MM-NIAH/calculate_scores.py) to get the heatmaps and scores.

```shell
python calculate_scores.py --outputs-dir /path/to/your/responses
```

For example, to reproduce the experimental results of [InternVL-1.5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5), first install the environment following [the document](https://github.com/OpenGVLab/InternVL/blob/main/INSTALLATION.md) and download [the checkpoints](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5). Then execute the evaluation script [eval_internvl.py](https://github.com/OpenGVLab/MM-NIAH/eval_internvl.py) for InternVL to obtain the results, using the following commands:

```shell
sh shells/eval_internvl.sh
python calculate_scores.py --outputs-dir ./outputs/
```

If you want to reproduce the results of InternVL-1.5-RAG, please first prepare the retrieved segments using the following command:

```shell
sh shells/prepare_rag.sh
```

Then, run these commands to obtain the results of InternVL-1.5-RAG:

```shell
sh shells/eval_internvl_rag.sh
python calculate_scores.py --outputs-dir ./outputs/
```

`NOTE`: Make sure that [flash-attention](https://github.com/Dao-AILab/flash-attention) is installed successfully; otherwise you will encounter `torch.cuda.OutOfMemoryError`.

## Leaderboard

🚨🚨 The leaderboard is continuously being updated.

To submit your results to the MM-NIAH leaderboard, please send your result jsonl files for each task to [this email](mailto:wangweiyun@pjlab.org.cn), referring to the template file [example-retrieval-text.jsonl](https://github.com/OpenGVLab/MM-NIAH/outputs_example/example-retrieval-text.jsonl). Please organize the result jsonl files as follows:

```
├── ${model_name}_retrieval-text-val.jsonl
├── ${model_name}_retrieval-image-val.jsonl
├── ${model_name}_counting-text-val.jsonl
├── ${model_name}_counting-image-val.jsonl
├── ${model_name}_reasoning-text-val.jsonl
├── ${model_name}_reasoning-image-val.jsonl
├──
├── ${model_name}_retrieval-text-test.jsonl
├── ${model_name}_retrieval-image-test.jsonl
├── ${model_name}_counting-text-test.jsonl
├── ${model_name}_counting-image-test.jsonl
├── ${model_name}_reasoning-text-test.jsonl
└── ${model_name}_reasoning-image-test.jsonl
```
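As a convenience, the sketch below shows one way to arrange prediction files into the `${model_name}_*` layout listed above. The source directory, source file names, and `model_name` value are placeholders rather than part of the MM-NIAH tooling; adapt them to wherever your outputs are actually stored.

```python
import shutil
from pathlib import Path

model_name = "my-model"        # placeholder model name
src_dir = Path("raw_outputs")  # placeholder: wherever your raw predictions live
dst_dir = Path("submission")
dst_dir.mkdir(exist_ok=True)

# Copy each task/needle/split file into the ${model_name}_${task}-${needle}-${split}.jsonl scheme.
for task in ("retrieval", "counting", "reasoning"):
    for needle in ("text", "image"):
        for split in ("val", "test"):
            name = f"{task}-{needle}-{split}.jsonl"
            src = src_dir / name  # assumed source naming; adjust as needed
            if src.exists():
                shutil.copy(src, dst_dir / f"{model_name}_{name}")
```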
## Visualization

If you want to visualize samples in MM-NIAH, please install `gradio==3.43.2` and run the script [visualization.py](https://github.com/OpenGVLab/MM-NIAH/visualization.py).

## Data Format

```python
{
    # int, starting from 0; each task type has independent ids.
    "id": xxx,
    # List of length N, where N is the number of images. Each element is a string
    # with the relative path of the image. Images contained in "choices" are not
    # included here; only the images in the "context" and "question" are recorded.
    "images_list": [
        "xxx",
        "xxx",
        "xxx"
    ],
    # str, multimodal haystack; "<image>" is used as the image placeholder.
    "context": "xxx",
    # str, question
    "question": "xxx",
    # Union[str, int, List], records the standard answer. Open-ended questions are
    # str or List (counting task); multiple-choice questions are int.
    "answer": "xxx",
    # meta_info, records various statistics
    "meta": {
        # Union[float, List[float]], range [0, 1], position of the needle.
        # If multiple needles are inserted, it is a List[float].
        "placed_depth": xxx,
        # int, number of text and visual tokens
        "context_length": xxx,
        # int, number of text tokens
        "context_length_text": xxx,
        # int, number of image tokens
        "context_length_image": xxx,
        # int, number of images
        "num_images": xxx,
        # List[str], inserted needles. For a text needle, the text is recorded;
        # for an image needle, the relative path of the image is recorded.
        "needles": [xxx, ..., xxx],
        # List[str], candidate text answers. If it is not a multiple-choice question
        # or there are no text candidates, this is None.
        "choices": [xxx, ..., xxx],
        # List[str], candidate image answers (relative paths of the images). If it is
        # not a multiple-choice question or there are no image candidates, this is None.
        "choices_image_path": [xxx, ..., xxx],
    }
}
```

`NOTE 1`: The number of `<image>` placeholders in the context and question equals the length of `images_list`.

`NOTE 2`: Save as a jsonl file; each line is a `Dict`.

## Contact

- Weiyun Wang: wangweiyun@pjlab.org.cn
- Wenhai Wang: wangwenhai@pjlab.org.cn
- Wenqi Shao: shaowenqi@pjlab.org.cn

## Acknowledgement

The multimodal haystack of MM-NIAH is built upon the documents from [OBELICS](https://github.com/huggingface/OBELICS). Besides, our project page is adapted from [Nerfies](https://github.com/nerfies/nerfies.github.io) and [MathVista](https://github.com/lupantech/MathVista). Thanks for their awesome work!

## Citation

```BibTex
@article{wang2024needle,
  title={Needle In A Multimodal Haystack},
  author={Wang, Weiyun and Zhang, Shuibo and Ren, Yiming and Duan, Yuchen and Li, Tiantong and Liu, Shuo and Hu, Mengkang and Chen, Zhe and Zhang, Kaipeng and Lu, Lewei and others},
  journal={arXiv preprint arXiv:2406.07230},
  year={2024}
}
```
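As a quick sanity check of the data format described above, here is a minimal sketch that loads one task's jsonl file and verifies `NOTE 1`, i.e. that the number of `<image>` placeholders in each sample matches the length of `images_list`. The file path is a placeholder; point it at whichever MM-NIAH jsonl file you have downloaded.

```python
import json

data_path = "path/to/retrieval-text-val.jsonl"  # placeholder path to a task file

with open(data_path, encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]  # one Dict per line (NOTE 2)

for sample in samples:
    text = sample["context"] + sample["question"]
    # NOTE 1: image placeholders should match the number of listed images.
    assert text.count("<image>") == len(sample["images_list"]), sample["id"]

print(f"Checked {len(samples)} samples.")
```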
Provider:
maas
Created:
2024-12-26