BLINK

Name: BLINK
Creator: maas
Published: 2026-05-16 19:22:03
License: No description available

ModelScope community: updated 2026-05-16, indexed 2025-09-13

Download link:

https://modelscope.cn/datasets/evalscope/BLINK



Resource description:

# BLINK: Multimodal Large Language Models Can See but Not Perceive

[**🌐 Homepage**](https://zeyofu.github.io/blink/) | [**💻 Code**](https://github.com/zeyofu/BLINK_Benchmark) | [**📖 Paper**](https://arxiv.org/abs/2404.12390.pdf) | [**📖 arXiv**](https://arxiv.org/abs/2404.12390) | [**🔗 Eval AI**](https://eval.ai/web/challenges/challenge-page/2287/overview)

This page contains the benchmark dataset for the paper "[BLINK: Multimodal Large Language Models Can See but Not Perceive](https://arxiv.org/abs/2404.12390.pdf)".

## Introduction

We introduce **BLINK**, a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the **BLINK** tasks can be solved by humans “within a blink” (e.g., *relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning*). However, we find that these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language.

**BLINK** reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans reach 95.70% accuracy on average, **BLINK** is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17 and 7.63 percentage points higher than random guessing, indicating that such perception abilities have not “emerged” yet in recent multimodal LLMs.

Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe **BLINK** will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
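The margins quoted in the introduction follow from simple subtraction. As a minimal sketch, the snippet below recovers the random-guessing baseline implied by the reported accuracies and margins. The dictionary layout and the function name `implied_random_baseline` are illustrative assumptions, not part of the BLINK codebase.

```python
# Accuracies (%) and margins over random guessing, as quoted in the introduction.
reported = {
    "GPT-4V": {"accuracy": 51.26, "margin": 13.17},
    "Gemini": {"accuracy": 45.72, "margin": 7.63},
}


def implied_random_baseline(accuracy: float, margin: float) -> float:
    """Return the random-guessing accuracy implied by accuracy minus margin."""
    # Hypothetical helper for illustration; rounds to the two decimals reported.
    return round(accuracy - margin, 2)


baselines = {
    name: implied_random_baseline(**values) for name, values in reported.items()
}

for name, baseline in baselines.items():
    print(f"{name}: implied random baseline = {baseline:.2f}%")
```

Both entries imply the same baseline, which is a quick consistency check on the reported margins.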
![Alt text](assets/teaser.png)

## Dataset Creation

BLINK is created to challenge multimodal models on holistic visual perception abilities with tasks inherited from classic computer vision problems, stimulating future development of multimodal LLMs that achieve human-level visual perception.

**Unique features** of BLINK include diverse visual prompting, perception beyond recognition, and visual commonsense.

## Load Dataset

```
from datasets import load_dataset

dataset_name = 'BLINK-Benchmark/BLINK'
# SUBTASK_NAME is a placeholder: set it to one of the subtask names listed below.
data = load_dataset(dataset_name, SUBTASK_NAME)
```

where `SUBTASK_NAME` is one of the subtasks: `['Art_Style', 'Functional_Correspondence', 'Multi-view_Reasoning', 'Relative_Reflectance', 'Visual_Correspondence', 'Counting', 'IQ_Test', 'Object_Localization', 'Semantic_Correspondence', 'Visual_Similarity', 'Forensic_Detection', 'Jigsaw', 'Relative_Depth', 'Spatial_Relation']`

## 🏆 Mini-Leaderboard

| Model | Val (1,901) | Test (1,907) |
|----------------------------|:-----------:|:------------:|
|🏅 Human | 95.7 | 95.7 |
|🥈 GPT-4o | 60.0 | **59.0** |
|🥉 GPT-4 Turbo | 54.6 | 53.9 |
| GPT-4V(ision preview) | 51.1 | 51.3 |
| Gemini Pro | 45.2 | 45.7 |
| LLaVA-1.6-34B | 46.8 | 45.1 |
| Claude 3 Opus | 44.1 | 44.1 |
| Yi-VL-34B | 41.7 | 42.8 |
| Qwen-VL-MAX | 40.3 | 42.0 |
| LLaVA-v1.5-13B-xtuner | 42.0 | 41.3 |
| Yi-VL-6B | 38.7 | 41.7 |
| LLaVA-v1.5-7B-xtuner | 39.4 | 40.8 |
| LLaVA-1.5-13B | 42.7 | 40.6 |
| InstructBLIP-13B | 42.2 | 39.6 |
| CogVLM | 41.5 | 39.4 |
| InstructBLIP-7B | 39.7 | 38.7 |
| OpenFlamingo2-9B | 39.2 | 38.3 |
|👀 **Random Choice** | 38.1 | 38.1 |
| LLaVA-1.5-7B | 37.1 | 38.0 |
| LLaVA-internLM2-7B | 37.7 | 36.1 |
| MiniGPT-4-v2-7B | 34.2 | 34.6 |

<img src="assets/radar_v1.png" width="400" />

🎯 **We have released a full suite comprising 1,901 validation samples, the prompts we used, and [model predictions](https://github.com/zeyofu/BLINK_Benchmark/tree/main/eval) for the baselines tested in our paper.
However, the 1,907 test questions are available without their answers.** You can submit your model's predictions for the **test set** on **[EvalAI](https://eval.ai/web/challenges/challenge-page/2287/overview)**.

## Disclaimers

BLINK makes use of data from existing image datasets, and does not cover all the visual perception abilities in the wild. For the forensics detection task, we manually collected images that are publicly available from online search. We have made every effort to ensure that the images included in this paper are used in accordance with applicable copyright laws and are properly credited. However, if you are the copyright owner of any image included in our work and believe that its use conflicts with your licensing agreements, please [contact](#contact) us directly. We are committed to addressing any legitimate concerns promptly.

## Contact

- Xingyu Fu: xingyuf2@seas.upenn.edu
- Yushi Hu: yushihu@uw.edu
- Wei-Chiu Ma: weichiu@cornell.edu
- Ranjay Krishna: ranjay@cs.washington.edu

## Citation

**BibTeX:**

```bibtex
@article{fu2024blink,
  title={BLINK: Multimodal Large Language Models Can See but Not Perceive},
  author={Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A and Ma, Wei-Chiu and Krishna, Ranjay},
  journal={arXiv preprint arXiv:2404.12390},
  year={2024}
}
```

Daily Paper: https://huggingface.co/papers/2404.12390
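The loading call in the Load Dataset section takes one subtask at a time. The following is a minimal sketch for collecting every subtask in one dictionary. `load_all_subtasks` and its `loader` parameter are hypothetical helpers introduced here for illustration and are not part of the BLINK codebase. The loader is passed in as an argument so the helper can be exercised without network access.

```python
from typing import Any, Callable, Dict, Sequence

# Subtask names as listed in the Load Dataset section.
SUBTASKS = [
    "Art_Style", "Functional_Correspondence", "Multi-view_Reasoning",
    "Relative_Reflectance", "Visual_Correspondence", "Counting", "IQ_Test",
    "Object_Localization", "Semantic_Correspondence", "Visual_Similarity",
    "Forensic_Detection", "Jigsaw", "Relative_Depth", "Spatial_Relation",
]


def load_all_subtasks(
    loader: Callable[[str, str], Any],
    dataset_name: str = "BLINK-Benchmark/BLINK",
    subtasks: Sequence[str] = SUBTASKS,
) -> Dict[str, Any]:
    """Call loader(dataset_name, subtask) once per subtask (hypothetical helper)."""
    return {name: loader(dataset_name, name) for name in subtasks}


# Real usage (assumes the `datasets` package is installed and the hub is reachable):
#   from datasets import load_dataset
#   data = load_all_subtasks(load_dataset)
```

Injecting the loader also makes it straightforward to swap in a cached or mirrored source, if one is available.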


Provider:

maas

Created:

2025-09-12

Summary

Dataset introduction