Do-You-See-Me

Name: Do-You-See-Me
Creator: maas
Published: 2025-12-05 12:12:29
License: No description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-07-26.

Download link:

https://modelscope.cn/datasets/microsoft/Do-You-See-Me

Description:

# DoYouSeeMe

<div style="display: flex; justify-content: space-between;">
<img src="img/main_fig.png" width="100%" alt="Results on Do You See Me">
</div>

## Overview

The DoYouSeeMe benchmark is a comprehensive evaluation framework designed to assess visual perception capabilities in Multimodal Large Language Models (MLLMs). This fully automated test suite dynamically generates both visual stimuli and perception-focused questions (VPQA) with incremental difficulty levels, enabling a graded evaluation of MLLM performance across multiple perceptual dimensions. Our benchmark consists of both 2D and 3D photorealistic evaluations of MLLMs.

## Theoretical Foundation

The dataset's structure is grounded in established human psychological frameworks that categorize visual perception into core abilities (Chalfant and Scheffelin, 1969). Drawing inspiration from standardized assessments like the Test of Visual Perception Skills (TVPS) (Gardner, 1988) and the Motor-Free Visual Perception Test (MVPT) (Colarusso, 2003), DoYouSeeMe adapts these principles to create a systematic evaluation methodology for machine vision systems.

## Perceptual Dimensions

The benchmark focuses on seven key dimensions of visual perception:

1. **Shape Discrimination (2D and 3D)**: Evaluates the ability to recognize shapes.
2. **Joint Shape-Color Discrimination (2D and 3D)**: Evaluates the ability to jointly recognize shapes and colors.
3. **Visual Form Constancy (2D and 3D)**: Tests the MLLM's ability to identify a test shape configuration among similarly placed distractors.
4. **Letter Disambiguation (2D and 3D)**: Tests the recognition of letters.
5. **Visual Figure-Ground (2D)**: Evaluates the ability to distinguish the main object from its background under varying conditions.
6. **Visual Closure (2D)**: Assesses the ability to complete partially obscured shapes by mentally filling in missing information.
7. **Visual Spatial (2D and 3D)**: Examines the ability to perceive positions of objects relative to oneself and to other objects.

Note: While human visual perception also includes Visual Memory (the ability to remember sequences of presented images), this dimension is omitted from the benchmark as current MLLMs lack short-term visual memory capabilities beyond textual descriptions.

## Technical Implementation

The entire dataset generation framework is implemented in Python and uses SVG representations to create visual stimuli with precisely controlled parameters. This approach allows for:

- Dynamic generation of test images with systematic variations
- Controlled difficulty progression across perception dimensions
- Reproducible evaluation conditions
- Fine-grained assessment of model performance

### Control Parameters

<div style="display: flex; justify-content: space-between;">
<img src="img/control_param_syn_dataset.png" width="100%" alt="Results on Do You See Me">
</div>

The code is open-sourced to facilitate further research and advancement in the field of visual perception for artificial intelligence systems.

Paper: [DoYouSeeMe Benchmark on arXiv](https://arxiv.org/pdf/2506.02022)

Code: [DoYouSeeMe Github Repo](https://github.com/microsoft/Do-You-See-Me)

## Samples

### Visual Spatial

Tests the ability to perceive and understand spatial relationships between objects. Evaluates orientation discrimination and positional awareness.
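To make the SVG-based generation described under Technical Implementation more concrete, the following is a minimal sketch of how a Visual Spatial item and its ground-truth answer might be produced together. It is not the benchmark's actual generator: the shape vocabulary, the function names (`make_grid`, `grid_to_svg`, `count_right_of`), the use of grid size as a difficulty knob, and the "to the right of" question are all assumptions made for illustration.

```python
import random

# Hypothetical shape vocabulary; the real generator may use a different set.
SHAPES = ("circle", "triangle", "square")


def make_grid(rows, cols, seed):
    """Sample a rows x cols grid of shape names.

    A fixed seed makes the stimulus reproducible; a larger grid is assumed
    here to correspond to a harder item.
    """
    rng = random.Random(seed)
    return [[rng.choice(SHAPES) for _ in range(cols)] for _ in range(rows)]


def shape_svg(shape, cx, cy, r):
    """Return one SVG element for a shape centred at (cx, cy) with half-size r."""
    if shape == "circle":
        return f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="black"/>'
    if shape == "square":
        return (f'<rect x="{cx - r}" y="{cy - r}" '
                f'width="{2 * r}" height="{2 * r}" fill="black"/>')
    # Anything else is drawn as an upward-pointing triangle.
    points = f"{cx},{cy - r} {cx - r},{cy + r} {cx + r},{cy + r}"
    return f'<polygon points="{points}" fill="black"/>'


def grid_to_svg(grid, cell=60):
    """Render the grid as an SVG document string, one shape per cell."""
    rows, cols = len(grid), len(grid[0])
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{cols * cell}" height="{rows * cell}">']
    for r, row in enumerate(grid):
        for c, shape in enumerate(row):
            cx, cy = c * cell + cell // 2, r * cell + cell // 2
            parts.append(shape_svg(shape, cx, cy, cell // 3))
    parts.append("</svg>")
    return "\n".join(parts)


def count_right_of(grid, row, col, target):
    """Ground truth: how many `target` shapes lie to the right of the cell at
    (row, col) in the same row. Positions are 1-indexed, as in the questions."""
    return sum(1 for shape in grid[row - 1][col:] if shape == target)
```

Because the image and the answer are derived from the same grid, the label is correct by construction, and regenerating with the same seed yields the same item. For example, `grid_to_svg(make_grid(3, 4, seed=0))` returns a string that can be written to a `.svg` file.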
<div style="display: grid; grid-template-columns: repeat(3, 1fr); gap: 15px; align-items: start;">
<img src="2D_DoYouSeeMe/visual_spatial/1.png" style="width: 100%; height: auto;" alt="Visual Spatial Example 1">
<img src="2D_DoYouSeeMe/visual_spatial/50.png" style="width: 100%; height: auto;" alt="Visual Spatial Example 2">
<img src="2D_DoYouSeeMe/visual_spatial/100.png" style="width: 100%; height: auto;" alt="Visual Spatial Example 3">
</div>

*Sample Question: Starting from the black circle at position (row 1, column 3), how many triangles are there bottom of it in the same row?*

### Visual Figure-Ground

Examines the ability to distinguish an object from its background. Challenges perception by varying contrast, noise, and complexity.

<div style="display: flex; justify-content: space-between;">
<img src="2D_DoYouSeeMe/visual_figure_ground/1.png" width="30%" alt="Figure-Ground Example 1">
<img src="2D_DoYouSeeMe/visual_figure_ground/50.png" width="30%" alt="Figure-Ground Example 2">
<img src="2D_DoYouSeeMe/visual_figure_ground/89.png" width="30%" alt="Figure-Ground Example 3">
</div>

*Sample Question: The figure consists of a Target image, which is embedded in some background noise. Out of the four given options, your task is to pick the option which has the same figure as the target image. Respond as follows: Option <your answer (choose between 1, 2, 3, or 4)>.*

### Visual Form Constancy

Assesses recognition of shapes despite changes in size, orientation, or context. Tests invariance in visual perception.

<div style="display: flex; justify-content: space-between;">
<img src="2D_DoYouSeeMe/visual_form_constancy/1.png" width="30%" alt="Form Constancy Example 1">
<img src="2D_DoYouSeeMe/visual_form_constancy/50.png" width="30%" alt="Form Constancy Example 2">
<img src="2D_DoYouSeeMe/visual_form_constancy/100.png" width="30%" alt="Form Constancy Example 3">
</div>

*Sample Question: The figure consists of a Target image. Out of the four given options, your task is to pick the option which has the same figure as the target image. Respond as follows: Option <your answer (choose between 1, 2, 3, or 4)>.*

### Shape Disambiguation

Challenges the ability to identify ambiguous shapes that can be interpreted in multiple ways. Explores perceptual flexibility.

<div style="display: flex; justify-content: space-between;">
<img src="2D_DoYouSeeMe/geometric_dataset/1.png" width="30%" alt="Shape Disambiguation Example 1">
<img src="2D_DoYouSeeMe/geometric_dataset/50.png" width="30%" alt="Shape Disambiguation Example 2">
<img src="2D_DoYouSeeMe/geometric_dataset/100.png" width="30%" alt="Shape Disambiguation Example 3">
</div>

*Sample Question: Count the total number of triangles in the image, including each concentric triangle separately. For example, if there is one triangle with 2 inner concentric rings, that counts as 3 triangles. Respond with only a number.*

### Shape Color Discrimination

Tests the ability to differentiate shapes based on color properties while controlling for other visual features.

<div style="display: flex; justify-content: space-between;">
<img src="2D_DoYouSeeMe/color_and_shape_disambiguation/1.png" width="30%" alt="Shape Color Example 1">
<img src="2D_DoYouSeeMe/color_and_shape_disambiguation/50.png" width="30%" alt="Shape Color Example 2">
<img src="2D_DoYouSeeMe/color_and_shape_disambiguation/89.png" width="30%" alt="Shape Color Example 3">
</div>

*Sample Question: Count the number of star's that are red.*

### Letter Disambiguation

Examines recognition of letters under various transformations and distortions. Evaluates robustness of character recognition.
<div style="display: grid; grid-template-columns: repeat(3, 1fr); gap: 15px; align-items: start;">
<img src="2D_DoYouSeeMe/letter_disambiguation/1.png" style="width: 100%; height: auto;" alt="Letter Disambiguation Example 1">
<img src="2D_DoYouSeeMe/letter_disambiguation/50.png" style="width: 100%; height: auto;" alt="Letter Disambiguation Example 2">
<img src="2D_DoYouSeeMe/letter_disambiguation/100.png" style="width: 100%; height: auto;" alt="Letter Disambiguation Example 3">
</div>

*Sample Question: The image shows one or more letters formed by a grid of small squares. What letter(s) can you identify in this image? Please respond with only the letter(s) you see.*

### Visual Closure

Tests the ability to recognize incomplete figures by mentally filling in missing information. Evaluates gestalt processing.

<div style="display: flex; justify-content: space-between;">
<img src="2D_DoYouSeeMe/visual_closure/1.png" width="30%" alt="Visual Closure Example 1">
<img src="2D_DoYouSeeMe/visual_closure/50.png" width="30%" alt="Visual Closure Example 2">
<img src="2D_DoYouSeeMe/visual_closure/100.png" width="30%" alt="Visual Closure Example 3">
</div>

*Sample Question: The figure consists of a target image which is complete, Out of the four given options (which are partially complete), your task is to pick the option which when completed matches the target image. Respond as follows: Option <your answer (choose between 1, 2, 3, or 4)>.*

## Citation

If you use this benchmark or dataset in your research, please cite our work as follows:

```
@misc{kanade2025multidimensionalbenchmarkevaluating,
  title={Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs},
  author={Aditya Kanade and Tanuja Ganu},
  year={2025},
  eprint={2506.02022},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.02022},
}
```

## Trademarks

This project may contain trademarks or logos for projects, products, or services.
Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

## License 📜

The **code** in this repository is licensed under the [MIT License](https://opensource.org/licenses/MIT).

The **dataset** is licensed under the [Community Data License Agreement - Permissive - Version 2.0 (CDLA-Permissive-2.0)](https://cdla.dev/permissive-2-0/).
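
## Scoring sketch

The sample questions in this README ask for replies in fixed formats (`Option <n>` for the multiple-choice tasks, a bare number for the counting tasks), which lends itself to simple automated scoring per difficulty level. The following is a minimal sketch under stated assumptions, not the benchmark's own evaluation code: the function names and the `(level, predicted, gold)` record layout are hypothetical, and counting an unparseable reply as incorrect is an assumed policy.

```python
import re
from collections import defaultdict


def parse_option(response):
    """Extract the choice from a reply such as 'Option 3' or 'Option <3>'.

    Returns an int in 1..4, or None if no valid option is found.
    """
    match = re.search(r"option\s*<?\s*([1-4])\b", response, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def parse_count(response):
    """Extract the first integer from a counting reply, or None if absent."""
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None


def accuracy_by_level(records):
    """Compute accuracy per difficulty level.

    `records` is an iterable of (level, predicted, gold) tuples, a
    hypothetical layout chosen for this sketch. A prediction of None
    (unparseable reply) is counted as incorrect.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for level, predicted, gold in records:
        totals[level] += 1
        hits[level] += int(predicted is not None and predicted == gold)
    return {level: hits[level] / totals[level] for level in sorted(totals)}
```

Reporting accuracy per level rather than as a single number keeps the graded nature of the benchmark visible: a model that is accurate on easy items and degrades on harder ones is distinguishable from one that is uniformly mediocre.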


Provider: maas

Created: 2025-07-22
