IMAGE_UNDERSTANDING

Name: IMAGE_UNDERSTANDING
Creator: maas
Published: 2025-09-01 16:39:10
License: No description available

ModelScope Community · Updated 2025-09-01 · Indexed 2025-07-26

Download link:

https://modelscope.cn/datasets/microsoft/IMAGE_UNDERSTANDING

Description:

A key question in understanding multimodal performance is whether a model has a basic or a detailed understanding of images. These capabilities are needed for models to be used in real-world tasks, such as acting as an assistant in the physical world. While there are many datasets for object detection and recognition, few test spatial reasoning and other more targeted tasks such as visual prompting. The datasets that do exist are static and publicly available, so there is concern that current AI models could have been trained on them, which makes evaluation with them unreliable.

We therefore created a dataset that is procedurally generated and synthetic, and that tests spatial reasoning, visual prompting, and object recognition and detection. The datasets are challenging for most AI models, and because they are procedurally generated, the benchmark can be regenerated ad infinitum to create new test sets. This counters the risk that models are trained on this data and that results reflect memorization.

This dataset has 4 sub-tasks: Object Recognition, Visual Prompting, Spatial Reasoning, and Object Detection. For each sub-task, the images consist of objects pasted onto random background images. The objects are from the COCO object list and are gathered from internet data. Each object is masked using the DeepLabV3 segmentation model and then pasted onto a random background from the Places365 dataset. The objects are pasted in one of four locations (top, left, bottom, or right), with small amounts of random rotation, positional jitter, and scale variation.

There are 2 conditions, "single" and "pairs", for images with one and two objects respectively. Each test set uses 20 sets of object classes (either 20 single objects or 20 pairs of objects), with four potential locations and four background classes, and we sample 4 instances of object and background. This results in 20 × 4 × 4 × 4 = 1280 images per condition and sub-task.
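The image count per condition and sub-task follows from taking the full cross product of the design factors. The sketch below enumerates that grid. The object-set and background names are placeholders, since the description does not list the actual 20 object sets or the four background classes, and the function name `enumerate_test_set` is hypothetical.

```python
from itertools import product

# Placeholder names: the source does not list the actual 20 object sets
# or the four background classes, so these are purely illustrative.
object_sets = [f"object_set_{i}" for i in range(20)]
locations = ["top", "left", "bottom", "right"]
backgrounds = [f"background_{i}" for i in range(4)]
instances = range(4)  # sampled object/background instances per combination


def enumerate_test_set():
    """Return one (object_set, location, background, instance) tuple per image."""
    return list(product(object_sets, locations, backgrounds, instances))


grid = enumerate_test_set()
print(len(grid))  # 20 * 4 * 4 * 4 = 1280
```

Regenerating a fresh test set would, under this reading, amount to re-sampling the instances and the random rotation, jitter, and scale for each cell of the same grid.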
__Object Detection__

Answer type: Open-ended

Example for "single":

{"images": ["val\\banana\\left\\fire_station\\0000075_Places365_val_00030609.jpg"], "prompt": "You are an object detection model that aims to detect all the objects in the image.\n\nDefinition of Bounding Box Coordinates:\n\nThe bounding box coordinates (a, b, c, d) represent the normalized positions of the object within the image:\n\na: The x-coordinate of the top-left corner of the bounding box, expressed as a percentage of the image width. It indicates the position from the left side of the image to the object's left boundary. The a ranges from 0.00 to 1.00 with precision of 0.01.\nb: The y-coordinate of the top-left corner of the bounding box, expressed as a percentage of the image height. It indicates the position from the top of the image to the object's top boundary. The b ranges from 0.00 to 1.00 with precision of 0.01.\nc: The x-coordinate of the bottom-right corner of the bounding box, expressed as a percentage of the image width. It indicates the position from the left side of the image to the object's right boundary. The c ranges from 0.00 to 1.00 with precision of 0.01.\nd: The y-coordinate of the bottom-right corner of the bounding box, expressed as a percentage of the image height. It indicates the position from the top of the image to the object's bottom boundary. The d ranges from 0.00 to 1.00 with precision of 0.01.\n\nThe top-left of the image has coordinates (0.00, 0.00). The bottom-right of the image has coordinates (1.00, 1.00).\n\nInstructions:\n1. Specify any particular regions of interest within the image that should be prioritized during object detection.\n2. For all the specified regions that contain the objects, generate the object's category type, bounding box coordinates, and your confidence for the prediction. The bounding box coordinates (a, b, c, d) should be as precise as possible. Do not only output rough coordinates such as (0.1, 0.2, 0.3, 0.4).\n3. If there are more than one object of the same category, output all of them.\n4. Please ensure that the bounding box coordinates are not examples. They should really reflect the position of the objects in the image.\n5.\nReport your results in this output format:\n(a, b, c, d) - category for object 1 - confidence\n(a, b, c, d) - category for object 2 - confidence\n...\n(a, b, c, d) - category for object n - confidence."}

Example for "pairs":

{"images": ["val\\hair drier_broccoli\\left\\church-indoor\\0000030_0000059_Places365_val_00000401.jpg"], "prompt": "You are an object detection model that aims to detect all the objects in the image.\n\nDefinition of Bounding Box Coordinates:\n\nThe bounding box coordinates (a, b, c, d) represent the normalized positions of the object within the image:\n\na: The x-coordinate of the top-left corner of the bounding box, expressed as a percentage of the image width. It indicates the position from the left side of the image to the object's left boundary. The a ranges from 0.00 to 1.00 with precision of 0.01.\nb: The y-coordinate of the top-left corner of the bounding box, expressed as a percentage of the image height. It indicates the position from the top of the image to the object's top boundary. The b ranges from 0.00 to 1.00 with precision of 0.01.\nc: The x-coordinate of the bottom-right corner of the bounding box, expressed as a percentage of the image width. It indicates the position from the left side of the image to the object's right boundary. The c ranges from 0.00 to 1.00 with precision of 0.01.\nd: The y-coordinate of the bottom-right corner of the bounding box, expressed as a percentage of the image height. It indicates the position from the top of the image to the object's bottom boundary. The d ranges from 0.00 to 1.00 with precision of 0.01.\n\nThe top-left of the image has coordinates (0.00, 0.00). The bottom-right of the image has coordinates (1.00, 1.00).\n\nInstructions:\n1. Specify any particular regions of interest within the image that should be prioritized during object detection.\n2. For all the specified regions that contain the objects, generate the object's category type, bounding box coordinates, and your confidence for the prediction. The bounding box coordinates (a, b, c, d) should be as precise as possible. Do not only output rough coordinates such as (0.1, 0.2, 0.3, 0.4).\n3. If there are more than one object of the same category, output all of them.\n4. Please ensure that the bounding box coordinates are not examples. They should really reflect the position of the objects in the image.\n5.\nReport your results in this output format:\n(a, b, c, d) - category for object 1 - confidence\n(a, b, c, d) - category for object 2 - confidence\n...\n(a, b, c, d) - category for object n - confidence."}

__Object Recognition__

Answer type: Open-ended

Example for "single":

{"images": ["val\\potted plant\\left\\ruin\\0000097_Places365_val_00018147.jpg"], "prompt": "What objects are in this image?", "ground_truth": "potted plant"}

Example for "pairs":

{"images": ["val\\bottle_keyboard\\left\\ruin\\0000087_0000069_Places365_val_00035062.jpg"], "prompt": "What objects are in this image?", "ground_truth": "['bottle', 'keyboard']"}

__Spatial Reasoning__

Answer type: Multiple Choice

Example for "single":

{"images": ["val\\potted plant\\left\\ruin\\0000097_Places365_val_00018147.jpg"], "query_text": "Is the potted plant on the right, top, left, or bottom of the image?\nAnswer with one of (right, bottom, top, or left) only.", "target_text": "left"}

Example for "pairs":

{"images": ["val\\bottle_keyboard\\left\\ruin\\0000087_0000069_Places365_val_00035062.jpg"], "query_text": "Is the bottle above, below, right, or left of the keyboard in the image?\nAnswer with one of (below, right, left, or above) only.", "target_text": "left"}

What are the evaluation disaggregation pivots/attributes to run metrics for?

Disaggregation by (group by):

"single": (left, right, top, bottom)
"pairs": (left, right, above, below)

__Visual Prompting__

Answer type: Open-ended

Example for "single":

{"images": ["val\\potted plant\\left\\ruin\\0000097_Places365_val_00018147.jpg"], "prompt": "What objects are in this image?", "ground_truth": "potted plant"}

Example for "pairs":

{"images": ["val\\sheep_banana\\left\\landfill\\0000099_0000001_Places365_val_00031238.jpg"], "prompt": "What objects are in the red and yellow box in this image?", "ground_truth": "['sheep', 'banana']"}
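The Object Detection prompt asks for one detection per line in the form `(a, b, c, d) - category - confidence`. A minimal sketch of how such a response might be parsed is shown here. The names `Detection` and `parse_detections` are hypothetical and not part of the dataset, and the sketch assumes the confidence is written as a plain decimal number, which the description does not specify.

```python
import re
from typing import NamedTuple


class Detection(NamedTuple):
    box: tuple  # (a, b, c, d), normalized to the range 0.00 to 1.00
    category: str
    confidence: float


# One line of the requested format: "(a, b, c, d) - category - confidence".
# Assumption: the confidence is a plain decimal number such as 0.91.
LINE_RE = re.compile(
    r"^\(\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\)"
    r"\s*-\s*(.+?)\s*-\s*([\d.]+)\s*$"
)


def parse_detections(response: str) -> list:
    """Parse a model response, skipping lines that do not match the format."""
    detections = []
    for line in response.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        try:
            a, b, c, d = (float(match.group(i)) for i in range(1, 5))
            confidence = float(match.group(6))
        except ValueError:
            continue  # malformed number such as "1.2.3"
        # Keep only boxes whose top-left corner precedes the bottom-right
        # corner and whose coordinates stay inside the image.
        if not (0.0 <= a < c <= 1.0 and 0.0 <= b < d <= 1.0):
            continue
        detections.append(Detection((a, b, c, d), match.group(5), confidence))
    return detections


example = "(0.05, 0.31, 0.42, 0.68) - hair drier - 0.91\nnot a detection line"
print(parse_detections(example))
```

Skipping malformed lines rather than raising is a design choice for this sketch: open-ended model responses often include extra commentary around the requested lines.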


Provider:

maas

Created:

2025-07-22
