phiyodr/InpaintCOCO

Name: phiyodr/InpaintCOCO
Creator: phiyodr
Published: 2024-04-30 08:09:12
License: No description provided

Hugging Face | Updated 2024-04-30 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/phiyodr/InpaintCOCO


Resource overview:

---
pretty_name: InpaintCOCO
language:
- en
size_categories:
- 1K<n<10K
task_categories:
- image-to-text
- text-to-image
- image-classification
task_ids:
- image-captioning
tags:
- coco
- image-captioning
- inpainting
- multimodel-understanding
dataset_info:
  features:
  - name: concept
    dtype: string
  - name: coco_caption
    dtype: string
  - name: coco_image
    dtype: image
  - name: inpaint_caption
    dtype: string
  - name: inpaint_image
    dtype: image
  - name: mask
    dtype: image
  - name: worker
    dtype: string
  - name: coco_details
    struct:
    - name: captions
      sequence: string
    - name: coco_url
      dtype: string
    - name: date_captured
      dtype: string
    - name: flickr_url
      dtype: string
    - name: height
      dtype: int64
    - name: id
      dtype: int64
    - name: image_license
      dtype: string
    - name: text_license
      dtype: string
    - name: width
      dtype: int64
  - name: inpaint_details
    struct:
    - name: duration
      dtype: int64
    - name: guidance_scale
      dtype: float64
    - name: num_inference_steps
      dtype: int64
    - name: prompt
      dtype: string
    - name: prompts_used
      dtype: int64
    - name: quality
      dtype: string
  - name: mask_details
    struct:
    - name: height_factor
      dtype: int64
    - name: prompt
      dtype: string
    - name: prompts_used
      dtype: int64
    - name: width_factor
      dtype: int64
  splits:
  - name: test
    num_bytes: 1062104623.5
    num_examples: 1260
  download_size: 1055968442
  dataset_size: 1062104623.5
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---

# InpaintCOCO - Fine-grained multimodal concept understanding (for color, size, and COCO objects)

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

A data sample contains 2 images and 2 corresponding captions that differ only in one object, the color of an object, or the size of an object.

> Many multimodal tasks, such as Vision-Language Retrieval and Visual Question Answering, present results in terms of overall performance.
> Unfortunately, this approach overlooks more nuanced concepts, leaving us unaware of which specific concepts contribute to the success of current models and which are ignored.
> In response to this limitation, more recent benchmarks attempt to assess particular aspects of vision-language models.
> Some existing datasets focus on linguistic concepts utilizing one image paired with multiple captions; others adopt a visual or cross-modal perspective.
> In this study, we are particularly interested in fine-grained visual concept understanding, which we believe is not covered in existing benchmarks in sufficient isolation.
> Therefore, we create the InpaintCOCO dataset which consists of image pairs with minimum differences that lead to changes in the captions.

Download the dataset:

```python
from datasets import load_dataset

# Repository ID written with the casing used on the Hub.
dataset = load_dataset("phiyodr/InpaintCOCO")
```

### Supported Tasks and Leaderboards

InpaintCOCO is a benchmark for understanding fine-grained concepts in multimodal (vision-language) models, similar to [Winoground](https://huggingface.co/datasets/facebook/winoground). To our knowledge, InpaintCOCO is the first benchmark that consists of image pairs with minimal differences, so that the *visual* representation can be analyzed in a more standardized setting.

### Languages

All texts are in English.

## Dataset Structure

```python
DatasetDict({
    test: Dataset({
        features: ['concept', 'coco_caption', 'coco_image', 'inpaint_caption', 'inpaint_image', 'mask', 'worker', 'coco_details', 'inpaint_details', 'mask_details'],
        num_rows: 1260
    })
})
```

### Data Instances

An example looks as follows:

```python
{'concept': 'object',
 'coco_caption': 'A closeup of a large stop sign in the bushes.',
 'coco_image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512>,
 'inpaint_caption': 'A wooden bench in the bushes.',
 'inpaint_image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512>,
 'mask': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512>,
 'worker': 'k',
 'coco_details': {'captions': ['A stop sign is shown among foliage and grass.',
                               'A close up of a Stop sign near woods. ',
                               'A closeup of a large stop sign in the bushes.',
                               'A large oval Stop sign near some trees.',
                               'a close up of a stop sign with trees in the background'],
                  'coco_url': 'http://images.cocodataset.org/val2017/000000252332.jpg',
                  'date_captured': '2013-11-17 08:29:48',
                  'flickr_url': 'http://farm6.staticflickr.com/5261/5836914735_bef9249442_z.jpg',
                  'height': 480,
                  'id': 252332,
                  'image_license': 'https://creativecommons.org/licenses/by/2.0/',
                  'text_license': 'https://creativecommons.org/licenses/by/4.0/legalcode',
                  'width': 640},
 'inpaint_details': {'duration': 18,
                     'guidance_scale': 7.5,
                     'num_inference_steps': 100,
                     'prompt': 'wooden bench',
                     'prompts_used': 2,
                     'quality': 'very good'},
 'mask_details': {'height_factor': 25,
                  'prompt': 'stop sign',
                  'prompts_used': 1,
                  'width_factor': 25}}
```

## Dataset Creation

> The challenge set was created by undergraduate student workers. They were provided with an interactive Python environment with which they interacted via various prompts and inputs.
> The annotation proceeds as follows: The annotators are provided with an image and decide if the image is suitable for editing. If yes, they input the prompt for the object that should be replaced. Using the open vocabulary segmentation model [CLIPSeg](https://huggingface.co/CIDAS/clipseg-rd64-refined) ([Lüddecke and Ecker, 2022](https://openaccess.thecvf.com/content/CVPR2022/html/Luddecke_Image_Segmentation_Using_Text_and_Image_Prompts_CVPR_2022_paper.html)) we obtain a mask for our object of interest (e.g., "fire hydrant"). Then, the annotator inputs a prompt for [Stable Diffusion v2 Inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) ([Rombach et al., 2022](https://ommer-lab.com/research/latent-diffusion-models/)) (e.g. with the prompt "yellow fire hydrant"), which shows three candidate images. The annotators can try new prompts or skip the current image if the result is insufficient. Finally, the annotator enters a new caption that matches the edited image.

### Source Data

InpaintCOCO is based on MS COCO 2017 validation set ([image](http://images.cocodataset.org/zips/val2017.zip), [annotations](http://images.cocodataset.org/annotations/annotations_trainval2014.zip)).

```
@misc{lin2015microsoft,
  title={Microsoft COCO: Common Objects in Context},
  author={Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and Piotr Dollár},
  year={2015},
  eprint={1405.0312},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

## Limitations

> The images in the COCO dataset come from Flickr from 2014; therefore, they reflect the Flickr user structure at that time, i.e., the images mostly show the Western world and/or other countries from the Western perspective. The captions are in English. Thus, the model we developed does not generalize well beyond the Western world.

## Licensing Information

* Images come with individual licenses (`image_license`) based on their Flickr source. The possible licenses are
  * [CC BY-NC-SA 2.0 Deed](https://creativecommons.org/licenses/by-nc-sa/2.0/),
  * [CC BY-NC 2.0 Deed](https://creativecommons.org/licenses/by-nc/2.0/),
  * [CC BY 2.0 Deed](https://creativecommons.org/licenses/by/2.0/), and
  * [CC BY-SA 2.0 Deed](https://creativecommons.org/licenses/by-sa/2.0/).
* The remaining work comes with the [CC BY 4.0 Legal Code](https://creativecommons.org/licenses/by/4.0/legalcode) license.

## Citation Information

Our InpaintCOCO dataset:

```
@misc{roesch2024enhancing,
  title={Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples},
  author={Philipp J. Rösch and Norbert Oswald and Michaela Geierhos and Jindřich Libovický},
  year={2024},
  eprint={2403.02875},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

For the MS COCO dataset please see above.

Provided by:

phiyodr

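Summary of original information

InpaintCOCO dataset overview

Dataset summary

Each InpaintCOCO sample contains two images and two corresponding captions that differ only in one object, the color of an object, or the size of an object. The dataset is intended for evaluating how well multimodal (vision-language) models understand fine-grained visual concepts.

Supported tasks and leaderboards

InpaintCOCO is a benchmark for understanding fine-grained concepts in multimodal models, similar to Winoground. According to its authors, it is the first benchmark made up of image pairs with minimal differences, so that the visual representation can be analyzed in a more standardized setting.

Languages

All texts are in English.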

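Dataset structure

Data instances

The dataset contains the following features:

concept: the concept that was changed; string.
coco_caption: caption of the COCO image; string.
coco_image: the COCO image; image.
inpaint_caption: caption of the edited image; string.
inpaint_image: the edited image; image.
mask: the mask; image.
worker: the annotator; string.
coco_details: details of the COCO image, with fields such as captions, coco_url, and date_captured.
inpaint_details: details of the edited image, with fields such as duration, guidance_scale, and num_inference_steps.
mask_details: details of the mask, with fields such as height_factor, prompt, and prompts_used.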
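
As a rough illustration of the feature list above, the following sketch checks whether a record carries all ten top-level fields. The helper `missing_features` and the sample record are hypothetical; only the field names come from the dataset card.

```python
# Top-level feature names as listed in the dataset card.
EXPECTED_FEATURES = [
    "concept", "coco_caption", "coco_image", "inpaint_caption",
    "inpaint_image", "mask", "worker",
    "coco_details", "inpaint_details", "mask_details",
]


def missing_features(record: dict) -> list[str]:
    """Return the expected feature names that are absent from `record`."""
    return [name for name in EXPECTED_FEATURES if name not in record]


# Hypothetical, partially filled record used only for illustration.
sample = {
    "concept": "object",
    "coco_caption": "A closeup of a large stop sign in the bushes.",
}
print(missing_features(sample))
```

A check like this can be useful before writing evaluation code that assumes nested fields such as `coco_details` are present.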

Data splits

The dataset has a single test split with 1,260 examples and a total size of 1062104623.5 bytes (about 1.06 GB).
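
To put the split size in perspective, the short calculation below derives the average storage per example from the two figures in the card's metadata. Each example holds three images (original, edited, and mask), so the result is a per-record average, not a per-image one.

```python
# Figures taken from the dataset card's `splits` metadata.
NUM_BYTES = 1062104623.5
NUM_EXAMPLES = 1260

bytes_per_example = NUM_BYTES / NUM_EXAMPLES
mib_per_example = bytes_per_example / (1024 ** 2)

print(f"{bytes_per_example:.0f} bytes per example (~{mib_per_example:.2f} MiB)")
```

This works out to roughly 0.8 MiB per example.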

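Dataset creation

The dataset was created by undergraduate student annotators working in an interactive Python environment. For each image, the annotator decides whether it is suitable for editing, enters a prompt for the object to replace, obtains an object mask from the CLIPSeg model, enters a prompt for the Stable Diffusion v2 Inpainting model to generate candidate images, and finally writes a new caption that matches the edited image.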
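
The workflow above can be sketched as a single function with the models passed in as callables. This is not the authors' tool: `annotate`, `EditedSample`, and the `segment`, `inpaint`, and `choose` callables are hypothetical names, and the stubs in the usage part stand in for CLIPSeg, Stable Diffusion v2 Inpainting, and the human annotator.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class EditedSample:
    """One finished annotation: the chosen inpainted image and its new caption."""
    mask_prompt: str
    inpaint_prompt: str
    inpaint_image: Any
    inpaint_caption: str


def annotate(
    image: Any,
    mask_prompt: str,
    inpaint_prompt: str,
    new_caption: str,
    segment: Callable[[Any, str], Any],
    inpaint: Callable[[Any, Any, str], Sequence[Any]],
    choose: Callable[[Sequence[Any]], Optional[int]],
) -> Optional[EditedSample]:
    """Run one pass of the workflow; return None if the annotator skips."""
    mask = segment(image, mask_prompt)                 # stand-in for a segmentation model
    candidates = inpaint(image, mask, inpaint_prompt)  # stand-in for an inpainting model
    picked = choose(candidates)                        # annotator's choice, or None to skip
    if picked is None:
        return None
    return EditedSample(mask_prompt, inpaint_prompt, candidates[picked], new_caption)


# Usage with stub callables (no real models involved; the caption is made up).
result = annotate(
    image="img",
    mask_prompt="fire hydrant",
    inpaint_prompt="yellow fire hydrant",
    new_caption="A yellow fire hydrant on a sidewalk.",
    segment=lambda img, prompt: f"mask({prompt})",
    inpaint=lambda img, mask, prompt: [f"{prompt}#{i}" for i in range(3)],
    choose=lambda candidates: 1,
)
print(result)
```

Passing the models as parameters keeps the control flow testable without downloading any weights; retrying with a new prompt would simply be another call to `annotate`.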

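Source data

InpaintCOCO is based on the MS COCO 2017 validation set.

Limitations

The COCO images come from Flickr in 2014 and mostly depict the Western world. Models developed on this data may therefore not generalize well beyond it.

Licensing information

Images carry individual licenses (image_license) based on their Flickr source.
The remaining work is licensed under CC BY 4.0.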
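
Because licenses vary per image, it can help to tally them before reusing the data. The sketch below assumes records shaped like the card's example, where `image_license` sits inside `coco_details`; the helper name `count_image_licenses` and the three sample records are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_image_licenses(records: Iterable[Mapping]) -> Counter:
    """Count records per `coco_details.image_license` URL."""
    return Counter(r["coco_details"]["image_license"] for r in records)


# Hypothetical records carrying only the field this sketch needs.
records = [
    {"coco_details": {"image_license": "https://creativecommons.org/licenses/by/2.0/"}},
    {"coco_details": {"image_license": "https://creativecommons.org/licenses/by-nc-sa/2.0/"}},
    {"coco_details": {"image_license": "https://creativecommons.org/licenses/by/2.0/"}},
]
licenses = count_image_licenses(records)

# Non-commercial licenses contain "-nc" in their URL path.
non_commercial = sum(n for url, n in licenses.items() if "-nc" in url)
print(licenses.most_common(), non_commercial)
```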

Citation information

The InpaintCOCO dataset is cited as follows:

@misc{roesch2024enhancing, title={Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples}, author={Philipp J. Rösch and Norbert Oswald and Michaela Geierhos and Jindřich Libovický}, year={2024}, eprint={2403.02875}, archivePrefix={arXiv}, primaryClass={cs.CV} }

The MS COCO dataset is cited as follows:

@misc{lin2015microsoft, title={Microsoft COCO: Common Objects in Context}, author={Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and Piotr Dollár}, year={2015}, eprint={1405.0312}, archivePrefix={arXiv}, primaryClass={cs.CV} }

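Collected summary

Dataset introduction

Construction

At the intersection of computer vision and natural language processing, there is a growing need to evaluate how well multimodal models understand fine-grained concepts. InpaintCOCO is built on the MS COCO 2017 validation set through a combination of human annotation and generative models. Working in an interactive Python environment, annotators first select images suitable for editing and decide which object to replace. The open-vocabulary segmentation model CLIPSeg then produces a mask for that object, and the Stable Diffusion v2 Inpainting model generates candidate images with the object replaced according to a text prompt. Finally, the annotator writes a new caption matching the edited image, yielding a paired sample of original image, edited image, their captions, and the mask.

Usage

In vision-language model evaluation, InpaintCOCO can serve as a benchmark for measuring how well a model perceives and understands fine-grained visual concepts. Researchers can load it directly with the Hugging Face `datasets` library and use its image-text pairs to train or evaluate models. Typical uses include fine-grained performance analysis for vision-language retrieval, visual question answering, and image captioning. Users should keep the dataset's limitations in mind: its images mostly reflect a Western perspective and its captions are all in English, which may affect how well models generalize.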
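
A minimal sketch of that usage follows. `load_dataset` with a `split` argument is part of the `datasets` library; the helpers `caption_pairs` and `load_test_split` are hypothetical names introduced here, and the offline example reuses the record shown in the dataset card.

```python
from typing import Iterable, Iterator, Mapping, Tuple


def caption_pairs(records: Iterable[Mapping]) -> Iterator[Tuple[str, str, str]]:
    """Yield (concept, original caption, edited caption) for each record."""
    for r in records:
        yield r["concept"], r["coco_caption"], r["inpaint_caption"]


def load_test_split():
    """Download the `test` split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that `caption_pairs` has no third-party dependency.
    from datasets import load_dataset
    return load_dataset("phiyodr/InpaintCOCO", split="test")


# Offline illustration with the example record from the dataset card.
example = [{
    "concept": "object",
    "coco_caption": "A closeup of a large stop sign in the bushes.",
    "inpaint_caption": "A wooden bench in the bushes.",
}]
pairs = list(caption_pairs(example))
print(pairs)
```

With network access, `caption_pairs(load_test_split())` would iterate over all 1,260 examples in the same way.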

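Background and challenges

Background

InpaintCOCO was created in 2024 by Philipp J. Rösch and colleagues to advance research on fine-grained visual concept understanding in multimodal models. Built on the MS COCO 2017 validation set, it uses image inpainting to generate image pairs with minimal differences, each with matching captions, and targets three concepts: objects, colors, and sizes. The core problem it addresses is that overall performance metrics in existing multimodal evaluations overlook such nuanced concepts. The dataset offers a finer-grained benchmark for tasks such as vision-language retrieval and visual question answering and is intended to support progress in multimodal AI.

Current challenges

InpaintCOCO targets the challenge of fine-grained visual concept understanding in multimodal models: a model must distinguish small visual differences within an image pair, such as a replaced object or a changed attribute, which places high demands on perceptual precision. Building the dataset raised two challenges of its own. First, producing high-quality samples depends on both human annotation and image inpainting, so annotators' subjective judgment has to be reconciled with the variable output of the generative model. Second, the MS COCO source data carries geographic and cultural bias, mostly reflecting a Western perspective, which may limit how well models generalize and should be taken into account in follow-up work.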

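Common scenarios

Classic use case

In vision-language model evaluation, InpaintCOCO provides a standardized test bed for fine-grained concept understanding. Image inpainting is used to generate image pairs that differ in a single attribute, such as a replaced object, a changed color, or an adjusted size, and each pair comes with correspondingly modified captions. This design lets researchers measure a model's sensitivity to visual detail. In cross-modal retrieval and visual question answering in particular, the model has to tell subtle visual differences apart to match the correct text, which reveals how well it actually understands basic concepts such as color, size, and object category.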
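
One way to score such a pair is to borrow the text, image, and group scores used by Winoground, the benchmark the dataset card compares itself to. The card does not prescribe this metric, so treat the sketch below as an assumption. The function `pair_scores` is a hypothetical name, and the similarity values are made up; in practice they would come from a model such as a CLIP-style encoder.

```python
from typing import Sequence


def pair_scores(sim: Sequence[Sequence[float]]) -> dict:
    """Score one sample from a 2x2 similarity matrix.

    sim[i][j] is the similarity of caption i to image j, where index 0 is
    the original COCO caption/image and index 1 the inpainted one.
    """
    # Text score: for each image, the matching caption must score highest.
    text_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Image score: for each caption, the matching image must score highest.
    image_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


# Made-up similarities: captions are matched correctly, images are not.
scores = pair_scores([[0.50, 0.30],
                      [0.45, 0.40]])
print(scores)
```

Averaging each of the three flags over all samples would give dataset-level text, image, and group accuracy.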

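Academic problems addressed

The dataset addresses a known weakness of current multimodal evaluation: overall performance metrics hide shortcomings in fine-grained concept understanding. Conventional evaluations often ignore how well a model captures specific visual concepts, so the concrete reasons for its success or failure remain unclear. By building image pairs with minimal differences, InpaintCOCO focuses evaluation on isolated concepts (color, size, and object category) and lets researchers quantify the precision of a model's visual representations. According to the dataset authors, existing benchmarks do not cover visual concepts in sufficient isolation, and this dataset aims to fill that gap and support more detailed, more interpretable multimodal evaluation.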
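
Evaluating concepts in isolation amounts to grouping per-sample outcomes by the `concept` field. The sketch below shows that aggregation; `accuracy_by_concept` is a hypothetical helper, the outcomes are invented, and the labels "color" and "size" are assumed values (the card's example only shows "object").

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_concept(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (concept, correct) outcomes into accuracy per concept."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for concept, correct in results:
        totals[concept] += 1
        hits[concept] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}


# Invented outcomes for illustration only.
outcomes = [("object", True), ("object", False), ("color", True), ("size", False)]
per_concept = accuracy_by_concept(outcomes)
print(per_concept)
```

Reporting the per-concept numbers next to the overall score makes it visible which concept a model handles poorly.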

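Practical applications

In practice, InpaintCOCO can serve as a training and validation resource for making vision-language systems more reliable and accurate. In content generation, it could be used to tune image inpainting and editing models so that generated details stay consistent with the text instruction. In assistive technology, such as visual description systems, it could help train models to capture object attributes more precisely and give visually impaired users more detailed scene descriptions. In the visual perception modules of autonomous driving systems, learning to distinguish subtle differences between similar objects with this dataset could make recognition of key information such as traffic signs and pedestrian attributes more robust.

Recent research on the dataset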