daintlab/reguide

Name: daintlab/reguide
Creator: daintlab
Published: 2025-03-04 05:09:53
License: No description available

Source: Hugging Face. Updated 2025-03-04. Indexed 2025-11-29.

Download link:

https://hf-mirror.com/datasets/daintlab/reguide


Description:

# Reflexive Guidance (ReGuide)
[![arXiv](https://img.shields.io/badge/arXiv-2410.14975-FF9999.svg)](https://arxiv.org/abs/2410.14975) [![OpenReview](https://img.shields.io/badge/OpenReview-ReGuide-6699FF.svg)](https://openreview.net/forum?id=R4h5PXzUuU&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DICLR.cc%2F2025%2FConference%2FAuthors%23your-submissions))

Official repository for the ICLR 2025 paper "Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation"

This repository provides

1) image lists sampled from the standard OoD setting benchmarks in [OpenOOD v1.5](https://github.com/Jingkang50/OpenOOD) for the experiments in our paper
- Specifically, we sampled the CIFAR10 and ImageNet200 benchmarks as:

|                | CIFAR10 | ImageNet200 |
|:--------------:|:-------:|:-----------:|
| Sampling ratio |   25%   |   25%, 5%   |

ensuring that the proportion of datasets in each benchmark is maintained.

2) prompt-response pairs obtained from the main experiments

|          | CIFAR10  | ImageNet200 |
|:--------:|:--------:|:-----------:|
| Baseline |   25%    |   25%, 5%   |
| ReGuide  |   25%    |     5%      |

We hope that the image lists and prompt-response pairs in this repository can be leveraged to support future research and facilitate thorough comparisons.

## Dataset & Response
The overall structure of this repository is as follows, with the results for each sample located under the model directory.
```sh
dataset
├─ cifar10
│  └─ subset_25%.jsonl
└─ imagenet200
response
├─ baseline
│  ├─ cifar10-25%
│  │  ├─ glm
│  │  │  ...
│  │  └─ qwen
│  ├─ imagenet200-25%
│  └─ imagenet200-5%
└─ reguide
   └─ imagenet200-5%
      ├─ stage1
      ├─ stage2
      └─ filtering
```

### Preliminary
Our dataset JSONL files are reorganized from the benchmarks provided by [OpenOOD](https://github.com/Jingkang50/OpenOOD). You can prepare the whole OpenOOD image lists by following the steps below.
First, create the required data directory structure by running the following command:
```sh
mkdir data
```
Then, you can download the dataset using the data download script provided by [OpenOOD](https://github.com/Jingkang50/OpenOOD). After downloading, please ensure that the `images_classic` and `images_largescale` directories are placed inside the `./data` directory. The directory structure should look like this:
```sh
data
├─ images_classic
│  ├─ cifar10
│  ├─ cifar100
│  └─ ...
└─ images_largescale
```
Each `image_id` in our dataset JSONL files is the actual path of an image in this OpenOOD directory, for example, `./data/images_classic/cifar10/test/airplane/0001.png`.

### Dataset
For the **list of images**, each JSONL file we provide is structured as follows:
- Baseline
```json
{
  "dataset": {
    "label": [
      "image_id1",
      "image_id2",
      ...
    ]
  }
}
```

### Response
For the **prompt-response pairs**, each JSONL file we provide is structured as follows for the **baseline** and **ReGuide** experiments:
- Baseline
```json
{
  "prompt": {
    "image_id": "response"
  }
}
```
- ReGuide
```json
{
  "image_id": {
    "prompt": "response"
  }
}
```
The `image_id` field in the JSONL files corresponds to the actual file paths of the image files as mentioned above. If you followed the preliminary steps above, the `image_id` values will match their actual locations, so you can use them directly.

## Overview
### Abstract
With the recent emergence of foundation models trained on internet-scale data and demonstrating remarkable generalization capabilities, such foundation models have become more widely adopted, leading to an expanding range of application domains. Despite this rapid proliferation, the trustworthiness of foundation models remains underexplored. Specifically, the out-of-distribution detection (OoDD) capabilities of large vision-language models (LVLMs), such as GPT-4o, which are trained on massive multi-modal data, have not been sufficiently addressed.
The disparity between their demonstrated potential and practical reliability raises concerns regarding the safe and trustworthy deployment of foundation models. To address this gap, we evaluate and analyze the OoDD capabilities of various proprietary and open-source LVLMs. Our investigation contributes to a better understanding of how these foundation models represent confidence scores through their generated natural language responses. Furthermore, we propose a self-guided prompting approach, termed Reflexive Guidance (ReGuide), aimed at enhancing the OoDD capability of LVLMs by leveraging self-generated image-adaptive concept suggestions. Experimental results demonstrate that our ReGuide enhances the performance of current LVLMs in both image classification and OoDD tasks.

### OoD Detection for LVLMs
<img src="./assets/overview.png">

Given the vast amount and broad domain coverage of data used to train LVLMs, we frame the OoDD problem for LVLMs based on the zero-shot OoDD scenario defined for CLIP. Our prompt consists of four components: a task description, an explanation of the rejection class, guidelines, and examples for the response format.

### ReGuide Framework
<img src="./assets/reguide-framework.png">

We introduce a simple and model-agnostic prompting strategy, Reflexive Guidance (ReGuide), to enhance the OoD detectability of LVLMs. The LVLM's strong generalization ability has been demonstrated through its performance across various downstream tasks. Therefore, we leverage the LVLM itself to obtain guidance for OoDD from its powerful zero-shot visual recognition capabilities. ReGuide is implemented in a two-stage process: Stage 1, image-adaptive class suggestions, and Stage 2, OoDD with suggested classes.
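
The two-stage flow and the four-component prompt described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the prompts or code used in the paper: the `LVLM` callable interface, the prompt wording, the `REJECTION_CLASS` label, the de-duplication step, and the choice to append suggested classes to the candidate list in Stage 2 are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (image_path, prompt) -> text reply from an LVLM.
LVLM = Callable[[str, str], str]

# Hypothetical name for the rejection class; the real prompts may word it differently.
REJECTION_CLASS = "none of these classes"


def build_oodd_prompt(class_names: List[str]) -> str:
    """Join the four prompt components named in the text; all wording is illustrative."""
    task = "Classify the image into one of: " + ", ".join(class_names) + "."
    rejection = f"If no class fits, answer '{REJECTION_CLASS}'."
    guidelines = "Report a confidence between 0 and 100 for your answer."
    response_format = "Example response: <class> | <confidence>"
    return "\n".join([task, rejection, guidelines, response_format])


def reguide(image_path: str, id_classes: List[str], lvlm: LVLM,
            n_suggestions: int = 3) -> Dict[str, object]:
    """Run a two-stage query: class suggestions first, then OoDD with them."""
    # Stage 1: ask the model itself for image-adaptive class suggestions.
    stage1_prompt = (
        f"Suggest {n_suggestions} class names that could describe this image, "
        "one per line."
    )
    reply = lvlm(image_path, stage1_prompt)

    # Assumption: drop empty lines and suggestions that repeat a known class.
    seen = {name.lower() for name in id_classes}
    suggested: List[str] = []
    for line in reply.splitlines():
        name = line.strip()
        if name and name.lower() not in seen:
            suggested.append(name)
            seen.add(name.lower())

    # Stage 2: repeat the OoDD query with the suggested classes added
    # (an assumption about how the suggestions are used).
    stage2_prompt = build_oodd_prompt(list(id_classes) + suggested)
    response = lvlm(image_path, stage2_prompt)
    return {"suggested": suggested, "prompt": stage2_prompt, "response": response}


if __name__ == "__main__":
    # Stand-in model so the sketch runs without any API access.
    def fake_lvlm(image_path: str, prompt: str) -> str:
        if prompt.startswith("Suggest"):
            return "zebra\ncat\nhorse"
        return "zebra | 90"

    result = reguide("./data/example.png", ["cat", "dog"], fake_lvlm)
    print(result["suggested"])
    print(result["response"])
```

Passing the model as a plain callable keeps the sketch model-agnostic, in the spirit of the description above; any proprietary or open-source LVLM client could be wrapped to match that signature.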
## Citation
```
@inproceedings{kim2025reflexive,
  title={Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation},
  author={Jihyo Kim and Seulbi Lee and Sangheum Hwang},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```
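
As a convenience, the JSONL layouts described in the Dataset and Response sections could be read with a small helper like the one below. It is a minimal sketch under stated assumptions: it assumes each non-empty line of a file holds one JSON object in the documented nesting, and the function names (`read_jsonl`, `iter_image_list`, `iter_responses`) are hypothetical and not part of this repository.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_image_list(records: Iterable[Dict]) -> Iterator[Tuple[str, str, str]]:
    """Flatten {dataset: {label: [image_id, ...]}} into (dataset, label, image_id)."""
    for record in records:
        for dataset, labels in record.items():
            for label, image_ids in labels.items():
                for image_id in image_ids:
                    yield dataset, label, image_id


def iter_responses(records: Iterable[Dict], layout: str) -> Iterator[Tuple[str, str, str]]:
    """Flatten response records into (image_id, prompt, response).

    layout="baseline" expects {prompt: {image_id: response}}.
    layout="reguide" expects {image_id: {prompt: response}}.
    """
    if layout not in ("baseline", "reguide"):
        raise ValueError(f"unknown layout: {layout!r}")
    for record in records:
        for outer_key, inner in record.items():
            for inner_key, response in inner.items():
                if layout == "baseline":
                    yield inner_key, outer_key, response
                else:
                    yield outer_key, inner_key, response


if __name__ == "__main__":
    # Path taken from the directory tree above; adjust to your local checkout.
    for dataset, label, image_id in iter_image_list(
        read_jsonl("dataset/cifar10/subset_25%.jsonl")
    ):
        print(dataset, label, image_id)
        break
```

Both response layouts are normalized to the same `(image_id, prompt, response)` tuple, so baseline and ReGuide outputs can be compared per image without caring which key is nested outermost.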


Provided by:

daintlab
