Few-Shot Remote Sensing Image Domain Generalization
Source: DataCite Commons | Updated: 2024-08-07 | Indexed: 2025-04-16
Download link:
https://ieee-dataport.org/documents/few-shot-remote-sensing-image-domain-generalization
Description:
In recent years, the success of large-scale vision-language models (VLMs) such as CLIP has led to their increased use in a variety of computer vision tasks. These models enable zero-shot inference through carefully crafted instructional text prompts, without task-specific supervision. However, the potential of VLMs for generalization tasks in remote sensing (RS) has not been fully realized. To address this research gap, we propose a novel image-conditioned prompt learning strategy called the Visual Attention Parameterized Prompts Learning Network (APPLeNet). APPLeNet emphasizes the importance of multi-scale feature learning in RS scene classification and disentangles visual style and content primitives for domain generalization tasks. To achieve this, APPLeNet combines visual content features obtained from different layers of the vision encoder with style properties obtained from the feature statistics of domain-specific batches. An attention-driven injection module is further introduced to generate visual tokens from this information. Because this visual information is combined with the textual tokens, we also introduce an anti-correlation regularizer to ensure discrimination among the token embeddings. To validate APPLeNet, we curated four available RS benchmarks and introduced experimental protocols and datasets for three domain generalization tasks. Our results consistently outperform those reported in the relevant literature, and code is available at https://github.com/mainaksingha01/APPLeNet
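The description mentions two ideas that a small example may make more concrete: style properties taken from the feature statistics of a batch, and a regularizer that discourages correlated token embeddings. The sketch below is illustrative only and is not the authors' implementation. The function names `style_statistics` and `decorrelation_penalty` are hypothetical, the choice of per-channel mean and standard deviation as the "feature statistics" is an assumption, and the penalty (mean absolute cosine similarity between token pairs) is one plausible form of such a regularizer, not the one defined in the APPLeNet repository.

```python
import math
from statistics import fmean, pstdev


def style_statistics(batch):
    """Per-channel mean and standard deviation over a batch of feature maps.

    `batch` is a list of samples; each sample is a list of channels; each
    channel is a flat list of spatial activations. Using mean and std as
    the style descriptor is an assumption made for this sketch.
    """
    n_channels = len(batch[0])
    means, stds = [], []
    for c in range(n_channels):
        # Pool activations of channel c across all samples and positions.
        values = [v for sample in batch for v in sample[c]]
        means.append(fmean(values))
        stds.append(pstdev(values))
    return means, stds


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def decorrelation_penalty(tokens):
    """Mean absolute cosine similarity over all distinct token pairs.

    Hypothetical stand-in for an anti-correlation regularizer: it is 0 when
    all token embeddings are mutually orthogonal and grows toward 1 as they
    align with one another.
    """
    sims = [
        abs(cosine(tokens[i], tokens[j]))
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens))
    ]
    return fmean(sims)


if __name__ == "__main__":
    # Two samples, two channels, two spatial positions each.
    batch = [[[1.0, 3.0], [0.0, 0.0]], [[5.0, 7.0], [2.0, 2.0]]]
    print(style_statistics(batch))
    print(decorrelation_penalty([[1.0, 0.0], [0.0, 1.0]]))
```

In a training setup, a penalty of this kind would typically be added to the classification loss with a small weight, so that the generated tokens stay distinct from one another.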
Provider:
IEEE DataPort
Created:
2024-08-07



