UCM-Captions, Sydney-Captions, RSICD, RSITMD, NWPU-Captions, RS5M, SkyScript

Source: GitHub · Updated 2024-12-09 · Indexed 2024-12-10

Download link:

https://github.com/BaolanChen/Awesome-Remote-Sensing-Cross-Modal-Image-Text-Retrieval

Resource overview:

UCM-Captions: Contains 2,100 images with a resolution of 256×256. Sydney-Captions: Contains 613 images with a resolution of 500×500. RSICD: Contains 10,921 images with a resolution of 224×224. RSITMD: Contains 4,743 images with a resolution of 256×256. NWPU-Captions: Contains 31,500 images with a resolution of 256×256. RS5M: Contains over 5 million images with arbitrary resolutions. SkyScript: Contains 5.2 million images with arbitrary resolutions.

Created:

2024-11-19

Original information summary

Awesome-Remote-Sensing-Cross-Modal-Image-Text-Retrieval

Dataset overview

Remote sensing image-text datasets

Dataset	Images	Image resolution	VLMs
UCM-Captions	2,100	256 × 256	-
Sydney-Captions	613	500 × 500	-
RSICD	10,921	224 × 224	-
RSITMD	4,743	256 × 256	-
NWPU-Captions	31,500	256 × 256	-
RS5M	5 million+	All resolutions	GeoRSCLIP
SkyScript	5.2 million+	All resolutions	SkyCLIP

Remote sensing cross-modal image-text retrieval models

Paper	Title	Publication	Institution	Code
CDMAN	Thread the Needle: Cues-Driven Multi-Association for Remote Sensing Cross-Modal Retrieval	TGRS 2024	Wuhan University of Technology	-
MSA	Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval	TGRS 2024	Xidian University	Github
KTIR	Knowledge-aware Text-Image Retrieval for Remote Sensing Images	TGRS 2024	EPFL	-
CMPAGL	Cross-Modal Prealigned Method With Global and Local Information for Remote Sensing Image and Text Retrieval	TGRS 2024	Shanghai Maritime University	Github
FGIS	Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval	JSTARS 2024	Chongqing University	-
EBAKER	Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning	ACMMM 2024	Tianjin University	-
CUP	Cross-Modal Remote Sensing Image–Text Retrieval via Context and Uncertainty-Aware Prompt	TNNLS 2024	Xidian University	Github
CCLS2T	Cross-Modal Contrastive Learning With Spatiotemporal Context for Correlation-Aware Multiscale Remote Sensing Image Retrieval	TGRS 2024	Xidian University	-
MIIA	Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval	TGRS 2024	Northwestern Polytechnical University	-
SARCI	Scale-Aware Adaptive Refinement and Cross-Interaction for Remote Sensing Audio-Visual Cross-Modal Retrieval	TGRS 2024	Wuhan University of Technology	Github
GLISA	Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning	TGRS 2024	China University of Mining and Technology	-
SCAT	Spatial–Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval	TGRS 2024	Northwestern Polytechnical University	-
FSISR	Cross-Modal Hashing With Feature Semi-Interaction and Semantic Ranking for Remote Sensing Ship Image Retrieval	TGRS 2024	Harbin Institute of Technology	-
SkyEyeGPT	Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model	Arxiv 2024	Northwestern Polytechnical University	Github
MFF-SFE	Cross-modal retrieval method based on MFF-SFE for remote sensing image-text	Journal of University of Chinese Academy of Sciences 2024	Aerospace Information Research Institute, Chinese Academy of Sciences	-
RemoteCLIP	RemoteCLIP: A Vision Language Foundation Model for Remote Sensing	TGRS 2024	Hohai University	Github
C2F-ITR	From Coarse To Fine: An Offline-Online Approach for Remote Sensing Cross-Modal Retrieval	IGARSS 2024	Beijing Foreign Studies University	-
MGRM-EL	Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval	TGRS 2024	Northwestern Polytechnical University	-
SIRS	Multitask Joint Learning for Remote Sensing Foreground-Entity Image–Text Retrieval	TGRS 2024	Soochow University	Github
PIR	A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval	ACMMM 2023 oral	Zhejiang University of Technology	Github
PE-RSITR	Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval	TGRS 2023	Northwestern Polytechnical University	Github
HVSA	Hypersphere-Based Remote Sensing Cross-Modal Text–Image Retrieval via Curriculum Learning	TGRS 2023	Aerospace Information Research Institute, Chinese Academy of Sciences	Github
SWAN	Reducing Semantic Confusion Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval	ICMR 2023 oral	Zhejiang University of Technology	Github
KAMCL	Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval	TGRS 2023	Tianjin University	Github
IEFT	Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval	TGRS 2023	Xidian University	Github
Multilanguage Transformer	Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval	JSTARS 2022	King Saud University	-
GaLR	Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information	TGRS 2022	Aerospace Information Research Institute, Chinese Academy of Sciences	Github
AMFMN	Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval	TGRS 2021	Aerospace Information Research Institute, Chinese Academy of Sciences	Github
LW-MCR	A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing	TGRS 2021	Aerospace Information Research Institute, Chinese Academy of Sciences	Github
VSE++	VSE++: Improving Visual-Semantic Embeddings with Hard Negatives	BMVC 2018 spotlight	University of Toronto	Github

Remote sensing vision foundation models

Abbreviation	Title	Publication	Paper	Code & weights
GeoKR	Geographical Knowledge-Driven Representation Learning for Remote Sensing Images	TGRS2021	GeoKR	link
GASSL	Geography-Aware Self-Supervised Learning	ICCV2021	GASSL	link

Remote sensing vision-language foundation models

Abbreviation	Title	Publication	Paper	Code & weights
RSGPT	RSGPT: A Remote Sensing Vision Language Model and Benchmark	Arxiv2023	RSGPT	link
RemoteCLIP	RemoteCLIP: A Vision Language Foundation Model for Remote Sensing	Arxiv2023	RemoteCLIP	link
GeoRSCLIP	RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model	Arxiv2023	GeoRSCLIP	link
GRAFT	Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment	ICLR2024	GRAFT	-

Remote sensing vision-location foundation models

Abbreviation	Title	Publication	Paper	Code & weights
CSP	CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations	ICML2023	CSP	link
GeoCLIP	GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization	NeurIPS2023	GeoCLIP	link
SatCLIP	SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery	Arxiv2023	SatCLIP	link

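Collected summary

Dataset introduction

Construction

In remote sensing, datasets such as UCM-Captions, Sydney-Captions, RSICD, RSITMD, NWPU-Captions, RS5M, and SkyScript were built to support cross-modal image-text retrieval. They pair large numbers of high-resolution remote sensing images with corresponding text descriptions, which gives the collections diversity and broad coverage. The five caption benchmarks use fixed image sizes ranging from 224×224 to 500×500, while RS5M and SkyScript include images at arbitrary resolutions. Together they cover many scene and land-cover types and give models rich visual and semantic information for training.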

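Characteristics

The main characteristics of these datasets are their high-resolution imagery and their diversity, which make them well suited to cross-modal retrieval between remote sensing images and text. Their scale ranges from several hundred images to more than five million, which helps trained models generalize and remain robust. RS5M and SkyScript in particular not only contain very large numbers of images but also span many resolutions, which offers flexibility across application scenarios.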

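Usage

To train a model on these datasets, first select the image-text pairs that suit the task. Images can then be standardized through preprocessing such as resizing and normalization, while the text is typically tokenized and encoded. During training, techniques such as contrastive learning and multimodal fusion can improve cross-modal retrieval performance. Finally, evaluate the model on a validation set and tune it as needed.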
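The contrastive learning step mentioned above can be sketched as a symmetric InfoNCE-style loss over a batch of matched image-text pairs. This is a minimal pure-Python illustration, not the method of any specific model listed on this page. The function names, the toy embeddings, and the temperature of 0.07 are assumptions chosen for the example, and a real training loop would compute this on encoder outputs with a tensor library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched image-text pairs.

    image_embs[i] and text_embs[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    """
    if len(image_embs) != len(text_embs) or not image_embs:
        raise ValueError("expected two non-empty batches of equal size")
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by temperature: rows are images, columns are texts.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: the correct text for image k sits in column k.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch: two images and two captions in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(images, aligned_texts))
print(symmetric_contrastive_loss(images, shuffled_texts))
```

In this toy batch the aligned pairing yields a much lower loss than the shuffled one, which is the behavior the loss is meant to reward during training.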

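Background and challenges

Background overview

Remote sensing plays an important role in modern geographic information systems, environmental monitoring, and disaster management. In recent years, as cross-modal data processing has matured, the joint analysis of remote sensing images and text has become an active research topic. Datasets such as UCM-Captions, Sydney-Captions, RSICD, RSITMD, NWPU-Captions, RS5M, and SkyScript were created to advance research on cross-modal retrieval between remote sensing images and text. These datasets were developed by a number of well-known institutions, such as Wuhan University, Xidian University, and King Saud University. They mainly address the problem of semantic alignment between remote sensing images and text, and they matter for improving how remote sensing data is understood and applied.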

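Current challenges

Building these datasets involves several challenges. First, the semantic gap between remote sensing images and natural-language descriptions is large, so matching an image accurately to its text description is difficult. Second, dataset construction requires processing large volumes of high-resolution imagery, which places heavy demands on storage and compute. Third, standardization and interoperability across datasets remain unresolved, and both are needed for reproducible and widely applicable research. Finally, as remote sensing technology advances, the datasets need continual updates to reflect new techniques and application needs.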

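Common scenarios

Classic use cases

In remote sensing, the classic use of datasets such as UCM-Captions, Sydney-Captions, RSICD, RSITMD, NWPU-Captions, RS5M, and SkyScript is cross-modal image-text retrieval (RSCMIT). By providing large collections of remote sensing images with corresponding text descriptions, they let researchers develop and validate vision-and-language models. For example, they are commonly used to train and evaluate image-text matching models that retrieve relevant remote sensing images from a text description, or the reverse. They are also used to study feature alignment in multimodal learning, with the aim of improving cross-modal understanding and reasoning.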
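Retrieval models of this kind are often scored with Recall@K, the fraction of queries for which at least one correct match appears among the top K ranked candidates. The sketch below assumes a precomputed query-by-candidate similarity matrix and a set of relevant candidate indices per query. The function name and the toy scores are illustrative only and are not taken from any benchmark.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose top-k candidates include a correct match.

    similarity[q][c] is the score between query q and candidate c.
    ground_truth[q] is the set of candidate indices relevant to query q.
    """
    if not similarity:
        raise ValueError("similarity matrix must contain at least one query")
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if ground_truth[q] & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy text-to-image example: 3 text queries scored against 3 images.
scores = [
    [0.9, 0.1, 0.2],  # query 0 ranks image 0 first (correct)
    [0.3, 0.2, 0.8],  # query 1 ranks its correct image 1 last
    [0.1, 0.7, 0.6],  # query 2 ranks its correct image 2 second
]
relevant = [{0}, {1}, {2}]
for k in (1, 2, 3):
    print(k, recall_at_k(scores, relevant, k))
```

Transposing the similarity matrix and swapping the roles of queries and candidates gives the image-to-text direction, so the same function covers both retrieval directions.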

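Derived related work

Researchers have built a range of notable work on these remote sensing image-text datasets. For example, models such as CDMAN, MSA, and KTIR introduce multimodal alignment and knowledge enhancement to improve image-text retrieval accuracy significantly. Methods such as CMPAGL and CCLS2T further improve cross-modal retrieval by combining global and local information. In addition, models such as SkyEyeGPT and RemoteCLIP use large-scale pretraining and instruction tuning to deliver stronger remote sensing vision-language foundation models. This work has drawn wide attention in academia and has shown considerable potential in practical applications, driving continued progress in remote sensing cross-modal retrieval.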

Recent research on the datasets