Research on Cross-Modal Image-Text Retrieval Based on Cross Attention and Feature Aggregation

Source: China Scientific Data; updated 2026-02-09; indexed 2026-04-25

Download link:

https://www.sciengine.com/AA/doi/10.19678/j.issn.1000-3428.0070119

Resource description:

Image-text retrieval has become an important research direction in cross-modal learning. However, existing methods that aggregate features from multiple modalities face two major challenges: insufficient feature alignment between modalities and loss of semantic representation within each modality. To address the representation of feature information within modalities, a cross-modal image-text retrieval model based on cross attention and feature aggregation is proposed. The model comprises modules for image and text feature extraction, cross attention, feature pooling, and feature fusion. It uses a triplet loss function to mine local information in images and text, yielding image and text feature representations that capture deep semantic relationships. The model adopts an attention fusion strategy that regulates the fusion of fine-grained image and text features through learnable weight parameters. A feature pooling module is designed that aggregates image region features and text sequence features separately, learns its weight parameters through neural networks, and combines multiple similarities to guide model learning. This module flexibly handles variable-length sequences of image and text features, improving the model's ability to capture cross-modal information. Comparative experiments on the public datasets MS COCO and Flickr 30k show that the model achieves higher retrieval performance than various existing image-text retrieval models. It shows advantages in semantic feature pooling and dimensionality reduction and offers new ideas for cross-modal feature fusion.
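The three mechanisms named in the description (cross attention with a learnable fusion weight, learned-weight pooling over variable-length features, and a triplet loss) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the scalar `gate`, the single scoring vector `w` standing in for the weight-learning network, and the bidirectional hinge form of the triplet loss with `margin=0.2` are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(wt * v[d] for wt, v in zip(weights, vectors)) for d in range(dim)]


def cross_attend(queries, context, gate):
    """Each query vector attends over the other modality's vectors.

    `gate` is a hypothetical learnable scalar in [0, 1] that mixes the
    original feature with its attended counterpart.
    """
    fused = []
    for q in queries:
        weights = softmax([dot(q, c) for c in context])
        attended = weighted_sum(weights, context)
        fused.append([gate * x + (1.0 - gate) * y for x, y in zip(q, attended)])
    return fused


def weighted_pool(features, w):
    """Aggregate any number of feature vectors into a single vector.

    `w` is a hypothetical learnable scoring vector; a real model would
    likely use a small neural network to produce the scores.
    """
    weights = softmax([dot(f, w) for f in features])
    return weighted_sum(weights, features)


def cosine(a, b):
    # Assumes neither vector is all zeros.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def triplet_loss(sim_pos, sim_neg_text, sim_neg_image, margin=0.2):
    """Bidirectional hinge loss on similarities (an assumed form).

    sim_pos: similarity of the matched image-text pair.
    sim_neg_text: similarity of the image with a non-matching text.
    sim_neg_image: similarity of the text with a non-matching image.
    """
    return (max(0.0, margin - sim_pos + sim_neg_text)
            + max(0.0, margin - sim_pos + sim_neg_image))
```

In this sketch, an image with any number of region vectors and a caption with any number of word vectors are each passed through `cross_attend` and `weighted_pool`, compared with `cosine`, and the resulting similarities are fed to `triplet_loss`. Because the pooling weights come from a softmax over per-vector scores, the same parameters apply regardless of sequence length.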

Created:

2026-02-09
