Improving image-text alignment with an optimal feature sub-space-aware similarity learning framework
中国科学数据 | Updated: 2026-04-17 | Indexed: 2026-04-25
Download link:
https://www.sciengine.com/AA/doi/10.1007/s11432-024-4845-2
Resource description:
Image-text alignment is a fundamental cross-modal research topic that bridges vision and language. Its key challenge lies in accurately measuring the similarity of these two heterogeneous modalities. For visual and textual features, most existing methods use cosine or Euclidean distance to measure similarity, examining the modality features directly in the whole representation space. However, we discover that partial local dimensions, which form sub-spaces with a potential semantic representation tendency, contain more important semantic measurement information. Thus, we argue that existing methods fail to focus on the finer alignment of critical sub-spaces composed of partial dimensions, leading to limited and inaccurate similarity learning. To address this problem, we propose a novel optimal feature sub-space-aware similarity learning framework (OPEN), which takes a step forward by focusing on the sub-spaces composed of local dimensions within modality representations, enabling more subtle semantic alignment and similarity measurement. Specifically, we first construct hierarchical sub-space-aware patterns for learning similarity, i.e., sub-spaces composed of different sizes of local dimensions. Then, for the optimality of OPEN, there are two new aspects: (1) optimal sub-space-aware patterns, where we reveal which size level of local dimensions in the sub-space pattern can achieve the optimal performance gains with maximum probability; (2) optimal combined sub-space-aware patterns, in which we mine the optimal complementarities of different size-level patterns. The proposed OPEN is plug-and-play, and we experiment with it extensively on typical cross-modal alignment paradigms and datasets. OPEN offers consistent and significant performance improvements across different settings, verifying its superiority in simplicity, generality, and effectiveness.
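The contrast between whole-space similarity and sub-space-aware similarity can be illustrated with a minimal Python sketch. It is not the OPEN implementation: the abstract does not specify how sub-spaces are formed or combined, so this sketch assumes contiguous, non-overlapping blocks of dimensions and a plain average across blocks and block sizes. The function names `subspace_similarity` and `hierarchical_similarity` and the toy feature vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def subspace_similarity(u, v, size):
    """Mean cosine similarity over contiguous, non-overlapping blocks of `size` dims.

    Assumption: a "sub-space" is a contiguous block of dimensions. The source
    abstract does not state how sub-spaces are actually chosen.
    """
    if len(u) != len(v) or size <= 0 or len(u) % size != 0:
        raise ValueError("vectors must have equal length divisible by size")
    sims = [cosine(u[i:i + size], v[i:i + size]) for i in range(0, len(u), size)]
    return sum(sims) / len(sims)


def hierarchical_similarity(u, v, sizes):
    """Unweighted average of sub-space similarities across several block sizes.

    Assumption: size levels are combined by a simple mean; a learned or
    searched combination would replace this in practice.
    """
    return sum(subspace_similarity(u, v, s) for s in sizes) / len(sizes)


# Toy features: the first block agrees exactly, the second is orthogonal.
image_feat = [1.0, 0.0, 2.0, 0.0]
text_feat = [1.0, 0.0, 0.0, 2.0]

global_sim = cosine(image_feat, text_feat)                  # whole-space cosine
block_sim = subspace_similarity(image_feat, text_feat, 2)   # two 2-dim sub-spaces
combined_sim = hierarchical_similarity(image_feat, text_feat, [2, 4])
```

In this toy case the whole-space cosine is dominated by the mismatched large-magnitude block, while the block-wise score gives the perfectly aligned first sub-space equal weight. This is only one simple way that a sub-space view can change the similarity measurement.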
Created:
2026-03-10



