Experimental results of TBKIN on RefCOCO.
收藏Figshare2025-06-10 更新2026-04-28 收录
下载链接:
https://figshare.com/articles/dataset/Experimental_results_of_TBKIN_on_RefCOCO_/29284538
下载链接
链接失效反馈官方服务:
资源简介:
Vision-language models aim to seamlessly integrate visual and linguistic information for multi-modal tasks, demanding precise semantic alignments between image-text pairs while minimizing the influence of irrelevant data. While existing methods leverage intra-modal and cross-modal knowledge to enhance alignments, they often fall short in sufficiently reducing interference, which ultimately constrains model performance. To address this gap, we propose a novel vision-language model, the threshold-based knowledge integration network (TBKIN), designed to effectively capture intra-modal and cross-modal knowledge while systematically mitigating the impact of extraneous information. TBKIN employs unified scene graph structures and advanced masking strategies to strengthen semantic alignments and introduces a fine-tuning strategy based on threshold selection to eliminate noise. Comprehensive experimental evaluations demonstrate the efficacy of TBKIN, with our best model achieving state-of-the-art accuracy of 73.90% on the VQA 2.0 dataset and 84.60% on the RefCOCO dataset. Attention visualization and detailed result analysis further validate the robustness of TBKIN in tackling vision-language tasks. The model’s ability to reduce interference while enhancing semantic alignments underscores its potential for advancing multi-modal learning. Extensive experiments across four widely-used benchmark datasets confirm its superior performance on two typical vision-language tasks, offering a practical and effective solution for real-world applications.
创建时间:
2025-06-10



