Proposed methodology results.
Source: Figshare. Updated: 2026-03-16. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/_p_Proposed_methodology_results_p_/31757635
Description:
This paper presents a novel deep learning model for image captioning that combines an advanced vision transformer architecture with a powerful large language model (LLM). The proposed model shows a significant improvement over traditional CNN-RNN hybrids and existing transformer-based approaches by integrating a unique cross-attention mechanism that enables deep alignment between linguistic context and visual features. We demonstrate the superiority of the proposed architecture through extensive evaluation on the MS COCO, Flickr30K, and NoCaps datasets. The proposed model consistently outperforms leading methods such as GIT, BLIP-2, and CoCa across a comprehensive suite of metrics. On the MS COCO dataset, it achieves BLEU-4, METEOR, and CIDEr scores of 0.495, 0.390, and 1.32, respectively. We also critically analyze key challenges in the field, including enhancing caption diversity, ensuring robust multimodal alignment, and mitigating inherent biases. By setting a new performance benchmark, the proposed model serves as a reference point for the next generation of image captioning systems. The results demonstrate the effectiveness of our fusion strategy and support the development of models that produce more precise, contextually rich, and human-like image descriptions. This work supports SDG 9 (Industry, Innovation, and Infrastructure) by advancing multimodal AI systems, and SDG 4 (Quality Education) by enabling intelligent and accessible image understanding technologies.
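The description mentions a cross-attention mechanism that aligns linguistic context with visual features but gives no implementation details. As a rough illustration only, the sketch below shows generic single-head scaled dot-product cross-attention, in which text token states act as queries over image patch features. This is not the authors' architecture: the function name, the projection matrices, and all dimensions are hypothetical, and the inputs are random placeholders rather than real model activations.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(text_states, image_feats, w_q, w_k, w_v):
    """Generic single-head cross-attention (hypothetical sketch).

    Text tokens supply the queries; image patches supply keys and values.
    """
    q = text_states @ w_q                      # (T, d) queries from text
    k = image_feats @ w_k                      # (P, d) keys from image patches
    v = image_feats @ w_v                      # (P, d) values from image patches
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, P) scaled similarities
    weights = softmax(scores, axis=-1)         # each text token's weights over patches
    return weights @ v, weights                # (T, d) fused states, (T, P) weights


# Placeholder sizes, chosen only for illustration.
rng = np.random.default_rng(0)
T, P, d_text, d_img, d = 4, 9, 16, 32, 8
text_states = rng.normal(size=(T, d_text))     # stand-in for language model hidden states
image_feats = rng.normal(size=(P, d_img))      # stand-in for vision transformer patch features
w_q = rng.normal(size=(d_text, d))
w_k = rng.normal(size=(d_img, d))
w_v = rng.normal(size=(d_img, d))

fused, weights = cross_attention(text_states, image_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to 1, so every text token receives a convex combination of the projected patch features. A production model would typically use multiple heads, learned projections, and residual connections, none of which are specified in this record.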
Created:
2026-03-16



