five

Common datasets for image captioning.

收藏
Figshare2026-03-16 更新2026-04-28 收录
下载链接:
https://figshare.com/articles/dataset/_p_Common_datasets_for_image_captioning_p_/31757617
下载链接
链接失效反馈
官方服务:
资源简介:
This paper provides a novel deep learning model for captioning of images by using an advanced vision transformer architecture with a powerful LLM. Proposed models show a significant improvement over traditional CNN-RNN hybrids and existing transformer-based approaches by integrating a unique cross-attention mechanism that enables deep alignment between linguistic context and visual features. We show the superiority of our proposed architecture through extensive evaluation on different datasets like MSCOCO, Flickr30K, and NoCaps. The proposed model consistently shows good performance for leading methods such as GIT, BLIP-2, and CoCa across a comprehensive suite of metrics. On the MS COCO dataset, the BLEU-4, METEOR, and CIDEr scores of proposed models are equal to 0.495, 0.390, and 1.32, respectively. In this paper, we have critically analyzed the key challenges of this field, like enhancing caption diversity, ensuring robust multimodal alignment, and mitigating inherent biases. By providing a new performance level, the proposed model provides a source of reference for the next generation of image captioning systems. The results show the efficiency of our fusion strategy and facilitate the development of techniques that use models that can produce more precise, contextually rich, and human-like image depictions. This work supports SDG 9 (Industry, Innovation, and Infrastructure) by advancing multimodal AI systems, and SDG 4 (Quality Education) by enabling intelligent and accessible image understanding technologies.
创建时间:
2026-03-16
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作