Validation of module effectiveness.

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://figshare.com/articles/dataset/Validation_of_module_effectiveness_/29463396

Description:

To address the limited multimodal feature fusion capabilities of current visual question answering (VQA) models and their inadequate consideration of local information, this study proposes a multimodal Transformer VQA network based on local and global information integration (LGMTNet). LGMTNet applies attention to local features within the context of global features, enabling it to capture both broad and detailed image information simultaneously. It constructs a deep encoder-decoder module that directs image feature attention based on the question context, thereby enhancing visual-language feature fusion. A multimodal representation module is then designed to focus on essential question terms, reducing linguistic noise and extracting multimodal features. Finally, a feature aggregation module concatenates multimodal and question features to deepen question comprehension. Experimental results demonstrate that LGMTNet effectively focuses on local image features, integrates multimodal knowledge, and enhances feature fusion capabilities.
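The two mechanisms the description names, question-guided attention over image features and concatenation-based feature aggregation, can be illustrated with a minimal NumPy sketch. This is not the LGMTNet implementation: the function names, the feature dimensions, the random inputs, the additive way of placing local features in a global context, and the mean pooling are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(question, regions):
    """Scaled dot-product attention: question tokens query image regions."""
    d = question.shape[-1]
    scores = question @ regions.T / np.sqrt(d)  # (n_tokens, n_regions)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ regions, weights           # (n_tokens, d), (n_tokens, n_regions)

def aggregate(attended, question):
    """Concatenate mean-pooled multimodal and question features (assumed pooling)."""
    return np.concatenate([attended.mean(axis=0), question.mean(axis=0)])

rng = np.random.default_rng(0)
question = rng.normal(size=(5, 16))  # 5 hypothetical question-token embeddings
local = rng.normal(size=(9, 16))     # 9 hypothetical local region features

# Assumption: a mean over regions stands in for the global image feature,
# and adding it to each region stands in for "local within global context".
global_feat = local.mean(axis=0, keepdims=True)
regions = local + global_feat

attended, weights = question_guided_attention(question, regions)
fused = aggregate(attended, question)
print(fused.shape)  # (32,)
```

In this sketch the fused vector has twice the feature dimension because the pooled multimodal and question features are concatenated rather than summed, which keeps the question signal separate for any downstream classifier.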

Created:

2025-07-02
