Multimodal reasoning of visual information and natural language

Mendeley Data. Updated 2024-01-31. Indexed 2024-06-28.

Download link:

https://digitallibrary.usc.edu/asset-management/2A3BF1PGN59T

Description:

Multimodal reasoning focuses on learning the correlations between the different modalities present in multimedia samples. It is an important task with many applications in daily life, e.g., autonomous driving, robotics question answering, and image retrieval engines. It is also a challenging task, closely related to machine learning, natural language processing, and other research areas in computer science. Typically, multimodal reasoning can be divided into three levels: i) the coarse level treats each modality as a uniform sample and focuses on learning inter-modal correlation; ii) the fine-grained level considers each modality's own characteristics and delves into fine-grained correlation learning; iii) the knowledge level leverages external knowledge to handle more complex question-answering-style reasoning.

This thesis describes my solutions at these three levels of multimodal reasoning. Most of it focuses on the interaction between the natural language modality and the visual modality. The first part addresses the image retrieval problem, which lies at the coarse level. We introduce an attention mechanism that attends to useful information within each modality and weights each modality's importance, which boosts image retrieval performance.

In the second part, we address the phrase grounding problem, which lies at the fine-grained level. We introduce a regression mechanism, reinforcement learning techniques, and multimodal consistency step by step, moving from the supervised learning scenario to the weakly supervised scenario. All of these techniques bring concrete improvements in phrase grounding performance.

In the third part, we explore the Visual Question Answering (VQA) problem at the knowledge level. Similar to human behavior in VQA, we introduce an attention mechanism that attends to useful regions conditioned on the semantics of the input query, which filters out noise in the visual modality when answering questions.

Finally, we illustrate our recent efforts in reasoning over other modalities. We address the problem of generating sound waves from video content, where we consider fine-grained information and adopt a perceptual loss to further polish the generated sound waves. In addition, we present some interesting problems that we plan to address in the future.

This thesis demonstrates the effectiveness of all the algorithms on a range of publicly available video or image datasets.

Created:

2024-01-31
