Jisuanji kexue (Feb 2023)

Visual Question Answering Model Based on Multi-modal Deep Feature Fusion

  • ZOU Yunzhu, DU Shengdong, TENG Fei, LI Tianrui

DOI
https://doi.org/10.11896/jsjkx.211200303
Journal volume & issue
Vol. 50, no. 2
pp. 123 – 129

Abstract


In the era of big data, with the explosive growth of multi-source heterogeneous data, multi-modal data fusion has attracted much attention from researchers, and visual question answering (VQA) has become a hot topic in multi-modal data fusion because it requires joint processing of images and text. The VQA task centers on the deep fusion, association, and representation of image and text modalities, followed by inference over the fused features to produce an answer. Traditional VQA models tend to miss key information: they mostly focus on shallow associations between modal features and pay less attention to deep semantic feature fusion. To address these problems, this paper proposes a visual question answering model based on cross-modal deep interaction of image and text features. The proposed method uses a convolutional neural network and an LSTM network to extract image and text features respectively, and builds a novel deep attention learning network composed of meta-attention units to realize interactive attention learning within and between the image and text modalities. Finally, the learned representation is used to output the answer. The model is evaluated on the VQA-v2.0 dataset, and experimental results show that its performance is significantly improved compared with traditional baseline models.
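To make the described pipeline concrete, the following is a minimal sketch, assuming PyTorch, of a CNN-feature + LSTM encoder combined through stacked attention units for intra- and inter-modal interaction. The class names (MetaAttentionUnit, SimpleVQAFusion), feature dimensions, and answer-vocabulary size are illustrative assumptions; the paper's actual meta-attention unit design and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class MetaAttentionUnit(nn.Module):
    """Hypothetical attention block. With query == context it acts as
    intra-modal (self) attention; with different inputs it acts as
    inter-modal (guided) attention."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)            # residual connection + layer norm

class SimpleVQAFusion(nn.Module):
    """Illustrative fusion model: image region features (e.g. from a CNN)
    and question features (from an LSTM) interact through attention units,
    and the fused representation is classified over candidate answers."""
    def __init__(self, img_dim=2048, word_dim=300, hidden=512, num_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)       # project CNN region features
        self.lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.self_attn_q = MetaAttentionUnit(hidden)     # intra-modal: question
        self.self_attn_v = MetaAttentionUnit(hidden)     # intra-modal: image
        self.cross_attn = MetaAttentionUnit(hidden)      # inter-modal: image guided by question
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers))

    def forward(self, img_feats, q_embeds):
        v = self.img_proj(img_feats)                     # (B, regions, hidden)
        q, _ = self.lstm(q_embeds)                       # (B, words, hidden)
        q = self.self_attn_q(q, q)                       # attention within text
        v = self.self_attn_v(v, v)                       # attention within image
        v = self.cross_attn(v, q)                        # image attends to question
        fused = torch.cat([v.mean(dim=1), q.mean(dim=1)], dim=-1)
        return self.classifier(fused)                    # logits over candidate answers

# Usage with random tensors standing in for real CNN / word-embedding features
model = SimpleVQAFusion()
img = torch.randn(4, 36, 2048)    # 36 region features per image
qst = torch.randn(4, 14, 300)     # 14 word embeddings per question
logits = model(img, qst)          # shape: (4, 3000)
```

In this sketch the intra-modal and inter-modal steps reuse the same attention unit, which mirrors the idea of composing a deep attention network from a small reusable building block rather than designing separate modules for each interaction.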

Keywords