Jisuanji kexue (Feb 2023)

Visual Question Answering Model Based on Multi-modal Deep Feature Fusion

  • ZOU Yunzhu, DU Shengdong, TENG Fei, LI Tianrui

DOI
https://doi.org/10.11896/jsjkx.211200303
Journal volume & issue
Vol. 50, no. 2
pp. 123 – 129

Abstract


In the era of big data, with the explosive growth of multi-source heterogeneous data, multi-modal data fusion has attracted much attention from researchers, and visual question answering (VQA) has become a hot topic in multi-modal data fusion because it requires joint processing of images and text. The VQA task centers on the deep fusion, association, and representation of image and text modalities, followed by inference over the fused features to produce an answer. Traditional VQA models tend to miss key information: they mostly focus on shallow associations between modal features and pay less attention to deep semantic feature fusion. To address these problems, this paper proposes a visual question answering model based on cross-modal deep interaction of image and text features. The proposed method uses a convolutional neural network and an LSTM network to extract image and text features respectively, and builds a novel deep attention learning network composed of meta-attention units to realize interactive attention learning within and between the image and text modalities. Finally, the learned representation is used to output the answer. The model is evaluated on the VQA-v2.0 dataset, and experimental results show that its performance is significantly improved compared with traditional baseline models.
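To make the described pipeline concrete, the following is a minimal sketch, assuming PyTorch, of a CNN-feature + LSTM encoder combined through stacked attention units for intra- and inter-modal interaction. The class names (MetaAttentionUnit, SimpleVQAFusion), feature dimensions, and answer-vocabulary size are illustrative assumptions; the paper's actual meta-attention unit design and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class MetaAttentionUnit(nn.Module):
    """Hypothetical attention block. With query == context it acts as
    intra-modal (self) attention; with different inputs it acts as
    inter-modal (guided) attention."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)            # residual connection + layer norm

class SimpleVQAFusion(nn.Module):
    """Illustrative fusion model: image region features (e.g. from a CNN)
    and question features (from an LSTM) interact through attention units,
    and the fused representation is classified over candidate answers."""
    def __init__(self, img_dim=2048, word_dim=300, hidden=512, num_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)       # project CNN region features
        self.lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.self_attn_q = MetaAttentionUnit(hidden)     # intra-modal: question
        self.self_attn_v = MetaAttentionUnit(hidden)     # intra-modal: image
        self.cross_attn = MetaAttentionUnit(hidden)      # inter-modal: image guided by question
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers))

    def forward(self, img_feats, q_embeds):
        v = self.img_proj(img_feats)                     # (B, regions, hidden)
        q, _ = self.lstm(q_embeds)                       # (B, words, hidden)
        q = self.self_attn_q(q, q)                       # attention within text
        v = self.self_attn_v(v, v)                       # attention within image
        v = self.cross_attn(v, q)                        # image attends to question
        fused = torch.cat([v.mean(dim=1), q.mean(dim=1)], dim=-1)
        return self.classifier(fused)                    # logits over candidate answers

# Usage with random tensors standing in for real CNN / word-embedding features
model = SimpleVQAFusion()
img = torch.randn(4, 36, 2048)    # 36 region features per image
qst = torch.randn(4, 14, 300)     # 14 word embeddings per question
logits = model(img, qst)          # shape: (4, 3000)
```

In this sketch the intra-modal and inter-modal steps reuse the same attention unit, which mirrors the idea of composing a deep attention network from a small reusable building block rather than designing separate modules for each interaction.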

Keywords