International Journal of Computational Intelligence Systems (Apr 2023)

Multiscale Feature Extraction and Fusion of Image and Text in VQA

  • Siyu Lu,
  • Yueming Ding,
  • Mingzhe Liu,
  • Zhengtong Yin,
  • Lirong Yin,
  • Wenfeng Zheng

DOI
https://doi.org/10.1007/s44196-023-00233-6
Journal volume & issue
Vol. 16, no. 1
pp. 1 – 11

Abstract

The Visual Question Answering (VQA) task consists of finding the information in an image that is relevant to a question in order to answer it correctly. VQA can be widely applied in visual assistance, automated security surveillance, and intelligent human-robot interaction. However, its accuracy remains unsatisfactory, and the main difficulty is that image features do not adequately represent scene and object information, while text features do not fully represent the question. This paper applies multi-scale feature extraction and fusion to both the image representation and the text representation of the VQA system to improve its accuracy. First, to address image representation, features output by different layers of a pre-trained deep neural network are extracted and fused, and the best fusion scheme is identified experimentally. Second, for sentence representation, a multi-scale method is introduced that characterizes and fuses word-level, phrase-level, and sentence-level features. Finally, the VQA model is improved with these multi-scale extraction and fusion methods. The results show that adding multi-scale feature extraction and fusion improves the accuracy of the VQA model.
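Since the abstract only outlines the approach, the sketch below illustrates one way the image-side multi-scale extraction and fusion could look in PyTorch. The choice of backbone (ResNet-50), the layers tapped, and fusion by pooled concatenation are assumptions for illustration, not the authors' exact configuration; a recent torchvision is assumed for the pretrained weights API.

```python
# Hedged sketch: multi-scale image features from intermediate layers of a
# pre-trained ResNet-50, pooled to vectors and fused by concatenation.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiScaleImageFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Keep the stem and the four residual stages; their outputs carry
        # features at different spatial scales and semantic levels.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapse each scale to a vector

    def forward(self, images):               # images: (B, 3, H, W)
        x = self.stem(images)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(self.pool(x).flatten(1))   # (B, C_i) per scale
        return torch.cat(feats, dim=1)       # fused multi-scale descriptor

if __name__ == "__main__":
    model = MultiScaleImageFeatures().eval()
    with torch.no_grad():
        fused = model(torch.randn(2, 3, 224, 224))
    print(fused.shape)  # torch.Size([2, 3840]) = 256+512+1024+2048 channels
```

The same fusion pattern could, in principle, be mirrored on the text side by concatenating word-level, phrase-level, and sentence-level encodings of the question, as the abstract describes.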

Keywords