Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering

Qifeng Li; Xinyi Tang; Yi Jian

doi:10.3390/s22041575

Sensors (Feb 2022)

Affiliations

Qifeng Li: Shanghai Institute of Technical Physics of the Chinese Academy of Sciences, Shanghai 200083, China
Xinyi Tang: Shanghai Institute of Technical Physics of the Chinese Academy of Sciences, Shanghai 200083, China
Yi Jian: Shanghai Institute of Technical Physics of the Chinese Academy of Sciences, Shanghai 200083, China

DOI: https://doi.org/10.3390/s22041575
Journal volume & issue: Vol. 22, no. 4, p. 1575

Abstract

Collaborative reasoning for knowledge-based visual question answering is challenging but vital for understanding the features of images and questions. Previous methods either jointly fuse all kinds of features through an attention mechanism or use handcrafted rules to generate a layout for compositional reasoning; these approaches lack an explicit visual reasoning process and introduce a large number of parameters for predicting the correct answer. To conduct visual reasoning on all kinds of image–question pairs, in this paper we propose a novel reasoning model, a question-guided tree structure with a knowledge base (QGTSKB), to address these problems. Our model consists of four neural module networks: the attention model, which locates attended regions based on the image features and question embeddings using an attention mechanism; the gated reasoning model, which forgets and updates the fused features; the fusion reasoning model, which mines high-level semantics of the attended visual features and the knowledge base; and the knowledge-based fact model, which makes up for the lack of visual and textual information with external knowledge. Our model thus performs visual analysis and reasoning based on tree structures, a knowledge base, and four neural module networks. Experimental results show that our model achieves superior performance over existing methods on the VQA v2.0 and CLEVR datasets, and visual reasoning experiments demonstrate the interpretability of the model.
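
As a rough illustration of how question-guided attention and a gated update might be combined over a tree, the sketch below walks a small question tree and fuses the attended visual features from children into their parent. It is not the authors' implementation: the function names (`attend`, `gated_update`, `reason_on_tree`), the dot-product scoring, the parameter-free toy gate, and the dictionary-based tree are all assumptions made for illustration, and the fusion and knowledge-base modules are omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attend(regions, question_vec):
    """Hypothetical attention step: weight each image region by its
    dot-product similarity to a question (word) embedding."""
    weights = softmax([dot(r, question_vec) for r in regions])
    dim = len(question_vec)
    attended = [sum(w * r[i] for w, r in zip(weights, regions))
                for i in range(dim)]
    return attended, weights

def gated_update(state, candidate):
    """Hypothetical gate: per dimension, keep part of the old state
    (forget) and admit part of the candidate (update). A trained model
    would compute the gate from learned weights; this toy gate has none."""
    out = []
    for s, c in zip(state, candidate):
        g = sigmoid(s * c)
        out.append(g * c + (1.0 - g) * s)
    return out

def reason_on_tree(node, regions, embed):
    """Recursive traversal: attend at this node, then fold in each
    child's result through the gate."""
    state, _ = attend(regions, embed[node["word"]])
    for child in node.get("children", []):
        state = gated_update(state, reason_on_tree(child, regions, embed))
    return state

# Toy inputs (assumed): two image regions and two question words.
regions = [[1.0, 0.0], [0.0, 1.0]]
embed = {"color": [2.0, 0.0], "cup": [0.0, 2.0]}
tree = {"word": "color", "children": [{"word": "cup"}]}
result = reason_on_tree(tree, regions, embed)
```

In this toy setup, each node's output is a convex combination of region features, so the result stays within the range of the inputs; a learned version would replace the dot product and the gate with parameterized layers.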

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
