Robotics (Aug 2023)

SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering

  • Feiqi Cao,
  • Siwen Luo,
  • Felipe Nunez,
  • Zean Wen,
  • Josiah Poon,
  • Soyeon Caren Han

DOI
https://doi.org/10.3390/robotics12040114
Journal volume & issue
Vol. 12, no. 4
p. 114

Abstract

Visual Question Answering (VQA) models fail catastrophically on questions that require reading text in an image. TextVQA addresses this by answering questions through understanding the scene text in an image–question context, such as the brand name of a product or the time on a clock. Most TextVQA approaches focus on detecting objects and scene text, which are then integrated with the question words by a simple transformer encoder. These approaches rely on shared weights trained across the multi-modal inputs, but they fail to capture the semantic relations between an image and a question. In this paper, we propose a Scene Graph-Based Co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, the Optical Character Recognition (OCR) tokens and the question words. This is achieved with a TextVQA-based scene graph that discovers the underlying semantics of an image. We create a guided-attention module that captures the intra-modal interplay between language and vision and uses it to guide inter-modal interactions. To explicitly teach the relations between the two modalities, we propose and integrate two attention modules: a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conduct extensive experiments on two widely used benchmark datasets, Text-VQA and ST-VQA, and show that SceneGATE outperforms existing methods thanks to the scene graph and its attention modules.
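To illustrate the idea of relation-aware attention described in the abstract, the following is a minimal sketch (not the authors' released code) of how a scene graph could constrain attention: standard scaled dot-product attention whose logits are masked so that a token only attends to tokens it is connected to in a scene-graph adjacency matrix. The class name `RelationAwareAttention` and the `adjacency` argument are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn as nn


class RelationAwareAttention(nn.Module):
    """Multi-head attention restricted by a scene-graph adjacency matrix (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        # x:         (batch, seq_len, dim) fused object / OCR-token / question features
        # adjacency: (batch, seq_len, seq_len), 1 where a semantic or positional
        #            relation links token i to token j (include self-loops so every
        #            row has at least one valid key)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq_len, head_dim)
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # block attention between tokens that share no relation in the graph
        mask = adjacency.unsqueeze(1) == 0  # (batch, 1, seq_len, seq_len)
        logits = logits.masked_fill(mask, float("-inf"))
        attn = torch.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out)


if __name__ == "__main__":
    layer = RelationAwareAttention(dim=64, num_heads=4)
    feats = torch.randn(2, 10, 64)
    adj = torch.eye(10).unsqueeze(0).repeat(2, 1, 1)  # toy graph: self-loops only
    print(layer(feats, adj).shape)  # torch.Size([2, 10, 64])
```

In this reading, the positional relation-aware attention would use an adjacency built from spatial relations between object and OCR bounding boxes, while the semantic variant would use edges predicted by the scene graph; the paper should be consulted for the exact formulation.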

Keywords