IEEE Access (Jan 2024)
Spatio-Temporal Graph Convolution Transformer for Video Question Answering
Abstract
Video question answering (VideoQA) methods built on video-text pretraining models rely on intricate unimodal encoders and multimodal fusion Transformers, which often reduces efficiency in tasks such as visual reasoning. Conversely, VideoQA methods based on graph neural networks tend to underperform in video description and reasoning because of their simplistic graph construction and cross-modal interaction designs, and they require additional pretraining data to close the gap. In this work, we introduce the Spatio-temporal Graph Convolution Transformer (STCT) for VideoQA. By combining Spatio-temporal Graph Convolution (STGC) with dynamic graph Transformers, the model explicitly captures spatio-temporal relationships among visual objects, enabling dynamic object interactions and stronger visual reasoning. The model further introduces a novel cross-modal interaction scheme that uses dynamic graph attention to reweight visual objects according to the posed question, improving multimodal cooperative perception. Through this carefully designed graph structure and cross-modal interaction mechanism, the model avoids the dependence of prior graph-based methods on pretraining for performance gains, achieving superior results in visual description and reasoning with simpler unimodal encoders and multimodal fusion modules. Comprehensive experiments on the NExT-QA, MSVD-QA, and MSRVTT-QA datasets confirm the model's strong video reasoning and description capabilities.
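To make the question-conditioned attention idea concrete, the following is a minimal PyTorch sketch of how dynamic graph attention might reweight object nodes given a question embedding. All class and parameter names here are hypothetical illustrations under assumed tensor shapes, not the authors' STCT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedGraphAttention(nn.Module):
    """Illustrative sketch (hypothetical names, not the paper's exact code):
    score each visual object node against a pooled question embedding and
    reweight the node features by the resulting attention distribution."""

    def __init__(self, obj_dim: int, q_dim: int, hidden_dim: int):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)  # project object nodes
        self.q_proj = nn.Linear(q_dim, hidden_dim)      # project question
        self.score = nn.Linear(hidden_dim, 1)           # per-node relevance

    def forward(self, obj_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, T, N, obj_dim) object nodes per frame (assumed layout)
        # q_feat:    (B, q_dim) pooled question embedding
        B = obj_feats.size(0)
        h = self.obj_proj(obj_feats)                        # (B, T, N, H)
        q = self.q_proj(q_feat).view(B, 1, 1, -1)           # broadcast over T, N
        logits = self.score(torch.tanh(h + q)).squeeze(-1)  # (B, T, N)
        attn = F.softmax(logits, dim=-1)                    # weights over objects per frame
        return obj_feats * attn.unsqueeze(-1)               # question-reweighted nodes

if __name__ == "__main__":
    layer = QuestionGuidedGraphAttention(obj_dim=256, q_dim=300, hidden_dim=128)
    objs = torch.randn(2, 8, 10, 256)  # 2 clips, 8 frames, 10 objects each
    ques = torch.randn(2, 300)         # 2 question embeddings
    print(layer(objs, ques).shape)     # torch.Size([2, 8, 10, 256])
```

In this sketch the softmax is taken over the objects within each frame, so the question redistributes attention among a frame's nodes before any spatio-temporal graph convolution; the paper's actual mechanism may condition the attention differently.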
Keywords