IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
RSMoDM: Multimodal Momentum Distillation Model for Remote Sensing Visual Question Answering
Abstract
Remote sensing (RS) visual question answering (VQA) aims to answer questions about a given RS image by jointly exploiting visual and textual information. However, existing RS VQA methods overlook the fact that the ground truths in RS VQA benchmark datasets are generated algorithmically rather than annotated manually, and therefore may not always represent the most reasonable answers to the questions. In this article, we propose a multimodal momentum distillation model (RSMoDM) for RS VQA tasks. Specifically, we maintain a momentum model during training that generates stable and reliable pseudolabels for additional supervision, effectively preventing the model from being penalized for producing reasonable answers that differ from the ground truth. Additionally, to address domain shift in RS, we employ a Vision Transformer (ViT) trained on a large-scale RS dataset for enhanced image feature extraction. Moreover, we introduce a multimodal fusion module with cross-attention for improved cross-modal representation learning. Extensive experiments on three RS VQA datasets demonstrate that RSMoDM achieves state-of-the-art performance, particularly excelling in scenarios with limited training data. The strong interpretability of our method is further evidenced by visualized attention maps.
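To make the momentum distillation idea concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: a "teacher" copy of the network is updated as an exponential moving average (EMA) of the student's weights and produces soft pseudolabels, and the training loss mixes the usual cross-entropy on the ground truth with a KL term toward those pseudolabels. All identifiers here (ToyVQAModel, ema_update, momentum=0.995, alpha=0.4) are hypothetical placeholders.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQAModel(nn.Module):
    """Stand-in for the full VQA network: fuses image and question
    features and predicts an answer over a fixed vocabulary."""
    def __init__(self, dim=64, n_answers=100):
        super().__init__()
        self.head = nn.Linear(2 * dim, n_answers)

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

student = ToyVQAModel()
teacher = copy.deepcopy(student).eval()   # momentum (teacher) model
for p in teacher.parameters():
    p.requires_grad_(False)               # updated only via EMA, never by gradients

@torch.no_grad()
def ema_update(momentum=0.995):
    """Teacher weights track an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def momentum_distillation_loss(img_feat, txt_feat, labels, alpha=0.4):
    """Mix hard cross-entropy on the (auto-generated) ground truth with a
    KL term toward the teacher's soft pseudolabels, so plausible answers
    that differ from the ground truth are not fully penalized."""
    student_logits = student(img_feat, txt_feat)
    with torch.no_grad():
        pseudolabels = F.softmax(teacher(img_feat, txt_feat), dim=-1)
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1), pseudolabels,
                    reduction="batchmean")
    return (1 - alpha) * hard + alpha * soft

# One training step on dummy data:
img, txt = torch.randn(8, 64), torch.randn(8, 64)
labels = torch.randint(0, 100, (8,))
momentum_distillation_loss(img, txt, labels).backward()
ema_update()
```

Because the teacher evolves slowly, its pseudolabels remain stable across training steps, which is what provides the reliable auxiliary supervision signal described above.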
Keywords