IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
PERS: Parameter-Efficient Multimodal Transfer Learning for Remote Sensing Visual Question Answering
Abstract
Remote sensing (RS) visual question answering (RSVQA) provides accurate answers by analyzing RS images (RSIs) and their associated questions. Recent research has increasingly adopted transformers for feature extraction, but this trend escalates training costs as model sizes grow. Furthermore, existing studies predominantly employ transformers to extract features from a single modality, integrating multimodal information insufficiently and thereby undermining the potential advantages of transformers for feature extraction and fusion in these scenarios. To address these challenges, we propose PERS, a parameter-efficient multimodal transfer learning method for RSVQA. We introduce a lightweight, parameter-efficient adapter into the visual feature extraction module, initialized with weights pretrained on large-scale RSIs, reducing both training costs and the number of trainable parameters. A cross-attention mechanism handles multimodal interaction, strengthening the integration of information across modalities. Comprehensive experiments on three datasets (RSVQA-LR, RSVQA-HR, and RSVQAxBEN) show that our method achieves state-of-the-art performance. Moreover, exhaustive ablation studies demonstrate that our parameter-efficient adapter strategy matches full-parameter training while updating only a fraction of the parameters, validating the efficacy of our approach.
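To make the two components named in the abstract concrete, below is a minimal PyTorch sketch of (a) a bottleneck adapter inserted alongside a frozen backbone and (b) cross-attention fusion where question tokens attend to visual tokens. The class names, the 768-dim token width, the bottleneck size, and the residual placement are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, nonlinearity, up-project, residual add.
    Only these few parameters are trained; the frozen pretrained backbone is untouched.
    (Sketch under assumed dimensions, not the paper's exact module.)"""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class CrossModalFusion(nn.Module):
    """Cross-attention: question tokens (queries) attend to visual tokens (keys/values)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=text, key=vision, value=vision)
        return self.norm(text + fused)

# Toy usage with random features standing in for frozen backbone outputs.
vision_feats = torch.randn(2, 196, 768)   # e.g., ViT patch tokens
text_feats = torch.randn(2, 20, 768)      # e.g., question tokens
adapter = BottleneckAdapter(768)
fusion = CrossModalFusion(768)
out = fusion(text_feats, adapter(vision_feats))
print(out.shape)  # torch.Size([2, 20, 768])
```

In this parameter-efficiency pattern, only the adapter and fusion weights would receive gradients; the backbone that produced the visual and textual tokens stays frozen, which is what keeps the trainable-parameter count small.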
Keywords