IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
PERS: Parameter-Efficient Multimodal Transfer Learning for Remote Sensing Visual Question Answering
Abstract
Remote sensing (RS) visual question answering (RSVQA) provides accurate answers by analyzing RS images (RSIs) and their associated questions. Recent research has increasingly adopted transformers for feature extraction, but this trend escalates training costs as model sizes grow. Furthermore, existing studies predominantly employ transformers to extract features from a single modality, integrating multimodal information insufficiently and thereby undermining the potential advantages of transformers for feature extraction and fusion in these scenarios. To address these challenges, we propose PERS, a parameter-efficient multimodal transfer learning method for RSVQA. We introduce a lightweight, parameter-efficient adapter into the visual feature extraction module, initialized with weights pretrained on large-scale RSIs, reducing both training costs and the number of trainable parameters. A cross-attention mechanism handles multimodal interaction, strengthening the integration of information across modalities. Comprehensive experiments on three datasets (RSVQA-LR, RSVQA-HR, and RSVQAxBEN) show that our method achieves state-of-the-art performance. Moreover, exhaustive ablation studies demonstrate that our parameter-efficient adapter strategy matches full-parameter training while updating only a fraction of the parameters, validating the efficacy of our approach.
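To make the two components named in the abstract concrete, below is a minimal PyTorch sketch of (a) a bottleneck adapter inserted alongside a frozen backbone and (b) cross-attention fusion where question tokens attend to visual tokens. The class names, the 768-dim token width, the bottleneck size, and the residual placement are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, nonlinearity, up-project, residual add.
    Only these few parameters are trained; the frozen pretrained backbone is untouched.
    (Sketch under assumed dimensions, not the paper's exact module.)"""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class CrossModalFusion(nn.Module):
    """Cross-attention: question tokens (queries) attend to visual tokens (keys/values)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=text, key=vision, value=vision)
        return self.norm(text + fused)

# Toy usage with random features standing in for frozen backbone outputs.
vision_feats = torch.randn(2, 196, 768)   # e.g., ViT patch tokens
text_feats = torch.randn(2, 20, 768)      # e.g., question tokens
adapter = BottleneckAdapter(768)
fusion = CrossModalFusion(768)
out = fusion(text_feats, adapter(vision_feats))
print(out.shape)  # torch.Size([2, 20, 768])
```

In this parameter-efficiency pattern, only the adapter and fusion weights would receive gradients; the backbone that produced the visual and textual tokens stays frozen, which is what keeps the trainable-parameter count small.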
Keywords