Remote Sensing (Apr 2024)

RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

  • Yakoub Bazi
  • Laila Bashmal
  • Mohamad Mahmoud Al Rahhal
  • Riccardo Ricci
  • Farid Melgani

DOI: https://doi.org/10.3390/rs16091477
Journal volume & issue: Vol. 16, no. 9, p. 1477

Abstract

In this paper, we investigate the application of large language models (LLMs) and their extension, large vision-language models (LVLMs), to remote sensing (RS) image analysis, with an emphasis on their multi-tasking potential for image captioning and visual question answering (VQA). Specifically, we introduce an improved version of the Large Language and Vision Assistant (LLaVA) model, adapted to RS imagery through a low-rank adaptation (LoRA) approach. To evaluate model performance, we create the RS-instructions dataset, a comprehensive benchmark that integrates four diverse single-task captioning and VQA datasets. The experimental results confirm the model's effectiveness, marking a step forward toward the development of efficient multi-task models for RS image analysis.
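
The low-rank adaptation mentioned above freezes the pretrained weights W0 of selected layers and trains only a small pair of rank-r matrices B and A per layer, so the adapted weight becomes W = W0 + BA with r much smaller than the layer dimensions. As a rough illustration of what such an adaptation looks like in practice, the following minimal sketch uses the Hugging Face PEFT library; the base checkpoint, rank, scaling factor, and target modules are illustrative assumptions, not the configuration reported in the paper.

    # Minimal, hypothetical LoRA setup with Hugging Face PEFT.
    # Checkpoint, rank, and target modules are illustrative assumptions,
    # not values taken from the RS-LLaVA paper.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Load a pretrained language backbone (placeholder checkpoint).
    model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

    # LoRA keeps W0 frozen and learns a low-rank update BA, scaled by
    # alpha / r, so the effective weight is W = W0 + (alpha / r) * B @ A.
    config = LoraConfig(
        r=16,                                # low-rank dimension r (assumed)
        lora_alpha=32,                       # scaling factor alpha (assumed)
        target_modules=["q_proj", "v_proj"], # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # only the A/B matrices are trainable

The practical appeal, and the likely motivation here, is parameter efficiency: only the low-rank matrices are updated during fine-tuning, which makes adapting a large vision-language model to a new domain such as RS imagery far cheaper than full fine-tuning.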

Keywords