Contrastive training of a multimodal encoder for medical visual question answering

João Daniel Silva; Bruno Martins; João Magalhães

Intelligent Systems with Applications (May 2023)

Affiliations

João Daniel Silva: INESC-ID and Instituto Superior Tecnico, University of Lisbon, Lisbon, Portugal; Corresponding author.
Bruno Martins: INESC-ID and Instituto Superior Tecnico, University of Lisbon, Lisbon, Portugal
João Magalhães: Faculty of Science and Technology, Universidade NOVA de Lisboa, Lisbon, Portugal

Journal volume & issue: Vol. 18, p. 200221

Abstract

Models for Visual Question Answering (VQA) on medical images aim to answer diagnostically relevant natural language questions based on visual content. In this article, we propose a novel approach to this problem that combines a strong image encoder based on EfficientNetV2 with a multimodal encoder based on the RealFormer architecture. Our model is pre-trained through a strategy that includes a contrastive objective, and the final fine-tuning on the VQA task uses a loss function that specifically addresses class imbalance. The experimental results confirm the effectiveness of our approach on the VQA-Med dataset from ImageCLEF 2019, showcasing the potential benefits of combining multimodal pre-training with recent advances in neural network architectures.
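
The abstract mentions a contrastive pre-training objective but does not specify its form. As a rough illustration only, the sketch below shows one common choice, a symmetric InfoNCE-style loss over paired image and text embeddings. It is an assumption and not the authors' implementation: the function names, the temperature value of 0.07, and the use of plain Python lists in place of encoder outputs are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative assumption, not the paper's code).

    image_embs[k] and text_embs[k] are treated as a matching pair; every
    other combination in the batch acts as a negative example.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)
```

Under these assumptions, a batch whose image and text embeddings line up pair by pair yields a loss near zero, while a batch with swapped pairs yields a large loss, which is the signal that pulls matching image and question representations together during pre-training.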

Published in Intelligent Systems with Applications

ISSN: 2667-3053 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General): Cybernetics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/intelligent-systems-with-applications
